Processing systems implement many applications that operate on sparse datasets that are subsets of (or representative of) a much larger complete dataset. For example, a volumetric flow simulation of smoke rising from a match in a large space can be represented with a sparse dataset including cells that represent the small volume of the large space that is occupied by the smoke. The number of cells in the sparse dataset may increase as the smoke diffuses from a small region near the match into the large space and consequently occupies an expanding volume in the space. For another example, propagation of light through a large space is often represented by a sparse dataset that includes cells that represent an illuminated volume within the large space. For yet another example, textures used for graphics rendering are stored as a mipmap with multiple levels of detail or are generated on the fly. In either case, texture information is only applied to surfaces of objects in a scene that are visible from the point of view of a “camera” that represents the location of a viewer of the scene during the current frame. Thus, textures are only generated, or retrieved from a remote memory and stored in a local memory, for a sparse subset of surfaces, levels of detail, or other characteristics that define the textures that are applied to the visible surfaces. For yet another example, visualization systems such as flight simulators or marine simulators may consume or generate a terrain representation that includes mega-scale and giga-scale partially-resident textures (PRTs), which are locally stored portions of a complete set of textures that is stored in a remote memory.
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
Virtual memory systems are used to allocate physical memory locations, such as pages in a local memory, to the sparse datasets that are currently being used by the processing system instead of allocating local memory to the complete dataset. For example, a central processing unit (CPU) in the processing system manages a virtual memory page table implemented in a graphics processing unit (GPU) in the processing system. The CPU allocates pages in the local memory to a sparse dataset that stores texture data that is expected to be used to render a scene in one or more subsequent frames. The CPU configures the virtual memory page table to map a subset of the virtual memory addresses used by applications running on the GPU to the allocated physical pages of the local memory. A cache hierarchy may be used to cache recently accessed or frequently accessed pages of the local memory. Compute units in the GPU are then able to access the sparse dataset using memory access requests that include the virtual memory addresses. The memory access requests may be sent to address translation caches associated with the cache hierarchy and the local memory. The address translation caches store frequently accessed mappings of the virtual memory addresses to physical memory addresses. Since the dataset stored in the local memory (and the corresponding caches) is sparse, in some variations the compute units in the GPU generate memory access requests to virtual memory addresses that are not mapped to pages of the local memory. In a conventional processing system, a memory access request to an unmapped virtual memory address results in a page fault, which typically causes a very high latency interrupt in processing.
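For illustration, the conventional behavior can be sketched as follows (a minimal C++ sketch with hypothetical types; the map-based page table and the exception stand in for hardware page-table walks and the fault interrupt, and are not part of the disclosed design):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the fault interrupt raised by a conventional system.
struct PageFaultError : std::runtime_error {
    explicit PageFaultError(uint64_t vpage)
        : std::runtime_error("page fault at virtual page " + std::to_string(vpage)) {}
};

// Virtual page number -> physical page number, as cached by an address translation cache.
using PageTable = std::unordered_map<uint64_t, uint64_t>;

// Conventional path: an unmapped virtual address produces a page fault,
// which typically causes a very high latency interrupt in processing.
uint64_t translate_or_fault(const PageTable& table, uint64_t vpage) {
    auto it = table.find(vpage);
    if (it == table.end()) {
        throw PageFaultError(vpage);
    }
    return it->second;
}
```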
Conventional processing systems implement different techniques for recovering from page faults. For example, CPUs that implement full virtual memory subsystems use “fault-and-switch” techniques to stall the thread that generated the page fault while the requested page is generated or retrieved from a remote memory. In addition, the local memory must be configured to store the requested page, e.g., by rearranging previously stored memory pages to provide space for the requested page. Stalling the thread may also include preempting the thread to allow another thread to execute on the processor core. Fault-and-switch techniques therefore often introduce unacceptably long delays for stalled threads in heavily parallel workloads or massively deep, fixed-function pipelines such as graphics processing pipelines implemented in GPUs. In order to avoid page faults, some conventional processing systems populate the sparse datasets in the local memory ahead of time (e.g., one or more frames prior to the frame to be rendered using the sparse dataset) using conservative assumptions that require speculatively generating or retrieving larger amounts of data that may or may not be accessed by the workloads. Typically, much of the pre-populated sparse dataset is never used by the workload. Additional system resources are consumed by fallbacks that are created to handle incorrect predictions of the required data and by blending that is used to hide “popping” that occurs if the requested data becomes available after a relatively long latency. Furthermore, large latencies are introduced when virtual pages are remapped to physical pages in response to changes in the sparse dataset that is stored in the local memory.
The long latencies incurred by moving portions of a dataset from a remote memory to a local memory implemented by a processing unit such as a graphics processing unit (GPU) can be reduced by allocating one or more physical pages in the local memory to a free page pool associated with an application executing on another processing unit such as a central processing unit (CPU). Physical pages in the free page pool are mapped to a virtual address in response to a memory access request that would otherwise cause a page fault. For example, a physical page in the free page pool is mapped to a virtual address in a write command that is used to write data to the virtual address. The physical page is initialized (e.g., to all zeros so that the data in the physical page is in a known state) and the write command writes the data to the physical page that has been mapped to the virtual address. For another example, if a read command attempts to read texture data at a virtual address that is not mapped to a physical address in the local memory, the GPU spawns a process to compute the requested texture data locally and writes the requested texture data to a physical page from the free page pool that is mapped to the virtual address in the read command. Physical pages can be unmapped and returned to the free page pool, e.g., based on information such as access bits or dirty bits that indicate how frequently the physical pages are being utilized. Mapping of the physical pages in the free page pool to virtual addresses, initialization of the physical pages, or unmapping of the physical pages and returning the physical pages to the free page pool is performed by hardware, firmware, or software in the GPU instead of requiring the application (or an associated kernel mode driver) implemented by the CPU to allocate or deallocate physical pages in the local memory.
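The shape of this technique can be sketched as follows (C++; the container-based state is a hypothetical software model of state that, per the disclosure, is managed by hardware, firmware, or software in the GPU):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hypothetical model of the on-demand pager: a pool of pre-allocated physical
// pages that the GPU maps on demand, without a round trip to the CPU-side
// application or kernel mode driver.
struct OnDemandPager {
    std::deque<uint64_t> free_page_pool;                // physical pages awaiting use
    std::unordered_map<uint64_t, uint64_t> page_table;  // virtual page -> physical page

    // Called on a translation miss instead of raising a page fault.
    // Assumes the pool is non-empty; a real system would reclaim pages first.
    uint64_t map_on_miss(uint64_t vpage) {
        uint64_t ppage = free_page_pool.front();
        free_page_pool.pop_front();
        page_table[vpage] = ppage;
        return ppage;
    }

    // Called when access/dirty bits indicate a page is cold: unmap it and
    // return the physical page to the free page pool for reuse.
    void unmap_and_reclaim(uint64_t vpage) {
        auto it = page_table.find(vpage);
        if (it == page_table.end()) return;
        free_page_pool.push_back(it->second);
        page_table.erase(it);
    }
};
```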
The processing device 105 includes a memory controller (MC) 115 that is used to coordinate the flow of data between the processing device 105 and the DRAM 110 over a memory interface 120. The memory controller 115 includes logic used to control reading information from the DRAM 110 and writing information to the DRAM 110. The processing units 111-114 communicate with each other, with the memory controller 115, or with other entities in the processing system 100 using a bus 125. For example, the processing units 111-114 can include a physical layer interface or bus interface for asserting signals onto the bus 125 and receiving signals from the bus 125 that are addressed to the corresponding processing unit 111-114. Some embodiments of the processing device 105 also include one or more interface blocks or bridges such as a northbridge or a southbridge for facilitating communication between entities in the processing device 105.
The processing device 105 implements an operating system (OS) 130. Although a single instance of the OS 130 is shown in
Some embodiments of the processing device 105 perform graphics processing to render scenes represented by a 3-D model to generate images for display on a screen 145. For example, the DRAM 110 stores a dataset including information representative of the 3-D model. However, in some cases a latency for conveying information between the GPU 114 and the DRAM 110, e.g., via the bus 125 or the memory interface 120, is too large to allow the GPU 114 to render images at a sufficiently high rate to provide a smooth viewing experience for a user.
A local memory system 150 is connected to the GPU 114 by an interconnect that does not include the interface 120 or the bus 125. Consequently, a latency for accessing information stored in the local memory system 150 is lower than a latency for accessing information stored in the DRAM 110. For example, the latency between the GPU 114 and the local memory system 150 is low enough to allow the GPU 114 to render images at a sufficiently high rate to provide a smooth viewing experience for the user. The local memory system 150 stores a subset of the dataset stored in the DRAM 110. For example, the local memory system 150 may store a sparse dataset that includes (or is representative of) the subset of the dataset stored in the DRAM 110. The subset that is stored in the local memory system 150 is retrieved or copied from the DRAM 110, or the information in the subset is generated in response to memory access requests, as discussed herein. The GPU 114 uses the information stored in the local memory system 150 to render portions of the scene to generate an image for display on the screen 145. The GPU 114 transmits information representative of the rendered images to the screen 145 via the bus 125. Although not shown in
The local memory system 150 implements virtual addressing so that memory access requests from the GPU 114 refer to virtual addresses, which are translated to physical addresses of physical pages in the local memory system 150 or the DRAM 110. Memory access requests including virtual addresses are provided to the local memory system 150 by the GPU 114. The local memory system 150 determines whether the virtual address is mapped to a physical page in the local memory system 150 or a physical page in the DRAM 110. If the virtual address is mapped to a physical page in the local memory system 150, the local memory system 150 grants access to the physical page, e.g., to read information from the physical page or write information to the physical page based on the virtual address. However, if the virtual address is not mapped to a physical page in the local memory system 150, the local memory system 150 maps a physical page from a free page pool implemented in the local memory system 150 to the virtual address in the memory access request and grants access to the physical page that is mapped to the virtual address. Thus, the local memory system 150 avoids causing a page fault, which would typically occur if the GPU 114 attempted to access a virtual address that was not mapped to a physical page in the local memory system 150.
The local memory system 210 is used to store a subset of a complete dataset that is stored in a remote memory such as the DRAM 110 shown in
A page table 250 is included in the GPU 205 and used to store mappings of virtual addresses to physical addresses of physical pages in the local memory system 210 or a remote memory such as the DRAM 110 shown in
Some embodiments of the local memory system 210 also implement a cache hierarchy (not shown in
A command processor 265 receives commands from the application 215 and executes the commands using the resources of the GPU 205. The commands include draw commands and compute dispatch commands, which cause the command processor 265 to generate memory access requests. For example, the command processor 265 can receive a command from the application 215 instructing the GPU 205 to write information to a memory location indicated by a virtual address included in the write command. For another example, the command processor 265 can receive a command from the application 215 instructing the GPU 205 to read information from a memory location indicated by a virtual address included in the read command. The application 215 is also configured to coordinate operation with the kernel mode driver 220 to allocate memory in the local store 240, as well as to add or monitor physical pages in the free page pool 245.
As the commands are executed, the graphics engines 225 translate virtual addresses to physical addresses using the address translation buffers 260-264. However, in some cases the address translation buffers 260-264 do not include a mapping for a virtual address, in which case a memory access request to the virtual address misses in the address translation buffers 260-264. The address translation buffer 264 is therefore configured to add physical pages from the free page pool 245 to the local store 240 and map each added physical page to a corresponding virtual address. For example, the address translation buffer 264 pulls a free physical page from the free page pool 245 when a dynamically allocated surface (e.g., a surface generated by the geometry engine 230) touches a page of virtual memory that has not yet been mapped to a physical page. The address translation buffer 264 then updates the page table 250 with the new virtual-to-physical mapping and removes the information indicating the physical page from the free page table 255. Some embodiments of the address translation buffer 264 are configured to serialize and manage the free page pool 245 and update the page table 250 concurrently with mapping the physical pages from the free page pool 245 to virtual addresses. Some embodiments of the page table 250 also store access bits and dirty bits associated with each of the allocated physical pages. The application 215 or the kernel mode driver 220 is able to use the values of the access bits or the dirty bits to select physical pages that are available to be reclaimed and added to the free page pool 245, e.g., by unmapping the virtual-to-physical address mapping and updating the page table 250 and the free page table 255 accordingly.
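As a rough software model of this bookkeeping (C++; the mutex models the serialization performed by the address translation buffer, and all names are illustrative):

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical page-table entry carrying the access and dirty bits noted above.
struct PageTableEntry {
    uint64_t ppage = 0;
    bool accessed = false;  // consulted when selecting pages to reclaim
    bool dirty = false;
};

struct TranslationState {
    std::mutex lock;  // models serialized management of the pool and page table
    std::vector<uint64_t> free_page_table;                    // pages in the free page pool
    std::unordered_map<uint64_t, PageTableEntry> page_table;  // virtual page -> entry
};

// On a translation miss for a dynamically allocated surface: pull a free page,
// install the new virtual-to-physical mapping, and drop the page from the free list.
bool handle_atb_miss(TranslationState& st, uint64_t vpage) {
    std::lock_guard<std::mutex> guard(st.lock);
    if (st.free_page_table.empty()) {
        return false;  // pool exhausted; pages must be reclaimed before mapping
    }
    PageTableEntry entry;
    entry.ppage = st.free_page_table.back();
    st.free_page_table.pop_back();  // remove the page from the free page table
    st.page_table[vpage] = entry;   // update the page table with the new mapping
    return true;
}
```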
The local memory 305 includes a local store 320 of physical pages 325, which are allocated to applications such as the application 215 shown in
Mapped physical pages can also be reclaimed from the local store 320 and returned to the free page pool 330, as indicated by the arrow 345. For example, an application or a kernel mode driver such as the application 135 and the kernel mode driver 140 shown in
The local memory 410 includes a local store 420 of physical pages that are addressed by physical addresses such as PADDR_1, PADDR_2, PADDR_3, and PADDR_M. The local memory 410 also includes a free page pool 425 of physical pages addressed by physical addresses such as PADDR_X, PADDR_Y, and PADDR_Z. The physical pages and the corresponding physical addresses in the local store 420 and the free page pool 425 may or may not be contiguous with each other. Although not shown in
The address translation buffer 405 indicates mappings of virtual addresses to the physical addresses in the local memory 410 and the remote memory 415. The address translation buffer 405 includes a set of virtual addresses (VADDR_1, VADDR_2, VADDR_3, and VADDR_N) and corresponding pointers that indicate the physical addresses that are mapped to the virtual addresses. For example, the address translation buffer 405 indicates that the virtual address VADDR_1 is mapped to the physical address PADDR_1 in the local store 420, the virtual address VADDR_3 is mapped to the physical address PADDR_M in the local store 420, and the virtual address VADDR_N is mapped to the physical address PADDR_3 in the local store 420. The address translation buffer 405 also indicates that the virtual address VADDR_2 is mapped to a physical address in the remote memory 415. In some variations, some of the virtual addresses in the address translation buffer 405 are not mapped to a physical address.
Memory access requests that include the virtual addresses VADDR_1, VADDR_3, and VADDR_N will hit in the address translation buffer 405 because these virtual addresses are mapped to physical addresses in the local memory 410. Memory access requests that include the virtual address VADDR_2 will miss in the address translation buffer 405 because this virtual address is mapped to a physical address in the remote memory 415. A miss in the address translation buffer 405 would lead to a page fault in a conventional processing system. However, the processing system that includes the portion 400 is configured to allocate a physical page from the free page pool 425 and map the physical page to the virtual address indicated in the address translation buffer 405 instead of causing a page fault. Physical pages from the free page pool 425 can also be added to the local store 420 in response to a miss in the address translation buffer 405 that occurs because the corresponding virtual address is not mapped to any physical address.
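The hit/miss behavior just described can be captured in a small sketch (C++; the numeric addresses are hypothetical placeholders for VADDR_1, PADDR_1, and so on):

```cpp
#include <cstdint>

// Where a translation points: a local physical page, a remote page, or nowhere.
enum class Placement { Local, Remote, Unmapped };

struct Translation {
    Placement placement = Placement::Unmapped;
    uint64_t paddr = 0;  // meaningful only when placement != Unmapped
};

struct AtbEntry {
    uint64_t vaddr;
    Translation translation;
};

// Example entries mirroring the description: VADDR_1, VADDR_3, and VADDR_N map
// to local physical pages, while VADDR_2 maps to the remote memory.
const AtbEntry kExampleAtb[] = {
    {0x1000, {Placement::Local,  0xA000}},  // VADDR_1 -> PADDR_1 (hit)
    {0x2000, {Placement::Remote, 0xF000}},  // VADDR_2 -> remote  (miss)
    {0x3000, {Placement::Local,  0xD000}},  // VADDR_3 -> PADDR_M (hit)
    {0x4000, {Placement::Local,  0xB000}},  // VADDR_N -> PADDR_3 (hit)
};

// A request hits only when its translation resolves to the local memory;
// remote or unmapped translations miss and are served from the free page pool.
inline bool hits_in_local_memory(const Translation& t) {
    return t.placement == Placement::Local;
}
```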
At block 610, a write command is received that includes a virtual address indicating a location that is to be written by the write command. At decision block 615, the address translation buffer determines whether the virtual address is mapped to a physical address of a physical page in the local memory. If so, the address translation buffer translates the virtual address to the physical address that indicates the physical page in the local memory at block 620. The information indicated by the write command is then written (at block 625) to the physical page indicated by the physical address that corresponds to the virtual address in the write command. If the virtual address is not mapped to a physical address of a physical page in the local memory, the method flows to block 630.
At block 630, a physical address of a physical page in the free page pool is mapped to the virtual address. The physical page is therefore removed from the free page pool and added to the local store. In some embodiments, mapping the physical address to the virtual address includes updating page tables, free page tables, and other address translation buffers to reflect the mapping of the physical page to the virtual address. At block 635, the physical page is initialized to a known state such as all zeros. At block 640, the write command writes information to the physical page based on the virtual address. In some embodiments, the information that is written to the physical page is propagated to other memories or caches based on memory or cache coherence protocols.
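A compact sketch of this write path follows (C++; the helper state and names are illustrative, comments are keyed to the numbered blocks of the method, and the free page pool is assumed non-empty):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

constexpr size_t kPageSize = 4096;

struct Memory {
    std::vector<uint8_t> pages;                         // backing store for physical pages
    std::deque<uint64_t> free_pool;                     // free page pool
    std::unordered_map<uint64_t, uint64_t> page_table;  // virtual page -> physical page
};

void handle_write(Memory& mem, uint64_t vaddr, const std::vector<uint8_t>& data) {
    uint64_t vpage = vaddr / kPageSize;              // block 610: write command received
    auto it = mem.page_table.find(vpage);            // block 615: is the address mapped?
    uint64_t ppage;
    if (it != mem.page_table.end()) {
        ppage = it->second;                          // block 620: translate to physical
    } else {
        ppage = mem.free_pool.front();               // block 630: map a page from the pool
        mem.free_pool.pop_front();
        mem.page_table[vpage] = ppage;
        std::fill_n(mem.pages.data() + ppage * kPageSize,
                    kPageSize, 0);                   // block 635: initialize to known state
    }
    uint64_t offset = ppage * kPageSize + vaddr % kPageSize;
    std::copy(data.begin(), data.end(),
              mem.pages.data() + offset);            // blocks 625/640: write the data
}
```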
Implementing on-demand allocation of physical pages from the free page pool in response to write commands improves the speed and efficiency of the processing system. For example, applications such as physical simulations or games frequently determine which physical pages need to be written only during execution of the application, as the physical pages are being written. On-demand allocation of the physical pages removes the need to perform a pre-pass to estimate the number and addresses of the physical pages that could potentially be written in the future. This approach also conserves memory by removing the need to conservatively estimate the number of physical pages that could potentially be written by the application during a particular time interval, an estimate that typically leads to numerous physical pages being loaded into the local store and mapped to virtual addresses but never accessed.
At block 710, a read command is received that includes a virtual address indicating a location that is to be read by the read command. At decision block 715, the address translation buffer determines whether the virtual address is mapped to a physical address of a physical page in the local memory. If so, the address translation buffer translates the virtual address to the physical address that indicates the physical page in the local memory at block 720. The information is then read (at block 725) from the physical page indicated by the physical address that corresponds to the virtual address in the read command. If the virtual address is not mapped to a physical address of a physical page in the local memory, the method flows to block 730.
At block 730, a new local compute process is spawned to generate the data that is to be read. Some embodiments of the process are executed concurrently or in parallel with the current process that includes the read command. For example, if the read command is being executed by one graphics engine implemented by the GPU, the newly spawned process is executed by another graphics engine implemented by the GPU.
At block 735, a physical address of a physical page in the free page pool is mapped to the virtual address in the read command. The physical page is therefore removed from the free page pool and added to the local store. In some embodiments, mapping the physical address to the virtual address includes updating page tables, free page tables, and other address translation buffers to reflect the mapping of the physical page to the virtual address. At block 740, the spawned process writes the computed data to the physical page indicated by the virtual address. In some variations, an application such as the application 135 shown in
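A sketch of this read path follows (C++; std::async stands in for the concurrently spawned graphics-engine process, and the placeholder generator is purely illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <future>
#include <unordered_map>
#include <vector>

constexpr size_t kPageSize = 4096;

struct Memory {
    std::vector<uint8_t> pages;                         // backing store for physical pages
    std::deque<uint64_t> free_pool;                     // free page pool
    std::unordered_map<uint64_t, uint64_t> page_table;  // virtual page -> physical page
};

// Stand-in for the spawned compute process of block 730; in the disclosure this
// runs on another graphics engine, here it runs on another host thread.
void compute_texture(uint8_t* dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = static_cast<uint8_t>(i & 0xFF);  // placeholder data
}

uint8_t handle_read(Memory& mem, uint64_t vaddr) {
    uint64_t vpage = vaddr / kPageSize;              // block 710: read command received
    auto it = mem.page_table.find(vpage);            // block 715: is the address mapped?
    uint64_t ppage;
    if (it != mem.page_table.end()) {
        ppage = it->second;                          // block 720: translate to physical
    } else {
        ppage = mem.free_pool.front();               // block 735: map a page from the pool
        mem.free_pool.pop_front();
        mem.page_table[vpage] = ppage;
        // blocks 730/740: spawn a concurrent process that computes the data and
        // writes it to the newly mapped physical page.
        auto task = std::async(std::launch::async, compute_texture,
                               mem.pages.data() + ppage * kPageSize, kPageSize);
        task.wait();  // this read must observe the computed data
    }
    return mem.pages[ppage * kPageSize + vaddr % kPageSize];  // block 725: read
}
```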
Implementing on-demand allocation of physical pages from the free page pool in response to read commands (e.g., by spawning a concurrent process to write the requested information to a physical page from the free page pool) improves the speed and efficiency of the processing system. For example, in the case of a GPU that is performing rendering that requires reading sparse texture data that is generated by the GPU, only the physical pages that correspond to surfaces rendered by the GPU are generated, which reduces the consumed memory by eliminating the need to conservatively estimate the physical pages that need to be generated in advance. In addition, the physical pages that are generated and mapped to virtual addresses on-demand do not require interaction with or intervention by one or more CPUs such as the CPUs 111-113 shown in
At block 805, the application requests allocation of physical pages that are mapped to virtual memory associated with the application. The kernel mode driver receives the request and allocates the physical pages to the virtual memory. For example, the application can allocate physical memory by requesting a virtual memory allocation that is mapped to physical memory. The physical pages that have been allocated to the virtual memory can therefore be accessed using a corresponding range of (first) virtual addresses. At block 810, the application or the kernel mode driver initializes the physical pages in the local store using the first virtual addresses. In some variations, the application or the kernel mode driver initializes the physical pages to a known state, e.g., by initializing the physical pages to all zeros. Initializing the physical pages to a known state allows the application to write to only a small subset of a physical page after the physical page has been dynamically allocated from the free page pool because the remainder of the page has already been initialized to the known value.
At block 815, the application requests allocation of a second set of virtual addresses to a subset of the virtual memory. At this point in the method 800, the second set of virtual addresses is not yet mapped to physical memory. Thus, the application is not able to directly access physical pages using second virtual addresses in the second set.
At block 820, the application requests that the kernel mode driver add physical memory to the application's free page pool. Since the application has only been allocated the second set of virtual addresses, which are not yet mapped to physical memory, the application indicates the physical pages that are to be added to the free page pool using first virtual addresses in the first virtual address range that was allocated at block 805. The kernel mode driver translates the first virtual addresses into physical addresses of physical pages that are then added to the application's free page pool. Once the physical pages have been added to the free page pool, they are mapped to corresponding second virtual addresses in the second set of virtual addresses. In some embodiments, the number of physical pages that are initially allocated to the free page pool is predetermined, e.g., based upon analysis of previous allocations of physical pages to the free page pool. As discussed herein, physical pages can be pulled from the free page pool into the local store in response to read or write misses. Pulling a physical page from the free page pool into the local store is done by changing the virtual address of the physical page from the second virtual address that references the free page pool to a new virtual address that references the local store. Thus, the actual number of physical pages that are available in the free page pool fluctuates as physical pages are added to the local store or reclaimed from the local store, as discussed herein.
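The setup sequence of blocks 805-820 might look as follows in software (C++; the driver interface, its names, and the stub bodies are hypothetical illustrations, not an actual kernel mode driver API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kPageSize = 4096;

// Hypothetical kernel mode driver interface with trivial stub bodies;
// names and signatures are illustrative only.
struct KernelModeDriver {
    // Block 805: allocate physical pages reachable through a first virtual range.
    uint64_t AllocateMappedVirtualRange(size_t pages) {
        uint64_t va = next_va_;
        next_va_ += pages * kPageSize;
        return va;
    }
    // Block 815: reserve a second virtual range with no physical backing yet.
    uint64_t ReserveUnmappedVirtualRange(size_t pages) {
        return AllocateMappedVirtualRange(pages);  // stub: a real driver leaves it unmapped
    }
    // Block 820: translate first virtual addresses to physical pages and add
    // those pages to the application's free page pool.
    void AddToFreePagePool(uint64_t first_vaddr, size_t pages) {
        pool_pages_ += pages;
        (void)first_vaddr;
    }

    uint64_t next_va_ = 0x10000;
    size_t pool_pages_ = 0;
};

void set_up_free_page_pool(KernelModeDriver& kmd, size_t pool_pages) {
    uint64_t first_va = kmd.AllocateMappedVirtualRange(pool_pages);  // block 805

    // Block 810: initialize the pages to a known state (all zeros) through the
    // first virtual addresses; the actual stores are elided in this sketch.
    std::vector<uint8_t> zeros(kPageSize, 0);
    (void)zeros;

    uint64_t second_va = kmd.ReserveUnmappedVirtualRange(pool_pages);  // block 815
    (void)second_va;

    kmd.AddToFreePagePool(first_va, pool_pages);  // block 820
}
```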
At block 905, the application (or kernel mode driver) accesses information indicating a number of physical pages that are in the free page pool. For example, the processing system can maintain one or more registers that are incremented in response to adding physical pages to the free page pool, allocating physical pages to the free page pool, or reclaiming physical pages for the free page pool. The registers are decremented in response to pulling physical pages from the free page pool, mapping them to virtual addresses, and adding the mapped physical pages to a local store.
At decision block 910, the application (or kernel mode driver) compares the number of physical pages in the free page pool to a threshold value. If the number is greater than the threshold value, indicating that a sufficient number of physical pages are available in the free page pool, the application (or kernel mode driver) maintains (at block 915) the current mapping of virtual addresses to physical pages and does not reclaim any physical pages for the free page pool. If the number is less than the threshold value, the method flows to block 920.
At block 920, the application (or kernel mode driver) unmaps virtual addresses of one or more physical pages that are included in the local store in the local memory. Some embodiments of the application (or kernel mode driver) select physical pages in the local store for unmapping based on access bits or dirty bits included in the physical pages, as discussed herein. At block 930, the application (or kernel mode driver) adds the unmapped physical pages to the free page pool so that the unmapped physical pages are available for on-demand allocation.
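A sketch of this replenishment check follows (C++; the counter is modeled by the pool's size, and the threshold handling and bit aging are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical mapped-page record carrying the access and dirty bits used for selection.
struct MappedPage {
    uint64_t ppage = 0;
    bool accessed = false;
    bool dirty = false;
};

struct PoolState {
    std::vector<uint64_t> free_pool;                      // block 905: size acts as the counter
    std::unordered_map<uint64_t, MappedPage> page_table;  // mapped pages in the local store
};

// Blocks 905-930: if the pool runs low, unmap cold pages and return them to the pool.
void replenish_free_page_pool(PoolState& st, size_t threshold) {
    if (st.free_pool.size() > threshold) {
        return;  // blocks 910/915: enough free pages, keep current mappings
    }
    for (auto it = st.page_table.begin();
         it != st.page_table.end() && st.free_pool.size() <= threshold;) {
        if (!it->second.accessed && !it->second.dirty) {  // select by access/dirty bits
            st.free_pool.push_back(it->second.ppage);     // block 930: add to free page pool
            it = st.page_table.erase(it);                 // block 920: unmap virtual address
        } else {
            it->second.accessed = false;  // age the access bit for a future pass
            ++it;
        }
    }
}
```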
Physical pages can be reclaimed and added to the free page pool at any time. Pulling physical pages from the free page pool or adding them to the free page pool does not change the allocated memory or virtual addresses; it just changes the physical pages that are in the free page list. In some variations, only virtual addresses that are currently unmapped and flagged as wanting dynamic allocation are mapped by the hardware to pages in the free page pool. The application should not access pages in the free page pool using virtual address synonyms, such as the original direct virtual address; otherwise, incoherent results may occur. Also, the kernel mode driver should wait until all potential dynamic allocations have been suspended before changing or removing pages from its free page list.
In some embodiments, certain aspects of the techniques described above are implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software includes the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer readable storage medium includes, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM), or other volatile or non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer readable storage medium are implemented in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
A computer readable storage medium includes any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.