A platform for a conventional graphics processing system includes a central processing unit (CPU), a graphics processing unit (GPU), one or more system memories (such as a dynamic random access memory, DRAM), and a bus to support communication between these entities. In some cases, the platform is implemented as a system-on-a-chip (SoC). The CPU initiates graphics processing by issuing draw calls to the GPU. In response to receiving a draw call, the GPU renders images for display using a pipeline formed of a sequence of programmable shaders and fixed-function hardware blocks. The system memory in the conventional graphics processing system is partitioned into a first portion that is visible to a host operating system (OS) executing on the graphics processing system and a second portion that is dedicated to the GPU, e.g., to provide a frame buffer. The second portion, which is sometimes referred to as a carveout or a GPU carveout, is not visible to the host OS. A GPU virtual manager (VM), which is managed by a graphics device driver, translates the virtual addresses in memory access requests to physical addresses in the system memory such as physical addresses in the GPU carveout region of the system memory. In some cases, the GPU VM performs the address translation using a corresponding translation lookaside buffer (TLB) that caches frequently requested address translations from a page table.
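As a purely illustrative aid (not part of any claimed embodiment), the following C sketch models the first-layer translation described above: a small software TLB caches entries from a flat page table, and a miss falls back to the page table, e.g., one that maps virtual pages into a carveout region. The sizes, names, and carveout base address are hypothetical.

/* A minimal sketch of the first-layer translation: a GPU VM checks a small
 * TLB for a cached virtual-to-physical mapping and falls back to a flat page
 * table on a miss. Structure and names are illustrative only. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT   12u                 /* 4 KiB pages */
#define TLB_ENTRIES  16u                 /* tiny direct-mapped TLB */
#define NUM_PAGES    256u                /* size of the toy page table */

typedef struct { uint64_t vpn; uint64_t pfn; bool valid; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static uint64_t page_table[NUM_PAGES];   /* vpn -> pfn, e.g., into a carveout */

/* Translate a device-generated virtual address to a physical address. */
static uint64_t gpu_vm_translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1u);
    tlb_entry_t *e  = &tlb[vpn % TLB_ENTRIES];

    if (!(e->valid && e->vpn == vpn)) {  /* TLB miss: walk the page table */
        e->vpn   = vpn;
        e->pfn   = page_table[vpn % NUM_PAGES];
        e->valid = true;
    }
    return (e->pfn << PAGE_SHIFT) | offset;
}

int main(void)
{
    for (uint64_t i = 0; i < NUM_PAGES; ++i)
        page_table[i] = 0x80000u + i;    /* pretend the carveout starts at this frame */
    printf("0x%llx\n", (unsigned long long)gpu_vm_translate(0x1234));
    return 0;
}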
The present disclosure is better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
Changes in the security infrastructure and requirements driven by vendors of operating systems such as Microsoft Windows® are expected to impact the memory access performance of the processing system. For example, the size of the GPU carveout may be reduced to increase the amount of memory available for dynamic allocation to the GPU from the OS-controlled portion of the memory. For another example, virtualization-based security (VBS) provides memory protection against kernel mode malware by creating a secure partition in the system memory that is accessed using a first address translation layer managed by a device driver, e.g., using page tables and translation lookaside buffers (TLBs) to cache frequently requested address translations from the page tables. The page tables and TLBs are associated with a GPU virtual manager (VM). A second address translation layer used to access the secure partition is controlled by a hypervisor or secure OS. The first address translation layer presents a contiguous address space to the device and provides high translation performance. The second address translation layer handles physical memory management challenges such as memory fragmentation and access security. Consequently, the second address translation layer typically determines overall address translation performance. The second address translation layer is implemented in a system-wide input/output memory management unit (IOMMU) that supports address translation and system memory access protection on direct memory access (DMA) transfers from devices including the GPU and one or more peripheral devices.
In response to receiving a memory access request from the GPU, the first address translation layer translates a device-generated address in the memory access request to a domain physical address. The second address translation layer implemented in the IOMMU translates the domain physical address into a system physical address in the system memory. For example, the IOMMU assigns a domain context and a distinct set of page tables to each device in the processing system. When a device attempts to read or write system memory, the IOMMU intercepts the access and determines the domain context to which the device has been assigned. Permissions such as read, write, and execute are encoded into entries in the page tables and TLBs that are used to perform the second layer translation. The IOMMU therefore uses the TLB entries associated with the domain or the page tables associated with the device to determine whether the access is to be permitted and the location in system memory that is to be accessed. For example, in response to determining that the memory access request from the device is permitted, the IOMMU generates a physical address in the system memory from a domain physical address generated by the first address translation layer.
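The following C sketch, offered only as a simplified illustration, models the second-layer behavior: each device identifier is assigned a domain context, and the entries of the domain's page table encode read/write/execute permission bits that are checked before a system physical address is produced. The table sizes, field names, and the single-level page table are assumptions made for brevity.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PERM_R (1u << 0)
#define PERM_W (1u << 1)
#define PERM_X (1u << 2)

typedef struct { uint64_t spa_frame; uint32_t perms; bool present; } dom_pte_t;
typedef struct { dom_pte_t pages[64]; } domain_t;   /* one domain context */

/* device_id -> assigned domain context (illustrative tables) */
static domain_t domains[4];
static unsigned device_domain[8];

/* Second-layer translation: map a domain physical address to a system
 * physical address, or return false if the access is not permitted. */
static bool iommu_translate(unsigned device_id, uint64_t dpa,
                            uint32_t requested, uint64_t *spa)
{
    domain_t *dom  = &domains[device_domain[device_id]];
    uint64_t  dpn  = (dpa >> 12) % 64;
    dom_pte_t *pte = &dom->pages[dpn];

    if (!pte->present || (pte->perms & requested) != requested)
        return false;                                /* access denied */
    *spa = (pte->spa_frame << 12) | (dpa & 0xFFFu);
    return true;
}

int main(void)
{
    device_domain[1] = 2;                            /* device 1 -> domain 2 */
    domains[2].pages[3] = (dom_pte_t){ .spa_frame = 0x4000, .perms = PERM_R | PERM_W, .present = true };

    uint64_t spa;
    if (iommu_translate(1, (3u << 12) | 0x10, PERM_W, &spa))
        printf("write allowed at 0x%llx\n", (unsigned long long)spa);
    return 0;
}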
Funneling all memory access requests from peripheral devices and the GPU through the IOMMU leads to several problems. For example, the IOMMU provides service to real-time-dependent device client blocks such as video decoders, video encoders, and display framebuffer scanout circuitry, which have strict latency requirements. Performing page tablewalks for memory access requests from multiple entities at a single IOMMU introduces processing delays that increase latency. Moreover, a single IOMMU cannot be positioned near all of the peripheral devices and the GPU, so round trip times between some of the entities and the IOMMU further increase the processing latency at the IOMMU. Consequently, a single central IOMMU cannot service memory access requests from all of these entities within hard access deadlines, e.g., with low latency. A system of distinct and disparate IOMMUs could be deployed proximate the different devices or the GPU. However, providing programming support for device-specific IOMMUs requires different programming models in system software, complicating the host OS and other system software architectures that use the IOMMU as a software-targeted system device.
In response to receiving a memory access request including a domain physical address from a first translation layer, the primary IOMMU selectively performs an address translation of the domain physical address or bypasses the address translation based on the type of device that provided the memory access request. In some embodiments, the primary IOMMU performs address translations of domain physical addresses associated with memory access requests from the GPU by performing a page tablewalk using a first set of page tables and a first translation lookaside buffer (TLB) associated with the primary IOMMU. The primary IOMMU bypasses the address translations of domain physical addresses in memory access requests received from peripheral devices. Instead, the primary IOMMU provides the memory access requests to a secondary IOMMU associated with the peripheral device that provided the memory access request. The secondary IOMMU performs address translations of domain physical addresses by performing page tablewalks using a second set of page tables and a second TLB associated with the secondary IOMMU. Some embodiments of the primary IOMMU include (or are associated with) a command queue that receives commands associated with the primary and secondary IOMMUs. The command queue allows system software to initiate page tablewalks and device rescans, which are processed in the primary IOMMU or selectively forwarded to one of the secondary IOMMUs, as discussed above. The command queue also supports rescan and synchronization of system software with the peripheral devices to ensure that software does not modify table data that is currently in flight.
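A minimal C sketch of this translate-or-forward policy is shown below. The translation routines are placeholders standing in for the page tablewalks and TLB lookups performed by the primary and secondary IOMMUs, and the offsets, type names, and device identifiers are illustrative assumptions rather than details taken from the embodiments.

#include <stdint.h>
#include <stdio.h>

typedef enum { DEV_GPU, DEV_PERIPHERAL } dev_type_t;

typedef struct {
    dev_type_t type;      /* kind of device that issued the request */
    unsigned   dev_id;    /* used to pick the matching secondary IOMMU */
    uint64_t   dpa;       /* domain physical address from the first layer */
} mem_request_t;

/* Placeholder translation routines; real hardware would walk the page tables
 * and consult the TLB associated with each IOMMU. */
static uint64_t primary_translate(uint64_t dpa)   { return dpa + 0x100000u; }
static uint64_t secondary_translate(unsigned dev_id, uint64_t dpa)
{
    (void)dev_id;
    return dpa + 0x200000u;
}

/* Primary IOMMU: translate GPU requests locally, forward peripheral requests
 * to the secondary IOMMU located near the requesting device. */
static uint64_t primary_iommu_handle(const mem_request_t *req)
{
    if (req->type == DEV_GPU)
        return primary_translate(req->dpa);            /* first set of tables/TLB */
    return secondary_translate(req->dev_id, req->dpa); /* bypass and forward */
}

int main(void)
{
    mem_request_t gpu_req  = { DEV_GPU,        0, 0x1000 };
    mem_request_t disp_req = { DEV_PERIPHERAL, 3, 0x2000 };
    printf("gpu  -> 0x%llx\n", (unsigned long long)primary_iommu_handle(&gpu_req));
    printf("disp -> 0x%llx\n", (unsigned long long)primary_iommu_handle(&disp_req));
    return 0;
}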
The processing system 100 includes a graphics processing unit (GPU) 115 that renders images for presentation on a display 120. For example, the GPU 115 renders objects to produce values of pixels that are provided to the display 120, which uses the pixel values to display an image that represents the rendered objects. Some embodiments of the GPU 115 include multiple processing elements (not shown in
Some embodiments of the GPU 115 perform virtual-to-physical address translations using a GPU VM 116 and one or more corresponding TLBs 117 (only one TLB 117 is shown in
The processing system 100 also includes a central processing unit (CPU) 130 that implements multiple processing elements 131, 132, 133, which are collectively referred to herein as “the processing elements 131-133.” The processing elements 131-133 execute instructions concurrently or in parallel. The CPU 130 is connected to the bus 110 and communicates with the GPU 115 and the memory 105 via the bus 110. The CPU 130 executes instructions such as program code 135 stored in the memory 105 and the CPU 130 stores information in the memory 105 such as the results of the executed instructions. The CPU 130 is also able to initiate graphics processing by issuing draw calls to the GPU 115.
An input/output (I/O) engine 140 handles input or output operations associated with the display 120, as well as other elements of the processing system 100 such as keyboards, mice, printers, external disks, and the like. In the illustrated embodiment, the I/O engine 140 also handles input and output operations associated with a camera 145. The I/O engine 140 is coupled to the bus 110 so that the I/O engine 140 is able to communicate with the memory 105, the GPU 115, or the CPU 130. In the illustrated embodiment, the I/O engine 140 reads information stored on an external storage component 150, which is implemented using a non-transitory computer readable medium such as a compact disk (CD), a digital video disc (DVD), and the like. The I/O engine 140 also writes information to the external storage component 150, such as the results of processing by the GPU 115 or the CPU 130.
The processing system 100 includes a networked I/O memory management unit (IOMMU) 155 that includes a set of IOMMUs for processing memory access requests from devices such as the GPU 115 and peripheral devices including the display 120, the camera 145, and the external storage component 150. The memory access requests include a device-generated address such as a virtual address that is used to indicate a location in the system memory 105. Some embodiments of the networked IOMMU 155 receive memory access requests that include a domain physical address generated by a first address translation layer that is managed by a driver such as a graphics driver implemented by the GPU 115. For example, the first address translation layer can include the GPU VM 116 and the TLB 117. The networked IOMMU 155 selectively translates the domain physical address into a physical address in the system memory 105 using one of the set of IOMMUs that is selected based on a type of a device that generated the memory access request. The types include a first type for the GPU 115 and a second type for peripheral devices such as the display 120, the camera 145, and the external storage component 150.
In the illustrated embodiment, the networked IOMMU 155 includes a primary IOMMU 160 that receives the memory access requests from the first address translation layer and secondary IOMMUs 165, 170 connected to the primary IOMMU 160 and disposed proximate to circuitry (not shown in
The networked IOMMU 155 performs address translations using translations that are stored in page tables 180. Each process that is executing on a device in the processing system 100 has a corresponding page table. The page table 180 for a process translates the device-generated (e.g., virtual) addresses that are being used by the process to physical addresses in the system memory 105. The primary IOMMU 160 and the secondary IOMMUs 165, 170 independently perform tablewalks of the page tables 180 to determine translations of addresses in the memory access requests. Translations that are frequently used by the networked IOMMU 155 are stored in TLBs 185, which cache frequently requested address translations. Separate TLBs 185 are associated with the primary IOMMU 160 and the secondary IOMMUs 165, 170. Entries including frequently used address translations are written from the page tables 180 into the TLBs 185 for the primary IOMMU 160 and the secondary IOMMUs 165, 170. The primary IOMMU 160 and the secondary IOMMUs 165, 170 are therefore independently able to access the address translations from the TLB 185 without the overhead of searching for the translation in the page table 180. Entries are evicted from the TLBs 185 to make room for new entries according to a TLB replacement policy. The TLB 185 is depicted as an integrated part of the networked IOMMU 155 in
The portion 300 of the processing system includes a networked IOMMU 325 to translate device-generated addresses in memory access requests to physical addresses in the GPU partition 320 or the portions 321-323. For example, a GPU VM and associated TLB can translate a virtual memory address in a memory access request to a domain physical address and provide the memory access request including the domain physical address to the networked IOMMU 325. In some embodiments, page tables are defined in response to allocation of the portions 321-323 to processes executing on the GPU 305. For example, virtual addresses used by a process executing on the GPU 305 are mapped to physical addresses in the portion 321 that is allocated to the process. The mapping is stored in entries of the page table associated with the process. The networked IOMMU 325 includes a set of IOMMUs and selectively translates the domain physical address into a physical address in the system memory 310 using one of the set of IOMMUs that is selected based on a type of a device that generated the memory access request. For example, a primary IOMMU in the set of IOMMUs translates the domain physical address into the physical address in the system memory 310 in response to receiving a memory access request from the GPU 305. For another example, the primary IOMMU bypasses the translation and provides the memory access request to a secondary IOMMU for translation in response to receiving a memory access request from a peripheral device such as a display or camera.
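As a simplified illustration of defining page table entries when a portion of the system memory is allocated to a process, the following C sketch maps a run of the process's virtual pages onto the physical frames of the allocated portion; the table size, field names, and frame numbers are hypothetical.

#include <stdint.h>
#include <stdio.h>

#define PT_SIZE 128u

/* One per-process page table: virtual page number -> physical frame number. */
typedef struct { uint64_t pfn[PT_SIZE]; uint8_t valid[PT_SIZE]; } page_table_t;

/* When a portion of system memory is allocated to a process, map the
 * process's virtual pages onto the physical frames of that portion. */
static void map_allocation(page_table_t *pt, uint64_t first_vpn,
                           uint64_t first_pfn, uint64_t num_pages)
{
    for (uint64_t i = 0; i < num_pages; ++i) {
        uint64_t vpn   = (first_vpn + i) % PT_SIZE;
        pt->pfn[vpn]   = first_pfn + i;
        pt->valid[vpn] = 1;
    }
}

int main(void)
{
    page_table_t pt = { 0 };
    map_allocation(&pt, /*first_vpn=*/8, /*first_pfn=*/0x4000, /*num_pages=*/4);
    printf("vpn 9 -> pfn 0x%llx\n", (unsigned long long)pt.pfn[9]);
    return 0;
}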
The memory access request includes a device-generated address such as a virtual address used by an application executing on or associated with the device 405. In the illustrated embodiment, virtualization-based security (VBS) provides memory protection (e.g., against kernel mode malware) using a two-level translation process that includes a first level translation 415 managed by an OS or device driver 420 and a second level translation 425 managed by a hypervisor 430. The first level translation 415 translates a device-generated address such as a virtual address in the memory access request to a domain physical address such as a GPU physical address. In some embodiments, the first level translation 415 is performed by a GPU VM and associated TLB, as discussed herein. The domain physical address is passed to the second level translation 425, which translates the domain physical address into a physical address that indicates a location within the system memory 410. As discussed herein, the second level translation 425 also verifies that the device 405 is authorized to access the region of the system memory 410 indicated by the physical address, e.g., using permission information that is encoded into entries in associated page tables and translation lookaside buffers (TLBs) that are used to perform the second level translation 425.
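The following C sketch illustrates, under simplifying assumptions, how the two levels compose: a driver-managed first-level table produces a domain physical frame, and a hypervisor-managed second-level table produces the system physical frame. Table sizes and names are hypothetical, and the permission checks are omitted for brevity.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u
#define N_PAGES    64u

/* Level 1 (driver-managed): virtual page -> domain physical frame. */
static uint64_t l1_table[N_PAGES];
/* Level 2 (hypervisor-managed): domain physical frame -> system physical frame. */
static uint64_t l2_table[N_PAGES];

static uint64_t translate_two_level(uint64_t vaddr)
{
    uint64_t off       = vaddr & ((1u << PAGE_SHIFT) - 1u);
    uint64_t dpa_frame = l1_table[(vaddr >> PAGE_SHIFT) % N_PAGES]; /* first level */
    uint64_t spa_frame = l2_table[dpa_frame % N_PAGES];             /* second level */
    return (spa_frame << PAGE_SHIFT) | off;
}

int main(void)
{
    l1_table[1] = 5;        /* virtual page 1 -> domain frame 5 */
    l2_table[5] = 0x1234;   /* domain frame 5 -> system frame 0x1234 */
    printf("0x%llx\n", (unsigned long long)translate_two_level((1u << PAGE_SHIFT) | 0x40));
    return 0;
}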
The networked IOMMU 505 receives memory access requests via a unified software interface 510 to a primary IOMMU 515. In the illustrated embodiment, the memory access requests are provided by software such as an IOMMU driver 520 that is implemented in the processing system. The IOMMU driver 520 receives the memory access requests from a first address translation layer, e.g., an address translation layer that includes a GPU VM and associated TLB (not shown in
The primary IOMMU 515 and the unified software interface 510 support an architected programming model that is targeted as a single device by system software (such as the IOMMU driver 520). Thus, the programming model does not require dedicated control mechanisms and software to operate disparate IOMMU hardware units for real-time and conventional direct memory access (DMA) processing. However, some device client blocks require IOMMU services that satisfy worst-case latency requirements, e.g., for video decoding and encoding, display frame buffer scanout, and the like. A single primary IOMMU 515 is not always able to satisfy these latency requirements.
At least in part to address the worst-case latency requirements of device client blocks, the networked IOMMU 505 includes one or more secondary IOMMUs 535, 536 that are deployed proximate corresponding device client blocks for peripheral devices. In the illustrated embodiment, the peripheral device circuitry includes display circuitry 540 that supports communication with a display such as the display 120 shown in
In operation, the primary IOMMU 515 performs address translations for memory access requests from devices of the first type (e.g., requests from a GPU) and bypasses performing address translations for memory access requests from devices of a second type (e.g., requests from peripheral devices). The primary IOMMU 515 forwards memory access requests from devices of the second type to corresponding secondary IOMMUs 535, 536. For example, the primary IOMMU 515 forwards memory access requests associated with a display to the secondary IOMMU 535 and memory access requests associated with a camera to the secondary IOMMU 536. Thus, the IOMMU driver 520 issues a single command to access the system memory 525 (e.g., a single memory access request) via the interface 510. The single command is then selectively handled by either the primary IOMMU 515 or one of the specialized secondary IOMMUs 535, 536. The IOMMU driver 520 therefore implements the access policy without being required to address the dedicated IOMMUs 515, 535, 536 separately or independently.
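For illustration only, the short C sketch below models the single software-visible entry point: the driver submits every request through one routine and the routing to the primary or a secondary IOMMU happens internally. The function name and the device set are assumptions introduced for this example.

#include <stdint.h>
#include <stdio.h>

typedef enum { DEV_GPU, DEV_DISPLAY, DEV_CAMERA } device_t;

/* A single entry point models the unified software interface: the driver
 * submits every request here and never addresses an individual IOMMU. */
static const char *networked_iommu_submit(device_t dev, uint64_t dpa)
{
    (void)dpa;
    switch (dev) {
    case DEV_GPU:     return "translated by the primary IOMMU";
    case DEV_DISPLAY: return "forwarded to the display secondary IOMMU";
    case DEV_CAMERA:  return "forwarded to the camera secondary IOMMU";
    }
    return "unknown device";
}

int main(void)
{
    printf("%s\n", networked_iommu_submit(DEV_GPU, 0x1000));
    printf("%s\n", networked_iommu_submit(DEV_DISPLAY, 0x2000));
    return 0;
}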
In response to receiving a memory access request from a device of the appropriate type, the primary IOMMU 515 or the secondary IOMMUs 535, 536 access entries in corresponding TLBs 545, 546, 547 (collectively referred to herein as “the TLBs 545-547”) to attempt to locate a translation of the address included in the memory access request. The entries in the TLBs 545-547 encode information indicating whether the requesting device is permitted to access the system memory 525. If the address hits in the corresponding TLB 545-547 and the device has the appropriate permissions, the memory access request is forwarded to the system memory 525. If the address misses in the corresponding TLB 545-547, the memory access request is forwarded to a corresponding page table 550, 551, 552, which returns the appropriate translation of the address. Entries in the TLBs 545-547 are updated based on the replacement policy implemented by the TLBs 545-547. If the device does not have the appropriate permissions, the memory access request is denied.
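The following C sketch is a simplified model of this lookup flow, assuming a tiny direct-mapped TLB and a flat page table: the TLB is probed first, a miss refills the entry from the page table, and the access is denied when the cached permissions do not cover the requested operation. Sizes, names, and the single-level table are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PERM_R (1u << 0)
#define PERM_W (1u << 1)
#define TLB_SZ 8u
#define PT_SZ  32u

typedef struct { uint64_t dpn, spn; uint32_t perms; bool valid; } entry_t;

static entry_t tlb[TLB_SZ];
static entry_t page_table[PT_SZ];

typedef enum { ACCESS_OK, ACCESS_DENIED } status_t;

/* Look up a domain page number: try the TLB first, walk the page table on a
 * miss (refilling the TLB), and deny the access if permissions do not match. */
static status_t lookup(uint64_t dpn, uint32_t requested, uint64_t *spn)
{
    entry_t *e = &tlb[dpn % TLB_SZ];
    if (!(e->valid && e->dpn == dpn))            /* TLB miss: walk page table */
        *e = page_table[dpn % PT_SZ];            /* refill per replacement policy */
    if (!e->valid || e->dpn != dpn || (e->perms & requested) != requested)
        return ACCESS_DENIED;
    *spn = e->spn;
    return ACCESS_OK;
}

int main(void)
{
    page_table[4] = (entry_t){ .dpn = 4, .spn = 0x9000, .perms = PERM_R, .valid = true };
    uint64_t spn;
    printf("read:  %s\n", lookup(4, PERM_R, &spn) == ACCESS_OK ? "ok" : "denied");
    printf("write: %s\n", lookup(4, PERM_W, &spn) == ACCESS_OK ? "ok" : "denied");
    return 0;
}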
Some embodiments of the networked IOMMU 505 include a command queue 530 that receives memory access requests from the IOMMU driver 520 and stores the access requests before they are issued to the primary IOMMU 515. The command queue 530 allows system software to initiate page table and device rescans that are forwarded to the primary IOMMU 515 or the secondary IOMMUs 535, 536, which are therefore able to cache relevant data in the corresponding TLBs 545-547. Some embodiments of the command queue 530 also allow rescans and synchronization of system software with hardware units to ensure that the software does not modify table data that is in flight.
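A minimal C sketch of such a command queue follows, assuming a small ring buffer and three command kinds (TLB invalidation, device rescan, and synchronization); the command set, targets, and queue depth are hypothetical and chosen only to illustrate forwarding commands to the primary or a secondary IOMMU.

#include <stdio.h>

typedef enum { CMD_INVALIDATE_TLB, CMD_RESCAN_DEVICE, CMD_SYNC } cmd_kind_t;
typedef enum { TARGET_PRIMARY, TARGET_SECONDARY_DISPLAY, TARGET_SECONDARY_CAMERA } target_t;

typedef struct { cmd_kind_t kind; target_t target; } command_t;

#define QUEUE_DEPTH 16u
static command_t queue[QUEUE_DEPTH];
static unsigned head, tail;

static int enqueue(command_t c)
{
    if ((tail + 1) % QUEUE_DEPTH == head) return -1;   /* queue full */
    queue[tail] = c;
    tail = (tail + 1) % QUEUE_DEPTH;
    return 0;
}

/* Drain the queue in order: process primary-targeted commands locally and
 * forward the rest to the matching secondary IOMMU.  Because commands are
 * handled in order, a sync command is processed only after the commands
 * queued before it, which is how software avoids modifying table data that
 * is still in flight. */
static void process_queue(void)
{
    while (head != tail) {
        command_t c = queue[head];
        head = (head + 1) % QUEUE_DEPTH;
        if (c.target == TARGET_PRIMARY)
            printf("primary IOMMU handles command %d\n", c.kind);
        else
            printf("forwarded command %d to secondary IOMMU %d\n", c.kind, c.target);
    }
}

int main(void)
{
    enqueue((command_t){ CMD_INVALIDATE_TLB, TARGET_PRIMARY });
    enqueue((command_t){ CMD_RESCAN_DEVICE, TARGET_SECONDARY_DISPLAY });
    enqueue((command_t){ CMD_SYNC, TARGET_PRIMARY });
    process_queue();
    return 0;
}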
The SOC device translation block 605 includes a primary IOMMU 610 that receives memory access requests from devices including a graphics pipeline (GFX) 615 in a GPU and peripheral devices such as a display 620, a camera 625, and the like. In some embodiments, the memory access requests are received from a first address translation layer, e.g., an address translation layer that is implemented using a GPU VM and TLB, and the memory access requests include a domain physical address generated by the first address translation layer. The memory access requests are used to access system memory such as DRAM 630. As discussed herein, the primary IOMMU 610 selectively performs address translations on the addresses included in the memory access requests based on the type of the device that issued the request. Memory access requests from device types that are not translated by the primary IOMMU 610 are forwarded to a distributed remote IOMMU network 635 that includes one or more IOMMUs associated with the display 620, the camera 625, and other peripheral devices. Some embodiments of the distributed remote IOMMU network 635 are implemented using the secondary IOMMUs 535, 536 shown in
In operation, a kernel mode driver or memory manager 650 provides signaling 655 to configure address translation tables such as page tables that are used to translate virtual addresses to GPU physical addresses (or domain physical addresses), e.g., using a first layer of address translation performed by a GPU VM and associated TLB. The memory manager 650 also provides virtual addresses 656 such as GPU virtual addresses to the virtual-to-physical manager 645. A hypervisor or hypervisor abstraction layer (HAL) 660 manages system physical page tables and access permissions stored in the DRAM 630. The HAL 660 also configures the primary IOMMU 610 in the SOC device translation block 605. The GFX 615 attempts to translate virtual addresses using the translation cache 640. If the attempt hits in the translation cache 640, the returned address translation is used for further processing. If the attempt misses in the translation cache 640, the request is forwarded to the primary IOMMU 610, which handles the subsequent address translation as discussed herein. The primary IOMMU 610 and the distributed remote IOMMU network 635 are also able to access the DRAM 630 to perform page table walks, as discussed herein.
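The hit/miss path described above can be sketched in C as follows, with the primary IOMMU reduced to a placeholder routine; the cache size, names, and refill policy are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define CACHE_SZ 8u

typedef struct { uint64_t vpn, frame; bool valid; } cache_entry_t;
static cache_entry_t translation_cache[CACHE_SZ];

/* Stand-in for the primary IOMMU path taken on a translation-cache miss. */
static uint64_t primary_iommu_translate(uint64_t vpn) { return vpn + 0x10000u; }

/* Graphics pipeline: probe the local translation cache first and only fall
 * back to the primary IOMMU (and a page table walk) on a miss. */
static uint64_t gfx_translate(uint64_t vpn)
{
    cache_entry_t *e = &translation_cache[vpn % CACHE_SZ];
    if (e->valid && e->vpn == vpn)
        return e->frame;                      /* cache hit */
    uint64_t frame = primary_iommu_translate(vpn);
    *e = (cache_entry_t){ .vpn = vpn, .frame = frame, .valid = true };
    return frame;
}

int main(void)
{
    printf("miss path: 0x%llx\n", (unsigned long long)gfx_translate(7));
    printf("hit path:  0x%llx\n", (unsigned long long)gfx_translate(7));
    return 0;
}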
At block 705, the networked IOMMU receives a memory access request from a device of a particular type. Examples of device types include a graphics processor type, a peripheral device type, and the like. In some embodiments, the memory access request is received from a first address translation layer, e.g., an address translation layer that is implemented using a GPU VM and TLB, and the memory access request includes a domain physical address generated by the first address translation layer.
At decision block 710, the networked IOMMU determines the type of device that issued the memory access request, e.g., based on information included in the request. If the type of device that issued the memory access request is a peripheral device, the method 700 flows to block 715. If the type of device that issued the memory access request is a GPU device, the method 700 flows to block 720.
At block 715, the primary IOMMU in the networked IOMMU bypasses address translation for memory access requests from peripheral device types. The method 700 then flows to block 725 and the primary IOMMU forwards the memory access request to a secondary IOMMU associated with the requesting device. For example, the primary IOMMU forwards the memory access request to a secondary IOMMU integrated in display circuitry in response to the memory access request being from a display. The secondary IOMMU then performs the address translation at block 730.
At block 720, the primary IOMMU in the networked IOMMU performs the requested address translation for memory access requests from the GPU.
A computer readable storage medium includes any non-transitory storage medium, or combination of non-transitory storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
In some embodiments, certain aspects of the techniques described above are implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software includes the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer readable storage medium includes, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM) or other non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer readable storage medium are in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.