This application claims priority pursuant to 35 U.S.C. 119(a) to United Kingdom Application No. 2114314.4, filed Oct. 6, 2021, which application is incorporated herein by reference in its entirety.
This disclosure relates to circuitry and methods.
Graphics processing units (GPUs) are used to perform rendering and other processing operations which may be related to the generation of image data. It is also known to use GPUs for other processing operations, and indeed to use other types of processors to perform graphics or non-graphics processing.
The present disclosure concerns potential improvements in such arrangements.
In an example arrangement there is provided circuitry comprising:
In another example arrangement there is provided a method comprising:
In another example arrangement there is provided a method comprising:
Further respective aspects and features of the present technology are defined by the appended claims.
The present technique will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which:
Circuitry Overview
In operation, each of the CPU 100 and the GPU 110 may perform respective processing tasks and, more generally, may be any device defined as a “processor” in the discussion above; one or more other devices falling within this definition may also be provided. For example, tasks performed by the CPU 100 may relate to control operations and tasks performed by the GPU 110 may relate to data handling operations such as image or video data rendering. However, this is just one example and other types of operations may be performed. Indeed, the use of a CPU 100 and a GPU 110 is also just one schematic example and other types and/or numbers of processors may be employed.
In the example shown, each of the CPU 100 and the GPU 110 may comprise, for example, respective execution engine circuitry having one or more cache memories. The various cache memories can form a hierarchy, so that if a respective execution engine circuitry requires access to a data item (which may represent a processing instruction and/or data to be handled by a processing instruction), it will first try to obtain or access that data item in a level 1 cache memory. In the case of a cache miss, a search will be performed through the next closest cache memory levels, with an access to the memory circuitry 142 of the main memory being used only if the attempted cache memory accesses all miss. When the required data item is obtained from the memory circuitry 142, a copy may be saved in one or more of the cache memories.
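By way of illustration only, the hierarchical lookup just described might be sketched as follows in C. This is a minimal sketch under stated assumptions, not the described circuitry: the one-entry-per-level caches, the names (cache_line, hierarchical_read) and the placeholder memory contents are all invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LEVELS 3  /* e.g. level 1, level 2 and level 3 caches */

/* One-entry "cache" per level, purely for illustration. */
typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t data;
} cache_line;

static cache_line levels[NUM_LEVELS];

/* Stand-in for an access to the memory circuitry 142. */
static uint64_t read_main_memory(uint64_t addr)
{
    return addr ^ 0xABCDu;  /* arbitrary placeholder contents */
}

/* Try the closest cache level first; on a miss at every level, access main
 * memory and save a copy back into the caches, as described above. */
static uint64_t hierarchical_read(uint64_t addr)
{
    for (int i = 0; i < NUM_LEVELS; i++)
        if (levels[i].valid && levels[i].tag == addr)
            return levels[i].data;              /* hit at this level */

    uint64_t data = read_main_memory(addr);     /* all cache accesses missed */
    for (int i = 0; i < NUM_LEVELS; i++) {
        levels[i].valid = true;                 /* keep copies for next time */
        levels[i].tag   = addr;
        levels[i].data  = data;
    }
    return data;
}
```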
The main memory 140 comprises memory circuitry 142 and a memory controller 144 to control access to and from the memory circuitry 142, and may be associated with a cache memory such as a so-called level 3 cache memory.
In some examples, for a write, the system may fetch the line (as a “line fill”) and then allocate that line in a cache. A write can then be performed into the line. Alternatively, a line can be allocated in the cache, and data written into the line. However, in this case (unless the entire line is written) there may be a need to keep information indicating which portions of the line were written (and which portions were not).
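The second alternative, allocating a line without a preceding line fill, can be illustrated with a short sketch that keeps a per-byte record of which portions of the line were written. The structure and function names here are hypothetical, and the per-byte mask is just one possible representation of that information.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64  /* bytes per cache line (illustrative) */

/* A line allocated for a write without a line fill: the mask records which
 * bytes have actually been written, since the rest of the line holds no
 * valid data. */
typedef struct {
    uint8_t data[LINE_SIZE];
    bool    written[LINE_SIZE];  /* per-byte "was this written?" flag */
} write_allocated_line;

static void allocate_for_write(write_allocated_line *line)
{
    memset(line->written, 0, sizeof line->written);  /* nothing written yet */
}

static void write_bytes(write_allocated_line *line, int offset,
                        const uint8_t *src, int len)
{
    for (int i = 0; i < len; i++) {
        line->data[offset + i]    = src[i];
        line->written[offset + i] = true;  /* track the written portions */
    }
}

/* If the entire line has been written, the mask is no longer needed and the
 * line can be treated as if it had been line-filled. */
static bool line_fully_written(const write_allocated_line *line)
{
    for (int i = 0; i < LINE_SIZE; i++)
        if (!line->written[i])
            return false;
    return true;
}
```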
Although they are drawn as single respective entities, the processors 100, 110 may in fact be embodied as multi-core processors or clusters; for example, there may be a cluster of eight CPUs and/or a cluster of sixteen GPUs.
In the examples discussed below, features of the present techniques are applied to example rendering processing operations performed by the GPU. However, it will be appreciated that the techniques may also be applied to processing operations other than rendering operations, and to operations performed by types of processors (such as the CPU 100) other than a GPU. In examples, however, the processing circuitry comprises graphics processing circuitry configured to perform, as the processing operation, a graphics rendering operation such as a tile-based deferred rendering operation. In examples, each graphics rendering operation relates to a respective image tile of a graphics image to be rendered.
A job manager 200 controls the execution of processing tasks or jobs, for example tasks or jobs established by the CPU 100, with the GPU-specific execution being performed by a set of shader cores 210 and tiler circuitry 220. In some examples, an on-chip memory 255 can be provided to store data being operated upon by the GPU. This provides an example of memory circuitry to provide storage according to the physical memory addresses in a physical memory address space, and an example in which the memory circuitry and the processing circuitry are fabricated as a common integrated circuit device.
The shader cores are processing units specifically optimized or designed for handling instructions, for example in the form of shader code, in order to manipulate pixels and polygon vertices within an image so as to render portions of that image.
The tiler circuitry handles portions of the GPU rendering operations, these portions corresponding to discrete regions or tiles of the rendered image. This process (of dividing the overall processing into tasks or regions) can reduce the instantaneous memory and data transfer requirements which occur during the rendering process by the GPU 110. The job manager 200 allocates jobs to the shader cores 210 and to the tiler circuitry 220.
In the drawing of
Detector circuitry 230 and control circuitry 240 are provided and perform respective functions to be described below.
Therefore,
Note that the circuitry may provide two or more instances of processing circuitry 210, 220, each associated with control circuitry; and allocation circuitry 200 to allocate program code defining respective processing operations of a plurality of processing operations to the two or more instances of the processing circuitry.
Tile-Based Deferred Rendering (TBDR)
Examples of the present techniques are particularly applicable to TBDR techniques, in part because such TBDR techniques often perform fragment processing (to be discussed below) using on-chip storage. However, the present techniques are potentially applicable to other graphics or non-graphics techniques such as so-called immediate mode rendering (IMR).
In general terms, in a TBDR architecture, so-called graphics “primitives” are sorted such that fragment processing can be handled in independent tiles which are processed locally, for example using an on-chip frame buffer storage. The GPU job manager 200 allocates tiles to the (potentially plural) shader cores 210 for fragment processing.
In the geometric pass 300, operations are performed relating to vertex shading, tessellation, geometry calculations and the generation of three-dimensional coordinates of post-transformation positions. The tiler calculates where each primitive may be visible. The frame buffer is treated as multiple tiles (for example, 32×32 pixel squares) and for each primitive, a detection is made as to which of these tiles it may potentially touch. The processing associated with a primitive is culled when the primitive is obviously not visible. The output of the geometric pass 300 is that the tiler produces a polygon list of all primitives which are at least potentially visible in the final rendered image.
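A hedged sketch of this binning step follows: each potentially visible primitive is conservatively marked against every 32×32 pixel tile that its screen-space bounding box may touch. The primitive structure and the visibility flag are assumptions made for the example rather than details of the tiler circuitry 220.

```c
#include <stdbool.h>

#define TILE_SIZE 32  /* pixels per tile side, as in the example above */

/* Screen-space bounding box of a primitive, plus a visibility flag. */
typedef struct {
    float x0, y0, x1, y1;
    bool  visible;
} primitive;

/* Conservatively mark every tile the primitive's bounding box may touch; a
 * primitive that is obviously not visible produces no entries at all. */
static void bin_primitive(const primitive *p, int tiles_x, int tiles_y,
                          bool binned[tiles_y][tiles_x])
{
    if (!p->visible)
        return;  /* culled: skipped entirely in the fragment pass */

    int tx0 = (int)(p->x0 / TILE_SIZE), ty0 = (int)(p->y0 / TILE_SIZE);
    int tx1 = (int)(p->x1 / TILE_SIZE), ty1 = (int)(p->y1 / TILE_SIZE);

    if (tx0 < 0) tx0 = 0;
    if (ty0 < 0) ty0 = 0;
    if (tx1 >= tiles_x) tx1 = tiles_x - 1;
    if (ty1 >= tiles_y) ty1 = tiles_y - 1;

    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            binned[ty][tx] = true;  /* primitive may touch this tile */
}
```

The per-tile entries produced this way form the polygon list which the fragment pass then walks tile by tile.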
At the fragment pass 310, for each 32×32 pixel tile, a walk is performed through the polygon list so that each primitive is considered in turn and any visible effect of the primitive relating to that tile is rasterized and shaded. In other words, rendering is deferred to the fragment pass 310.
In the fragment pass 310, one tile is processed at a time and jobs are allocated for processing on a tile-by-tile basis. Techniques to be described below allow for the potential culling or at least potentially early termination of tile processing jobs, so that processing can move on to a next job. This can potentially allow for a power saving (by not requiring the processing of jobs which can be culled or at least terminated early using these techniques) and/or a performance improvement by allowing more tiles to be processed within a given time period such as an output frame period in a video generation system.
Relating the steps of
Sparse Resources
So-called sparse resources represent a feature available in various processors including at least some example GPUs to allow control of the memory backing of buffers. The use of sparse resources allows information such as graphical textures, which are themselves too large to fit into available memory resources such as the on-chip memory 255, to be handled. A virtual memory region is established as a plurality of smaller sections or control granules (typically of 64 kB but other sizes could be used). Some of these control granules are established to be memory-backed, which is to say they are mapped to corresponding sections of physical memory such as the on-chip memory 255, and some are not mapped to physical memory (“unmapped”). The mapping can be varied from time to time, for example under application program control as discussed below (unlike some other arrangements in which any changes to the mapping of virtual addresses to physical addresses always require a higher security level such as that applicable to an operating system or hypervisor). In an example use of this technique, a large texture can be established for use in many image scenes, with the memory backing being changed according to current requirements for a subset of the texture. This avoids the repeated allocation of a virtual memory address space and simply requires a change of the physical backing from time to time.
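As a hedged illustration of the control-granule idea, the following sketch models per-granule backing state for a sparse resource. The 64 kB constant follows the description above; the structure and function names are invented for the example, and a real implementation would of course also manage the physical pages themselves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SIZE (64 * 1024)  /* 64 kB control granule, as above */

/* Per-granule state of a sparse resource: each granule of the virtual
 * region either has physical backing or is unmapped. */
typedef struct {
    uint64_t virtual_base;
    size_t   num_granules;
    bool    *backed;  /* backed[i]: granule i is memory-backed */
} sparse_resource;

/* Change the backing of part of the resource, e.g. when a different subset
 * of a large texture is needed for the current scene; the virtual address
 * range itself stays allocated throughout. */
static void set_granule_backing(sparse_resource *r, size_t granule, bool mapped)
{
    if (granule < r->num_granules)
        r->backed[granule] = mapped;
}

static bool address_is_backed(const sparse_resource *r, uint64_t va)
{
    size_t granule = (size_t)((va - r->virtual_base) / GRANULE_SIZE);
    return granule < r->num_granules && r->backed[granule];
}
```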
Sparse resources can be defined with respect to any of: input images, output color buffers, textures and the like.
As background to the following discussion,
The virtual address (VA) 400 is provided to a buffer or cache of translation information referred to as a translation lookaside buffer (TLB) 402. If the required translation information is held by the TLB, then it is output either as a physical address (PA) 404, assuming that the current memory address is mapped to physical memory, or as an indicator in place of the PA when the current memory address is unmapped; in either case the output is provided to a memory controller 405. If the required translation information is not currently held by the TLB, then the TLB consults translation circuitry 410, which in turn can reference one or more page tables 420 providing translation and, optionally, mapping information. At least the mapping information can be updated when required; when such an update takes place, any corresponding translation information held by the TLB 402 is invalidated. A driver, together with the operating system, manages the page tables. The driver may take input from the application to manage sparse resource mappings. The GPU is a consumer of the page tables and, from the GPU's perspective, they are read-only.
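This translation flow might be modeled as in the following sketch: a deliberately simplified, direct-mapped TLB with a trivial page-table walk, where a false return stands in for the "unmapped" indicator. All names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  16  /* 64 kB pages, matching the control granule size */

typedef struct {
    bool     valid;   /* entry holds a translation */
    bool     mapped;  /* the page has physical backing */
    uint64_t vpn;     /* virtual page number */
    uint64_t ppn;     /* physical page number */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the translation circuitry 410 referencing the page tables
 * 420; here odd pages are arbitrarily treated as unmapped. */
static bool walk_page_tables(uint64_t vpn, uint64_t *ppn)
{
    *ppn = vpn;               /* trivial identity "translation" */
    return (vpn & 1) == 0;
}

/* Translate a VA via the TLB 402; an unmapped address yields an indicator
 * (here, a false return) in place of a PA. */
static bool translate(uint64_t va, uint64_t *pa)
{
    uint64_t   vpn = va >> PAGE_SHIFT;
    tlb_entry *e   = &tlb[vpn % TLB_ENTRIES];

    if (!e->valid || e->vpn != vpn) {   /* TLB miss: consult the page tables */
        e->vpn    = vpn;
        e->mapped = walk_page_tables(vpn, &e->ppn);
        e->valid  = true;               /* invalidated again when remapped */
    }
    if (!e->mapped)
        return false;                   /* unmapped: no PA is produced */

    *pa = (e->ppn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
    return true;
}
```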
Assuming a valid PA is provided to the memory controller 405, the memory controller accesses the on-chip memory 255 using that PA and either writes the data 430 to it or reads the contents at that PA, returning the read data to the memory controller 405 for output as the data 440. Significantly, and as discussed further below, a write operation to an unmapped sparse resource is quietly discarded (which is to say, it is discarded without generating a fault), and in the case of a read operation from an unmapped sparse resource, the memory controller 405 outputs zero as the read data 440 without generating a fault condition.
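The fault-free behavior for unmapped sparse accesses, which underpins the techniques below, might be modeled as in this sketch. translate() and physical_ptr() are assumed helpers (the former as in the TLB sketch above), not part of any real API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: translate() produces a PA only for mapped addresses
 * (see the TLB sketch above); physical_ptr() reaches the backing storage. */
extern bool  translate(uint64_t va, uint64_t *pa);
extern void *physical_ptr(uint64_t pa);

/* A write to an unmapped sparse address is quietly discarded: no fault. */
static void sparse_write(uint64_t va, const void *src, size_t len)
{
    uint64_t pa;
    if (!translate(va, &pa))
        return;                          /* discard silently */
    memcpy(physical_ptr(pa), src, len);
}

/* A read from an unmapped sparse address returns zero: again, no fault. */
static void sparse_read(uint64_t va, void *dst, size_t len)
{
    uint64_t pa;
    if (!translate(va, &pa)) {
        memset(dst, 0, len);             /* read-as-zero */
        return;
    }
    memcpy(dst, physical_ptr(pa), len);
}
```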
Therefore,
Referring to
Side Effects
So-called “side-effects” will now be discussed as further background to the present techniques.
In
In
The present techniques recognize that where a job to be performed (for example) as part of the fragment pass 310 accesses entirely unmapped memory addresses in a sparse resource system, that job need not be performed, or at least need not be completed, with no adverse effect and potentially with a power or resource benefit by culling or early-terminating that job.
The detector circuitry 230 is responsive to address information defining a range of addresses within a sparse resource to be accessed by that job and optionally to flag data (discussed further below) indicative of whether the job implements a side effect as discussed above. The detector circuitry 230 queries the MMU 250 to find the mapping status applicable to the address information using mapping information 1210, for example of the form described with reference to
Therefore, in these examples, the detector circuitry is configured to detect the indicator data stored by the one or more memory page tables in respect of virtual memory addresses defined by the memory region.
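Putting the pieces together, a hedged sketch of the resulting decision follows. granule_is_mapped() is an assumed stand-in for the MMU 250 query of the indicator data; base is assumed to be granule-aligned, and a job carrying the side-effect flag is conservatively never culled outright (its handling is refined below).

```c
#include <stdbool.h>
#include <stdint.h>

#define GRANULE_SIZE (64u * 1024u)

/* Hypothetical MMU query: current mapping status of one control granule. */
extern bool granule_is_mapped(uint64_t va);

/* A job whose entire target memory region is unmapped, and which implements
 * no side effect, may be culled or, if already started, terminated early. */
static bool job_can_be_culled(uint64_t base, uint64_t size, bool side_effect)
{
    if (side_effect)
        return false;      /* further outputs exist: run the job */

    for (uint64_t va = base; va < base + size; va += GRANULE_SIZE)
        if (granule_is_mapped(va))
            return false;  /* at least one granule is backed: run the job */

    return true;           /* entirely unmapped: any writes would be quietly
                              discarded, so the job need not run */
}
```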
Based upon the information 1220, the control circuitry 240 controls the shader core 210 as follows:
In other examples, this information (provided by the indicator data) does not have to be stored in the page tables, even though this provides a convenient place to store it. Other options are possible. Another example of keeping the mapping information would be to have a shadow buffer for each texture with (for example) one bit of information per physical page of the primary buffer to indicate the current mapping status.
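For instance, the shadow-buffer alternative might look like the following sketch, with one mapping-status bit per physical page packed into a byte array; the names and layout are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shadow buffer for a texture: one bit per physical page of the primary
 * buffer, recording the current mapping status. */
typedef struct {
    uint8_t *bits;      /* packed bitmap; bit i set = page i is mapped */
    size_t   num_pages;
} shadow_map;

static void shadow_set(shadow_map *m, size_t page, bool mapped)
{
    if (page >= m->num_pages)
        return;
    if (mapped)
        m->bits[page / 8] |= (uint8_t)(1u << (page % 8));
    else
        m->bits[page / 8] &= (uint8_t)~(1u << (page % 8));
}

static bool shadow_get(const shadow_map *m, size_t page)
{
    if (page >= m->num_pages)
        return false;
    return (m->bits[page / 8] >> (page % 8)) & 1u;
}
```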
The shader core 210 requests a next job from the job manager 200 in response to culling or termination of a current job, whether that termination is early or at completion of the current job. This provides an example in which in response to the control circuitry inhibiting completion of a processing operation by a given instance of the processing circuitry, the allocation circuitry is configured to allocate program code defining a next processing operation to the given processing circuitry.
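The interaction between a shader core and the job manager 200 might then be sketched as a simple loop; the three helper functions are hypothetical stand-ins for the allocation, detection and execution described above.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int tile_x, tile_y; } job;  /* e.g. one tile per job */

/* Hypothetical job-manager and control interfaces. */
extern job *next_job(void);             /* NULL when no jobs remain */
extern bool should_cull(const job *j);  /* detector/control decision */
extern void run_job(const job *j);      /* fragment processing proper */

/* A shader core requests the next job whenever the current one is culled,
 * terminated early, or runs to completion. */
static void shader_core_loop(void)
{
    for (job *j = next_job(); j != NULL; j = next_job()) {
        if (should_cull(j))
            continue;  /* inhibited: move straight on to the next job */
        run_job(j);
    }
}
```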
Referring to
In
By way of summary of the techniques described above,
It will therefore be apparent that in at least some examples, the control circuitry 240 is configured to inhibit execution of the processing operation in response to a detection by the detector circuitry that the memory region applicable to that processing operation is entirely within the unmapped subset of the virtual memory address space. In other examples, early termination of an already started operation can be provided. It will be noted that either outcome is a potential optimization or improvement; if the detection and control operations do not inhibit or terminate operation even in a situation in which potentially they could do so, the situation is no worse than it would have been without the present techniques.
In the case that a side effect is detected, the detector circuitry may, as discussed above, be configured to detect whether the program code defines one or more further operations to generate output data other than the processed data stored to the memory region of the virtual memory address space; and the control circuitry may be configured to allow completion of the one or more further operations while inhibiting completion of the processing operation, in response to a detection by the detector circuitry that the program code defines one or more such further operations.
Accordingly,
Note that as part of the compilation process described here, the generated code itself can contain instructions which:
Referring to
General Matters
In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
Although illustrative embodiments of the present techniques have been described in detail herein with reference to the accompanying drawings, it is to be understood that the present techniques are not limited to those precise embodiments, and that various changes, additions and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the techniques as defined by the appended claims. For example, various combinations of the features of the dependent claims could be made with the features of the independent claims without departing from the scope of the present techniques.
Foreign Application Priority Data

| Number | Date | Country | Kind |
| --- | --- | --- | --- |
| 2114314 | Oct. 2021 | GB | national |