The present disclosure relates to performing graphics processing, and more particularly to precision-modulated shading (PMS) in a graphics processing pipeline.
Floating-point number formats are computer number formats which may be useful in computer processing tasks such as graphics processing and artificial intelligence (AI). The single-precision floating-point format, also referred to as an FP32 format or a float32 format, is a 32-bit floating-point number format which is defined in the Institute of Electrical and Electronics Engineers (IEEE) 754 standard. The FP32 format may be useful because it may allow a relatively large dynamic range to be represented with relatively high precision, but FP32 operations may require a large amount of processing and memory capacity because 32 bits are used to represent each floating-point number.
Recently, other floating-point number formats, for example a 16-bit brain floating-point format referred to as BF16 format or a bfloat16 format, have been used to reduce processor and memory requirements in graphics processing and AI operations. However, the precision which floating-point number formats such as the BF16 format are capable of representing may be significantly lower than the precision of the FP32 format. This reduced precision may be suitable for some operations, but may cause significant processing errors when used for other operations. Therefore, when less-precise floating-point number formats such as the BF16 format are applied to all instructions in a processing pipeline, the results provided by the processing pipeline, for example graphics rendering results, may be significantly corrupted in comparison with results provided by more precise number formats such as the FP32 format.
Therefore, there is a need for methods and apparatuses which may selectively apply different floating-point number formats to instructions in a processing pipeline in order to reduce processing and memory requirements while reducing errors and corruptions in processing results.
Provided are apparatuses and methods for performing precision-modulated shading (PMS).
Also provided are apparatuses and methods for inserting brain floating-point operations having different precisions in appropriate locations in a processing pipeline by identifying code sections, or for example instructions, which are sensitive to a precision that will affect processing results such as rendering quality.
Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments.
In accordance with an aspect of the disclosure, a method of performing precision-modulated shading (PMS) using a graphics processing unit (GPU) includes: obtaining a shading instruction corresponding to a floating-point operand; determining a precision mode which applies to the shading instruction from among a plurality of precision modes for processing shading instructions; and based on the determined precision mode, truncating the floating-point operand, and executing the shading instruction using the truncated floating-point operand.
In accordance with an aspect of the disclosure, a method of performing precision-modulated shading (PMS) using a graphics processing unit (GPU) includes: obtaining a control flow graph including a plurality of shading instructions; setting a precision mode for the plurality of shading instructions to be a default precision mode, wherein the default precision mode corresponds to a first precision level; evaluating each instruction in the plurality of shading instructions to determine whether to apply a modified precision mode, wherein the modified precision mode corresponds to a second precision level that is different from the first precision level; and based on the default precision mode being applied to a first shading instruction from among the plurality of shading instructions, controlling at least one shader processor to: set a mode register included in the at least one shader processor to a first value corresponding to the default precision mode, truncate a first floating-point operand corresponding to the first shading instruction, and execute the first shading instruction using a computation module included in the at least one shader processor based on the truncated first floating-point operand.
In accordance with an aspect of the disclosure, a graphics processing unit (GPU) for performing precision-modulated shading (PMS), includes at least one shader processor configured to: obtain a shading instruction corresponding to a floating-point operand; determine a precision mode which applies to the shading instruction from among a plurality of precision modes for processing shading instructions; and based on the determined precision mode, truncate the floating-point operand, and execute the shading instruction using the truncated floating-point operand.
In accordance with an aspect of the disclosure, a device for performing precision-modulated shading (PMS) includes a graphics processing unit (GPU) including at least one shader processor, wherein the at least one shader processor includes a mode register and a computation module; and at least one controller configured to: obtain a control flow graph including a plurality of shading instructions; set a precision mode for the plurality of shading instructions to be a default precision mode, wherein the default precision mode corresponds to a first precision level; and evaluate each instruction in the plurality of shading instructions to determine whether to apply a modified precision mode, wherein the modified precision mode corresponds to a second precision level that is different from the first precision level, wherein based on the default precision mode being applied to a first shading instruction from among the plurality of shading instructions, the at least one controller is further configured to control the at least one shader processor to set the mode register to a first value corresponding to the default precision mode, truncate a first floating-point operand corresponding to the first shading instruction, and execute the first shading instruction using the computation module based on the truncated first floating-point operand.
The above and other aspects, features, and advantages of certain embodiments of the present disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
Advantages and features of embodiments of the disclosure, and methods of achieving them, will be more apparent with reference to the description below in conjunction with the accompanying drawings. However, embodiments are not limited thereto. In addition, specific configurations described only in a particular embodiment may be used in other embodiments. Throughout the description below, the same reference numerals may generally refer to the same elements.
The terminology used herein is for the purpose of describing example embodiments and is not intended to limit the scope of the disclosure. In this specification, the singular also includes the plural, unless specifically stated otherwise in the phrase. As used herein, “comprises” and/or “comprising” may mean that a recited element, step, operation, and/or apparatus does not exclude the presence or addition of one or more other elements, steps, operations, and/or apparatuses.
Unless otherwise defined, all terms (including technical and scientific terms) used herein may be used with the meaning commonly understood by those of ordinary skill in the art to which this disclosure belongs. In addition, terms defined in a commonly used dictionary are not to be interpreted in an idealized or excessively formal sense unless expressly so defined herein.
In addition, before proceeding with the detailed description that follows, definitions of certain words and phrases used herein are set forth. The terms “comprise” and “include” and derivatives of the terms “comprise” and “include” denote inclusive without limitation. The word “connects” and derivatives of the word “connect” refer to any direct or indirect communication between two or more components, whether or not the two or more components are in physical contact with each other. The terms “transmit”, “receive”, and “communicate”, and derivatives of the terms “transmit”, “receive”, and “communicate” include both direct and indirect communication. The word “or” is an inclusive word meaning ‘and/or’. The word “related to” and derivatives of “related to” denote to include, to be included in, to interconnect with, to imply, to be implied in, to connect with, to combine with, to communicate with, to cooperate with, to intervene, to place alongside, to approximate, to be bound by, to have, to have the characteristics of, to relate to, and the like. The term “controller” denotes any apparatus, system, or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. Functions associated with any particular controller may be centralized or distributed, either locally or remotely. The phrase “at least one”, when used with a list of items, denotes that different combinations of one or more of the listed items may be used, and that only one item in the list may be required. For example, “at least one of A, B, and C” includes any one of combinations of A, B, C, A and B, A and C, B and C, and A, B and C.
In addition, various functions described below may be implemented or supported by artificial intelligence technology or one or more computer programs, and each of the programs may include computer-readable program code and may be embodied in a computer-readable medium. The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or portions thereof suitable for implementation of suitable computer-readable program code. The term “computer-readable program code” includes computer code of any type, including source code, object code, and executable code. The term “computer-readable medium” includes any type of medium that may be accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disk (CD), a digital video disk (DVD), or any other type of memory. A “non-transitory” computer-readable medium excludes wired, wireless, optical, or other communication links that transmit transitory electrical or other signals. Non-transitory computer-readable media includes media in which data may be permanently stored, and media in which data is stored and may be overwritten later, such as a rewritable optical disc or a removable memory apparatus.
In various example embodiments described below, a hardware approach is described as an example. However, because various example embodiments include technology using both hardware and software, the various example embodiments do not exclude a software-based approach.
In addition, terms used in the description below are examples for convenience of description. Accordingly, the example embodiments are not limited to the terms described below, and other terms having equivalent technical meanings may be used.
In embodiments, a thread may refer to a smallest sequence of instructions which can be managed independently, and a thread block may refer to a group of threads which may be executed serially or in parallel. A wave or warp may refer to a group of thread blocks which run concurrently. In embodiments, the shader pipe input module 323 may allocate resources and assign waves to available wave slots in the one or more shader modules 321 for execution. The shader module 321 may schedule waves to interleave execution of their instructions and may control how instructions are executed. In embodiments, the SIMD module 3211 may process a single instruction on multiple pieces of data, for example data corresponding to multiple threads. In embodiments, the SIMD module 3211 may be a SIMD32 module capable of processing data corresponding to thirty-two threads, but embodiments are not limited thereto. The SIMD module 3211 may execute the instructions according to a precision level corresponding to a precision mode, which may be indicated by a value stored in the mode register 3212. In embodiments, the SIMD module 3211 may be referred to as a computation module. In embodiments, a wave may correspond to at least one of a vertex, a pixel, a primitive, or any other element processed by the GPU 203. When the wave is finished processing, a result of the processing may be exported to the shader export module 322.
As discussed above, a floating-point number format such as the FP32 format may allow a relatively large dynamic range to be represented with relatively high precision, but may also impose a relatively large processing and memory cost. In order to reduce this processing and memory cost, other floating-point number formats such as the BF16 format may be used. However, these floating-point number formats may result in reduced precision in comparison with the FP32 format, which may cause processing errors and corruption when used with some operations or instructions executed by a processing pipeline.
Therefore, embodiments are directed to a process for performing precision-modulated shading (PMS), which may enable a plurality of floating-point number formats to be used by a processing pipeline, for example the pipeline 204 including the shader module 321 discussed above.
For example, a binary number “101011.101” may be represented using scientific notation as “1.01011101*2^5”. To represent this example binary number using a floating-point number format, a sign bit S may be used to indicate that the number is positive, exponent bits E may be used to indicate the bits “101” to represent the exponent of “5”, and fraction bits F may be used to indicate the bits “01011101” to the right of the binary point (with an implicit bit of “1” to the left of the binary point) to represent the mantissa.
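As a non-limiting illustration, the following Python sketch splits the same example value (binary 101011.101, or decimal 43.625) into its FP32 sign, exponent, and fraction fields. The helper name fp32_fields is hypothetical. Note that IEEE 754 binary32 stores the exponent with a bias of 127, so the stored exponent field is 132 rather than the unbiased value 5 used in the simplified description above.

```python
import struct

def fp32_fields(value):
    """Split a number, rounded to FP32, into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 sign bit S
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits E, stored with bias 127
    fraction = bits & 0x7FFFFF        # 23 explicit fraction bits F
    return sign, exponent, fraction

sign, exponent, fraction = fp32_fields(43.625)  # binary 101011.101
print(sign, exponent - 127, format(fraction, "023b"))
# prints: 0 5 01011101000000000000000
```

The leading fraction bits “01011101” match the bits to the right of the binary point in the example, and the remaining fifteen fraction bits are zero.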
As shown in
As discussed above, embodiments may relate to a process for performing PMS, which may enable a pipeline, for example the pipeline 204, to perform operations using different floating-point number formats based on a precision level required for those operations. In embodiments, these other floating-point formats may have the same number of sign bits S and exponent bits E as the FP32 format, but may vary the number of fraction bits F used to represent the mantissa. For example, as shown in
Although embodiments are described herein as using floating-point number formats such as the BF16 format, the BF20 format, the BF24 format, the BF28 format, and the FP32 format, embodiments are not limited thereto, and embodiments may be applied to any other floating point format.
In embodiments, the pipeline 500 may correspond to operations performed by elements of the GPU 203, for example the pipeline 204 and the elements included therein. For example, the input assembly 501 and the tessellation 503 may be performed by the geometry module 2042, the rasterization may be performed by the rasterizing module 2043, the color blending may be performed by the texture module 2045, and the vertex shading 502, the geometry shading 504, and the fragment shading 506 may be performed by one or more shader modules 321.
According to embodiments, PMS may allow floating-point operations to be performed with different precision levels by distinguishing instructions which are sensitive to precision from instructions which are not sensitive to precision when the instructions are processed in the pipeline 204, for example by the shader module 321 performing functions such as vertex shading 502, geometry shading 504, and fragment shading 506. PMS may allow different precision modes to be selected for precision-sensitive instructions and instructions which are not precision-sensitive, thereby allowing the precision-sensitive instructions or instruction blocks to be executed using a precision mode corresponding to a higher precision level (e.g., a precision level corresponding to a relatively high-precision floating-point number format such as the FP32 format), and allowing the instructions or instruction blocks which are not precision-sensitive to be executed using a precision mode corresponding to a lower precision level (e.g., a precision level corresponding to a relatively low-precision floating-point number format such as the BF16 format).
In embodiments, by allowing the pipeline 204 to switch between different precision modes corresponding to different floating-point number formats, PMS may allow dynamic clipping or truncating of the number of fraction bits F used to represent the mantissa of FP32 operands in instructions executed by a shader module 321 based on a particular precision mode which is used by the shader module 321. As a result, embodiments may provide a pipeline 204 which is capable of performing operations using different floating-point number formats which have different precision levels, for example a range of precision levels between the BF16 and FP32 formats.
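As a non-limiting illustration, the truncation of fraction bits may be sketched in Python as a mask over the low bits of an FP32 word. The sketch assumes that a format named BFn keeps the upper n bits of the FP32 word (one sign bit, eight exponent bits, and n-9 fraction bits). The table MASKED_LSBS and the helper truncate_fp32 are hypothetical names used only for this example.

```python
import struct

# Assumption: BFn keeps the top n bits of an FP32 word, so 32 - n LSBs are masked.
MASKED_LSBS = {"FP32": 0, "BF28": 4, "BF24": 8, "BF20": 12, "BF16": 16}

def truncate_fp32(value, fmt):
    """Clear the low fraction bits of an FP32 value; sign and exponent are kept."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    mask = (0xFFFFFFFF << MASKED_LSBS[fmt]) & 0xFFFFFFFF
    (out,) = struct.unpack(">f", struct.pack(">I", bits & mask))
    return out

print(truncate_fp32(3.14159274, "BF16"))  # prints: 3.140625
```

Because only fraction bits are cleared, the magnitude of the truncated value never exceeds the magnitude of the original value, and the exponent, and therefore the dynamic range, is unchanged.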
In some embodiments, the precision mode may be selected based on a hint that is capable of being applied using programming languages at various levels, which may allow PMS to be utilized without modification of a user-level application.
In embodiments, the one or more shader modules 321 may be responsible for a significant portion of the power consumption of the GPU 203. By reducing the number of calculations performed by the one or more shader modules 321, embodiments may allow for a reduction in power consumption while adding only a small amount of overhead operations to switch precision modes according to PMS. Accordingly, embodiments may allow rendering quality to be maintained while reducing processing and memory requirements, thereby saving power through reduction of the overall amount of calculation.
In embodiments, this precision mode which is to be used by a particular shader module 321 to perform a particular operation may be determined or set based on a value stored in a mode register corresponding to the shader module 321. In embodiments, the mode register may be included in the shader module 321, however embodiments are not limited thereto. For example, in some embodiments the mode register may be included in the memory subsystem 205, or in another portion on the GPU 203 or the SOC 200. In embodiments, the mode register may indicate the number of LSB bits of an FP32 operand which are to be masked. In embodiments, the mode register value stored in the mode register may be read and written by the shader module 321 using scalar instructions.
In embodiments, when PMS is enabled, the mode register value may be set and updated per wave. For example, when an instruction or command to launch a new wave is received, the mode register corresponding to the wave may be initialized by setting the mode register value to a default value corresponding to a default precision mode. Then, based on receiving an instruction or command to change the precision mode for the wave to a new precision mode, the mode register value may be updated to indicate the new precision mode.
For example, in some embodiments the mode register value may be a 3-bit value. Accordingly, the default value may be “000”, which may indicate a precision mode corresponding to the FP32 format. After the mode register is initialized to store the default value, an instruction or command may be received to change a precision mode according to the operations to be performed in the wave. Based on the changed precision mode, the mode register value may be updated to “001”, which may indicate a precision mode corresponding to the BF28 format, “010”, which may indicate a precision mode corresponding to the BF24 format, “011”, which may indicate a precision mode corresponding to the BF20 format, or “100”, which may indicate a precision mode corresponding to the BF16 format. However, embodiments are not limited thereto, and in embodiments any mode register value may be used to represent any precision mode corresponding to any number format. In embodiments, the updated mode register value may be visible to instructions which are subsequent to the instruction or command for setting the mode register value in a program order, and may not be visible to instructions which precede the instruction or command for setting the mode register value in the program order.
In some embodiments, the command or instruction to update the mode register value may be a dedicated register setting instruction which is used only to change the mode register value. For example, the dedicated register setting instruction may include an explicit scalar value which may be stored in the mode register. As another example, the dedicated register setting instruction may indicate a scalar register which stores the mode register value, and based on the mode register instruction being received, the mode register value may be retrieved from the scalar register and stored in the mode register. This may allow programmatic determination of the precision mode. For example, a desired precision mode may depend on a result of a calculation. After the calculation is performed, the result of the calculation may be stored in the scalar register, and then retrieved to be stored in the mode register in order to set the appropriate precision mode. In embodiments, the dedicated register setting instruction may be referred to as a mode instruction.
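As a non-limiting illustration, the two forms of the dedicated register setting instruction (an explicit scalar value, or a scalar register holding the value) may be sketched as follows. The classes SetMode and ShaderState, their fields, and the number of scalar registers are hypothetical and are not taken from any particular instruction set architecture.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SetMode:
    immediate: Optional[int] = None  # explicit scalar value carried by the instruction
    sreg: Optional[int] = None       # index of a scalar register holding the value

class ShaderState:
    def __init__(self, num_sregs=8):
        self.sregs = [0] * num_sregs
        self.mode_register = 0

    def execute_set_mode(self, instr):
        # Take the value from the instruction itself, or from the named scalar register.
        if instr.immediate is not None:
            value = instr.immediate
        else:
            value = self.sregs[instr.sreg]
        self.mode_register = value & 0b111  # assumed 3-bit mode register

state = ShaderState()
state.sregs[3] = 0b100                   # e.g., the result of an earlier calculation
state.execute_set_mode(SetMode(sreg=3))
print(state.mode_register)               # prints: 4
```

The scalar-register form is what allows the precision mode to depend on a value computed at run time.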
In some embodiments, the command or instruction to update the mode register value may be a modified instruction which is sometimes used to change the mode register value, and is otherwise used to perform different operations. For example, a dedicated register setting instruction may not be executed until all instructions sent to the shader module 321 are completed, and therefore may require a set of dependency counters to reach zero before it can be executed. This may increase a latency of the shader module 321 and may decrease a locality of a cache used to store operands for the shader module 321, which may cause both power and performance issues. However, because only vector instructions may be affected by the precision mode, and other instructions such as scalar instructions and branch instructions may not depend on the precision mode, it may not be necessary to wait until all instructions sent to the shader module are completed. Accordingly, a different instruction included in a given instruction set architecture (ISA), which is not subject to the wait-state penalty of the dedicated register setting instruction, may be conditionally modified in order to allow it to be used to set the mode register value. For example, an unused bit in the modified instruction may be designated as a control bit. Based on the control bit having a first value, the modified instruction may be interpreted as the register setting instruction, and based on the control bit having a second value, the modified instruction may be interpreted as the original instruction before modification. For example, the original instruction may include a denorming instruction or a rounding instruction, but embodiments are not limited thereto.
In some embodiments, the control bit may be checked based on a determination of whether PMS is enabled or disabled. For example, a configuration file or configuration bits for the GPU 203 may include a chicken bit corresponding to PMS. In embodiments, the chicken bit may be included in a configuration file corresponding to the GPU 203. Based on the chicken bit having a first value, PMS may be enabled, and the GPU 203 may check the control bit before executing the modified instruction. Based on the chicken bit having a second value, the control bit of the modified instruction may be ignored.
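As a non-limiting illustration, the interaction between the chicken bit and the control bit may be sketched as a small decode function. The position of the control bit, the 3-bit mode payload, and the function name decode are assumptions made only for this example.

```python
CONTROL_BIT = 1 << 31   # hypothetical position of the otherwise unused bit
WORD_MASK = 0xFFFFFFFF

def decode(word, pms_enabled):
    """Return ('set_mode', value) or ('original', payload) for an instruction word."""
    if pms_enabled and (word & CONTROL_BIT):
        # PMS enabled and control bit set: interpret as the register setting instruction.
        return ("set_mode", word & 0b111)
    # PMS disabled (control bit ignored) or control bit clear: original instruction.
    return ("original", word & ~CONTROL_BIT & WORD_MASK)

print(decode(CONTROL_BIT | 0b100, pms_enabled=True))   # prints: ('set_mode', 4)
print(decode(CONTROL_BIT | 0b100, pms_enabled=False))  # prints: ('original', 4)
```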
According to embodiments, the behavior of the shader module 321 may be modified in other ways based on the mode register value stored in the mode register. For example, when the mode register value is set to any value other than the default value, the floating-point rounding mode may be assumed to be a round-to-zero mode, and denormals may always be flushed to zero. In embodiments, operations such as fetching and exporting operands, for example reading and writing operands to and from an operand cache, may be performed at a normal level specified by the instruction, and may not be impacted or modified based on the precision mode.
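As a non-limiting illustration, flushing denormal FP32 inputs to zero when the mode register holds a non-default value may be sketched as follows. The sketch operates directly on 32-bit patterns; the constant DEFAULT_MODE and the function name flush_denormal_input are hypothetical.

```python
DEFAULT_MODE = 0b000  # assumed default (full-precision) mode register value

def flush_denormal_input(bits, mode):
    """Return the FP32 bit pattern, replacing a denormal with a same-signed zero
    whenever the mode register holds a non-default value."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    is_denormal = exponent == 0 and fraction != 0
    if mode != DEFAULT_MODE and is_denormal:
        return bits & 0x80000000  # keep only the sign bit
    return bits

print(hex(flush_denormal_input(0x80000001, 0b100)))  # prints: 0x80000000
```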
In some embodiments, when the precision mode is any mode other than the default mode, operands may be clipped or truncated and rounded to zero according to the precision mode upon insertion into the pipeline 204. However, embodiments are not limited thereto, and in some embodiments the truncating may be applied elsewhere. The truncating may be applied to all FP32 operands, regardless of the source of the operands (e.g. read from cache, destination buffer direct forwards, etc.). The truncating may be applied before input exception checking is performed in order to ensure functionally correct behavior. The truncated operands may be rounded to zero after input denormal exception checking, but before other exception checking, for example other IEEE exception checking.
In some embodiments, the truncating may be applied only to inputs of the shader module 321, and not to outputs of the shader module 321. For example, for an operation corresponding to the expression a×b=c, the values of a and b may be inputs to the shader module 321, and the value of c may be an output of the shader module 321. Accordingly, the values of a and b may be truncated based on the precision mode, and the value of c which is output by the shader module 321 may not be truncated. The results of operations (e.g., the value of c) may be zero-padded in order to isolate changes within the pipeline 204.
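As a non-limiting illustration, the a×b=c example may be sketched in Python with truncation applied to the inputs only. The helper names are hypothetical. As a simplification, the result is rounded to FP32 using the round-to-nearest behavior of Python's struct module rather than a round-to-zero mode.

```python
import struct

def _to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def _from_bits(bits):
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def truncate_input(x, masked_lsbs):
    """Clear the given number of low fraction bits of an FP32 input operand."""
    mask = (0xFFFFFFFF << masked_lsbs) & 0xFFFFFFFF
    return _from_bits(_to_bits(x) & mask)

def pms_multiply(a, b, masked_lsbs):
    a_t = truncate_input(a, masked_lsbs)   # inputs are truncated
    b_t = truncate_input(b, masked_lsbs)
    c = a_t * b_t                          # exact in double precision
    return _from_bits(_to_bits(c))         # output is a full FP32 value, not truncated

print(pms_multiply(1.00390625, 3.0, 16))   # prints: 3.0
print(pms_multiply(1.5, 1.0078125, 16))    # prints: 1.51171875
```

In the second call both inputs survive truncation unchanged, and the output retains a fraction bit that a 16-bit format could not hold, which shows that only the inputs are truncated.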
In some embodiments, the mode register value may be propagated along with the instructions within the pipeline 204, so that instructions with different precision modes may exist in different stages of the pipeline 204 concurrently.
In some embodiments, one or more instructions may be excluded from PMS processing because changing the precision mode may impact the output of those instructions, or because changing the precision mode would not result in sufficient power savings. For example, certain instructions may be included in a whitelist, which may indicate that PMS is to be applied. Based on determining that an instruction is not included in the whitelist, the GPU 203 may execute the instruction without applying different precision modes according to PMS.
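As a non-limiting illustration, the whitelist check may be sketched as a lookup that decides whether the current precision mode applies to a given instruction. The opcode names and the function name effective_masked_lsbs are hypothetical.

```python
# Hypothetical opcode names; a real whitelist would be specific to the ISA.
PMS_WHITELIST = {"vmul", "vadd", "vfma"}

def effective_masked_lsbs(opcode, mode_masked_lsbs):
    """Return the number of operand LSBs to mask for this instruction."""
    if opcode in PMS_WHITELIST:
        return mode_masked_lsbs   # PMS applies: follow the current precision mode
    return 0                      # not whitelisted: execute with full FP32 operands

print(effective_masked_lsbs("vmul", 16))  # prints: 16
print(effective_masked_lsbs("vrcp", 16))  # prints: 0
```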
In some embodiments, implementing PMS may result in changes to exception behavior of the GPU 203, for example IEEE numerical exceptions. In embodiments, some IEEE numerical exceptions may be affected while other IEEE numerical exceptions may not be affected. For example, because denormals may be flushed to zero when the precision mode is any mode other than the default mode, it is not possible to receive more underflow exceptions. However, it is possible to receive new overflow exceptions from instructions for which the truncated operands produce larger results than the original non-truncated operands. Further, implementing PMS may result in more input denormal exceptions, and because more denormal results may be produced, implementing PMS may result in a higher input denormal count. Additional divide-by-zero exceptions should not occur due to truncating operands, so implementing PMS may not cause changes to divide-by-zero exceptions. However, truncating operands according to PMS may result in additional inexact exceptions, and inexact exceptions that may be raised earlier in an instruction when executed based on the non-truncated operand may not occur when the instruction is executed based on the truncated operand. For example, an instruction may produce an output of (m/(n+e)), which is exact, based on non-truncated operands m and (n+e). However, when the operands are truncated according to a particular precision mode, the truncated operands may be m and n. Therefore, the instruction may produce an output of (m/n) based on the truncated operands, which may be inexact. Implementing PMS may produce no changes to invalid exceptions. In addition, because integer operations may not be affected, implementing PMS may cause no changes to integer divide-by-zero exceptions.
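As a non-limiting illustration of a new overflow exception, the following Python sketch divides by a divisor whose truncated value is smaller than its original value, so that the quotient grows past the largest finite FP32 value. The helper names are hypothetical, and the overflow test is a simplified magnitude comparison that ignores rounding at the boundary.

```python
import struct

FP32_MAX = (2 - 2**-23) * 2.0**127   # largest finite FP32 value

def truncate(x, masked_lsbs):
    """Clear the given number of low fraction bits of an FP32 value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    mask = (0xFFFFFFFF << masked_lsbs) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]

def overflows_fp32(x):
    # Simplified check: magnitude beyond the largest finite FP32 value.
    return abs(x) > FP32_MAX

m = 2.0**127
n = 0.5 + 2.0**-10                           # exactly representable in FP32
print(overflows_fp32(m / n))                 # prints: False
print(overflows_fp32(m / truncate(n, 16)))   # prints: True
```

Masking sixteen LSBs reduces the divisor to 0.5, so the quotient becomes 2^128, which exceeds the FP32 range even though the quotient of the original operands does not.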
According to embodiments, when the PMS is enabled, the GPU 203 may perform PMS by applying different precision modes to various instructions. The GPU 203 may determine whether to apply different precision modes to an instruction based on a heuristic.
As shown in
As further shown in
As further shown in
As further shown in
As further shown in
As further shown in
As shown in
For example, the predetermined shading instruction may be a shading instruction which is known to be precision-sensitive. This may mean that there is a high likelihood that the shading instruction will produce incorrect or corrupted results if operands of the shading instruction are truncated based on the default precision mode. Therefore, based on determining that the shading instruction is a precision-sensitive instruction, the modified precision mode (e.g., the high precision mode) is set for the instruction.
As shown in
As further shown in
As further shown in
As further shown in
As further shown in
As further shown in operation S640, based on determining that the operand is not a vector operand (N at operation S635), the process 600C may proceed to operation S640, and determine whether the last operand has been reached. Based on determining that the last operand has been reached (Y at operation S640), the process 600C may proceed to operation S638. Based on determining that the last operand has not been reached, the process 600C may return to operation S634. In some embodiments, if the last operand at a present depth has been reached, the process 600C may include decrementing the depth counter before returning to operation S634, so that all operands at each depth may be evaluated.
Therefore, the process 600C may be used to evaluate additional instructions in an upward direction along a use-definition chain of shading instructions, in order to ensure that an appropriate precision level is maintained in operands as they are provided or produced in a downward direction along the use-definition chain.
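As a non-limiting illustration, an upward walk along a use-definition chain with a depth limit may be sketched as follows. This is a simplified sketch and not a reproduction of the process 600C: the Instr structure, the recursion in place of an explicit depth counter, and the max_depth parameter are all assumptions made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    operands: list = field(default_factory=list)  # instructions that define each operand
    is_vector: bool = True                        # False for scalar-producing instructions
    high_precision: bool = False

def mark_upward(instr, max_depth):
    """Mark instr, and the definers of its vector operands up to max_depth
    levels above it, as requiring the higher precision mode."""
    instr.high_precision = True
    if max_depth == 0:
        return
    for definer in instr.operands:
        if definer.is_vector:          # operands that are not vector operands are skipped
            mark_upward(definer, max_depth - 1)

a = Instr("a")
s = Instr("s", is_vector=False)
b = Instr("b", [a])
c = Instr("c", [b, s])
d = Instr("d", [c])                    # d is the precision-sensitive instruction
mark_upward(d, max_depth=2)
print([i.name for i in (a, b, c, d, s) if i.high_precision])
# prints: ['b', 'c', 'd']
```

With a depth limit of two, the two levels of definitions above the precision-sensitive instruction are marked, while the scalar operand and the more distant definition are left in the default mode.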
According to embodiments, the heuristic described above with respect to
As shown in
As further shown in
As further shown in
The truncating of the floating-point operand may include truncating a mantissa of the floating-point operand while maintaining an exponent of the floating-point operand. In embodiments, a precision of the truncated floating-point operand may be lower than a precision of the floating-point operand, and a dynamic range of the truncated floating-point operand may be the same as a dynamic range of the floating-point operand.
In embodiments, the process 800 may further include determining whether the shading instruction is included in a whitelist, wherein based on determining that the shading instruction is not included in the whitelist, the shading instruction may be processed using the floating-point operand without determining the precision mode.
In embodiments, the determining of the precision mode may include determining whether the PMS is enabled based on a value of a configuration bit included in a configuration file corresponding to the GPU, wherein based on the PMS being enabled, the plurality of precision modes may be used to process the shading instructions, and wherein based on the PMS not being enabled, the plurality of precision modes may not be used to process the shading instructions.
In embodiments, the process 800 may further include receiving a mode instruction for setting a new mode; and changing the value stored in the mode register to a new value corresponding to the new mode, wherein the new value may be visible to instructions subsequent to the mode instruction in a program order, and may not be visible to instructions prior to the mode instruction in the program order. In embodiments, the mode instruction may correspond to at least one of the register setting instruction and the modified instruction discussed above.
In embodiments, the mode instruction may include a control bit and instruction bits, wherein based on the PMS being enabled and the control bit being set, the instruction bits may be interpreted as the mode instruction, wherein based on the PMS being enabled and the control bit not being set, the instruction bits may be interpreted as a different instruction, and wherein based on the PMS not being enabled, the control bit may be ignored, and the instruction bits may be interpreted as the different instruction. In embodiments, the mode instruction may correspond to the modified instruction discussed above, and the different instruction may correspond to the original instruction discussed above.
Although
While various example embodiments have been particularly shown and described with reference to the drawings, it will be understood that various changes in form and details may be made therein without departing from the spirit and scope of the following claims.
This application is based on and claims priority under 35 U.S.C. § 119 to U.S. Provisional Patent Application No. 63/526,067, filed on Jul. 11, 2023, in the United States Patent and Trademark Office, the disclosure of which is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
63526067 | Jul 2023 | US