This application claims priority to China Application Serial Number 202210533717.7, filed on May 16, 2022, which is incorporated by reference in its entirety.
The present application relates to a processor, and particularly to a GPU and a method of the same.
When a GPU executes kernel code, the code is executed on a streaming processor (SP) with a warp as the unit. When a warp is scheduled to an SP, it takes up space in the registers on the SP. In other words, the limited register space is one of the bottlenecks on the number of warps that can be scheduled to the SP, which is an urgent issue to be addressed in the related field.
One purpose of the present disclosure is to disclose a GPU and a method of the same to address the above-mentioned issues.
One embodiment of the present disclosure discloses a GPU, configured to execute a kernel code, wherein the kernel code includes a thread block (TB), and the TB includes a plurality of warps. The GPU includes: a plurality of streaming multiprocessors (SMs), wherein each SM includes a plurality of streaming processors (SPs), wherein each of the plurality of SPs includes a register, each of the SPs has a predetermined upper bound of warp number, and the register has a predetermined upper bound of register capacity; and a global dispatcher, including: a register occupancy status table, configured to record a warp number and an occupancy status of the register of each SP of each SM; a TB dispatch module, configured to dispatch the TB to a first SM of the plurality of SMs according to a warp type classification table and the register occupancy status table; and a warp dispatch module, configured to dispatch the plurality of warps to the plurality of SPs of the first SM according to the warp type classification table and the register occupancy status table.
One embodiment of the present disclosure discloses a method, including: receiving a kernel code, wherein the kernel code includes a TB, and the TB includes a plurality of warps; classifying the plurality of warps into a plurality of different types according to a function of the plurality of warps; analyzing a register space required by each type of the warp when being executed; and recording in a warp type classification table the types of the plurality of warps and the register space required by the plurality of warps when being executed.
The GPU and the method of the same disclosed in the present application can optimize the space usage of the register in the SP and thus increase the performance of the GPU.
The following disclosure provides many different embodiments or examples for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. As could be appreciated, these are merely examples and are not intended to be limiting. For example, the formation of a first feature over or on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact and may also include embodiments in which additional features may be formed between the first and second features, such that the first and second features may not be in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various embodiments. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Further, spatially relative terms, such as “beneath,” “below,” “lower,” “above,” “upper,” and the like, may be used herein for ease of description to discuss one element or feature's relationship to another element(s) or feature(s) as illustrated in the drawings. These spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the drawings. The apparatus may be otherwise oriented (e.g., rotated by 90 degrees or at other orientations), and the spatially relative descriptors used herein may likewise be interpreted accordingly.
Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors necessarily resulting from the standard deviation found in the respective testing measurements. Also, as used herein, the term “the same” generally means within 10%, 5%, 1%, or 0.5% of a given value or range. Alternatively, the term “the same” means within an acceptable standard error of the mean when considered by one of ordinary skill in the art. As could be appreciated, other than in the operating/working examples, or unless otherwise expressly specified, all of the numerical ranges, amounts, values, and percentages (such as those for quantities of materials, durations of time, temperatures, operating conditions, portions of amounts, and the like) disclosed herein should be understood as modified in all instances by the term “the same.” Accordingly, unless indicated to the contrary, the numerical parameters set forth in the present disclosure and attached claims are approximations that can vary as desired. At the very least, each numerical parameter should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Here, ranges can be expressed as from one endpoint to another endpoint or between two endpoints. All ranges disclosed herein are inclusive of the endpoints, unless specified otherwise.
Generally speaking, when a GPU dispatches a warp to a streaming processor (SP), it allocates the same size of register space of the SP to each warp, and this space remains occupied until the warp has been executed by the SP. However, since different warps may perform different functions, the actual register space required by each warp may differ. To allocate the same register space to all warps equally, the size of this space must be large enough to accommodate the warp that requires the most register space, resulting in low efficiency in register allocation and making register space a bottleneck in the number of warps that can be scheduled to the SP.
Then in Step 206, the compiler 104 further analyzes the maximum register space required when each type of warp is executed by the SP; for example, for computation type warps, the compiler 104 determines that a maximum of 192 bytes is required, whereas for memory type warps, the compiler 104 determines that only a maximum of 64 bytes is required. In Step 208, the compiler 104 records, in the warp type classification table 106, the types of the plurality of warps W0˜W14 and the register space required when the warps are executed.
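For illustration only, a minimal sketch, in Python, of how such a warp type classification table could be recorded is given below. The function name, the data layout and the classification rule are hypothetical and do not limit the present disclosure; the per-type maxima (192 bytes and 64 bytes) follow the example above.

```python
# Minimal sketch (hypothetical names and layout): how a compiler pass might
# record the warp type classification table 106 for warps W0~W14 of TB0.
def build_warp_type_classification_table(warp_functions):
    """warp_functions maps a warp id to its dominant function; the returned
    table records, per warp, its type and the maximum register space (in
    bytes) needed while that warp executes on an SP."""
    # Per-type maxima from the example above: computation warps need at most
    # 192 bytes, memory warps at most 64 bytes.
    max_register_bytes = {"computation": 192, "memory": 64}
    return {warp_id: {"type": fn, "register_bytes": max_register_bytes[fn]}
            for warp_id, fn in warp_functions.items()}

# Example: warps W0~W14 alternating between the two types.
warp_functions = {f"W{i}": ("computation" if i % 2 == 0 else "memory")
                  for i in range(15)}
table_106 = build_warp_type_classification_table(warp_functions)
print(table_106["W0"], table_106["W1"])
```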
The warp type classification table 106 will be sent to the GPU 108, so that the GPU 108 can dispatch the plurality of warps W0˜W14 of the thread block TB0 according to the warp type classification table 106. More specifically, in certain embodiments, the warp type classification table 106 is added to a kernel launch command; when the GPU 108 receives the kernel launch command, it also receives the warp type classification table 106 at the same time.
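The format of the kernel launch command is not fixed by the present disclosure; the following is a purely hypothetical sketch of a launch command that carries the warp type classification table 106 alongside the usual launch information (every field name and value is a placeholder).

```python
# Hypothetical shape of a kernel launch command that carries the warp type
# classification table 106; every field name and value here is a placeholder.
kernel_launch_command = {
    "kernel_code_address": 0x1000,        # where the kernel code 102 resides in global memory
    "thread_block": "TB0",
    "warp_type_classification_table": {   # produced by the compiler 104
        "W0": {"type": "computation", "register_bytes": 192},
        "W1": {"type": "memory", "register_bytes": 64},
        # ... entries for the remaining warps W2~W14
    },
}
```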
As discussed above, the GPU 108 will read the kernel code 102 from a global memory outside of the GPU 108 according to the kernel launch command to obtain the thread block TB0, and the global dispatcher 110 of the GPU 108 dispatches the thread block TB0 to one of a plurality of streaming multiprocessors (SMs) SM0, SM1, . . . (for example, to the streaming multiprocessor SM0) according to the warp type classification table 106, and dispatches the plurality of warps W0˜W14 of the thread block TB0 to the plurality of streaming processors SP0, SP1, . . . of the streaming multiprocessor SM0. The local dispatcher 122 of the streaming multiprocessor SM0 then allocates the plurality of warps W0˜W14 to the registers 124 of the plurality of streaming processors SP0, SP1, . . . according to the dispatch of the global dispatcher 110.
Each of the plurality of streaming multiprocessors SM0, SM1, . . . has the plurality of streaming processors SP0, SP1, . . . , and each of the SPs has a register 124. Each SP of each SM has a predetermined upper bound of warp number, i.e., the number of warps that can be dispatched to each SP of each SM is limited to the predetermined upper bound of warp number; further, the register 124 of each SP of each SM has a predetermined upper bound of register capacity. In the present embodiment, the predetermined upper bound of warp number of each SP of each SM is the same, and the predetermined upper bound of register capacity of the register 124 of each SP of each SM is the same; however, the present disclosure is not limited thereto.
Specifically, the global dispatcher 110 includes a register occupancy status table 112, a TB dispatch module 114 and a warp dispatch module 116. The register occupancy status table 112 is configured to record the warp number already dispatched to each SP of each SM and the occupancy status of the register.
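One possible in-memory layout of the register occupancy status table 112 is sketched below. The field names, the occupancy figures and the upper bounds (8 warps and 16 register units per SP) are assumed purely for illustration; with these assumptions, the entry for the streaming processor SP0 of the streaming multiprocessor SM0 yields the 4 remaining acceptable warps and 5 free register units used in the example that follows.

```python
# Hypothetical layout of the register occupancy status table 112: one entry
# per SP of each SM, recording how many warps have been dispatched to that SP
# and how many register units of its register 124 are already occupied.
register_occupancy_status_table = {
    "SM0": {"SP0": {"warps_dispatched": 4, "register_units_used": 11},
            "SP1": {"warps_dispatched": 2, "register_units_used": 8}},
    # ... one entry per SP of every SM
}

def remaining_capacity(entry, max_warps=8, max_register_units=16):
    """Derive one SP's remaining acceptable warps and free register units
    from its table entry and the (assumed) predetermined upper bounds."""
    return (max_warps - entry["warps_dispatched"],
            max_register_units - entry["register_units_used"])

print(remaining_capacity(register_occupancy_status_table["SM0"]["SP0"]))  # (4, 5)
```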
The TB dispatch module 114 will obtain the remaining available register space of each of the streaming multiprocessors SM0˜SM1 according to the register occupancy status table 112 and the predetermined upper bound of register capacity.
The TB dispatch module 114 further obtains the remaining number of acceptable warps of each of the streaming multiprocessors SM0˜SM1 according to the register occupancy status table 112 and the predetermined upper bound of warp number.
The TB dispatch module 114 further calculates the sum of the register space required by the thread block TB0 and the total number of warps of the thread block TB0 according to the warp type classification table 106.
In this way, the TB dispatch module 114 can determine whether any SM of the plurality of SMs SM0˜SM1 meets a first condition according to the sum of the required register space of the thread block TB0, the total number of warps of the thread block TB0, and the remaining available register space and the remaining number of acceptable warps of each of the plurality of SMs SM0˜SM1. For an SM to meet the first condition, its remaining available register space cannot be less than the sum of the required register space of the thread block TB0, and its remaining number of acceptable warps cannot be less than the total number of warps of the thread block TB0.
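Expressed as a predicate, the first condition compares these two SM-level quantities against the two sums calculated for the thread block; the sketch below uses illustrative variable names and arbitrary example values.

```python
# Sketch of the first condition, evaluated for one SM (illustrative names).
def meets_first_condition(sm_free_register_units, sm_free_warp_slots,
                          tb_register_units, tb_warp_count):
    """An SM qualifies when its remaining register space is not less than the
    total register space required by the thread block, and its remaining
    number of acceptable warps is not less than the number of warps in it."""
    return (sm_free_register_units >= tb_register_units
            and sm_free_warp_slots >= tb_warp_count)

# Arbitrary example values: an SM with 12 free register units and 6 free warp
# slots can accept a thread block needing 10 units and 5 warp slots.
print(meets_first_condition(12, 6, 10, 5))  # True
```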
The TB dispatch module 114 further determines whether any SM of the plurality of SMs SM0˜SM1 meets a second condition according to the number of type I warps in the warp type classification table 106, the units of register space required when a type I warp is executed, and the register occupancy status table 112. Details of the second condition are discussed below.
For the streaming multiprocessor SM0, it can be known from the register occupancy status table 112 that the remaining number of acceptable warps of the streaming processor SP0 is 4, and the register of the streaming processor SP0 has 5 units of space available. Therefore, from the perspective of the remaining number of acceptable warps, the streaming processor SP0 can accept 4 more type I warps; whereas from the perspective of the register space, the streaming processor SP0 can only accept 1 type I warp (because a type I warp requires 3 units of register space). So, taken together, at most 1 type I warp can be dispatched to the streaming processor SP0. Similarly, at most 2 type I warps can be dispatched to the streaming processor SP1; at most 3 type I warps can be dispatched to the streaming processor SP2; and at most 2 type I warps can be dispatched to the streaming processor SP3. It can be seen from the above that the streaming multiprocessor SM0 can further accept at most 8 (i.e., 1+2+3+2) type I warps. Since the number of type I warps that the streaming multiprocessor SM0 can accept at most (i.e., 8) is greater than the number of type I warps of the thread block TB0 recorded in the warp type classification table 106 (i.e., 3), the streaming multiprocessor SM0 meets the second condition.
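The arithmetic in this example generalizes to any warp type: each SP can take the smaller of its remaining number of acceptable warps and the number of warps of that type its free register space can still hold, and the SM-level capacity is the sum over its SPs. A sketch follows; the figures for the streaming processor SP0 are taken from the text, whereas those for SP1˜SP3 are inferred for illustration so that the totals match the worked example.

```python
# Per-type capacity calculation behind the second, third and fourth conditions.
def max_warps_of_type(free_warp_slots, free_register_units, units_per_warp):
    """One SP accepts at most the smaller of (a) its remaining warp slots and
    (b) the number of warps of this type its free register space can hold."""
    return min(free_warp_slots, free_register_units // units_per_warp)

def sm_capacity_for_type(sps, units_per_warp):
    """SM-level capacity for one warp type: the sum over the SPs of the SM."""
    return sum(max_warps_of_type(w, u, units_per_warp) for w, u in sps)

# (remaining warp slots, free register units) for SP0~SP3 of SM0.
sm0 = [(4, 5), (6, 8), (3, 9), (8, 8)]
print(sm_capacity_for_type(sm0, 3))  # type I,   3 units per warp -> 8
print(sm_capacity_for_type(sm0, 1))  # type II,  1 unit per warp  -> 21
print(sm_capacity_for_type(sm0, 2))  # type III, 2 units per warp -> 13
```

The SM then meets the second (respectively third, fourth) condition when this capacity is not less than the number of type I (respectively type II, type III) warps of the thread block TB0 recorded in the warp type classification table 106, which is also how the third and fourth conditions below are evaluated.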
The TB dispatch module 114 further determines whether any SM of the plurality of SMs SM0˜SM1 meets a third condition according to the number of type II warps in the warp type classification table 106, the units of register space required when a type II warp is executed, and the register occupancy status table 112. Details of the third condition are discussed below.
For the streaming multiprocessor SM0, it can be known from the register occupancy status table 112 that the remaining number of acceptable warps of the streaming processor SP0 is 4, and the register of the streaming processor SP0 has 5 units of space available. Therefore, from the perspective of the remaining number of acceptable warps, the streaming processor SP0 can accept 4 more type II warps; whereas from the perspective of the register space, the streaming processor SP0 can accept 5 type II warps (because a type II warp requires only 1 unit of register space). So, taken together, at most 4 type II warps can be dispatched to the streaming processor SP0. Similarly, at most 6 type II warps can be dispatched to the streaming processor SP1; at most 3 type II warps can be dispatched to the streaming processor SP2; and at most 8 type II warps can be dispatched to the streaming processor SP3. It can be seen from the above that the streaming multiprocessor SM0 can further accept at most 21 (i.e., 4+6+3+8) type II warps. Since the number of type II warps that the streaming multiprocessor SM0 can accept at most (i.e., 21) is greater than the number of type II warps of the thread block TB0 recorded in the warp type classification table 106 (i.e., 8), the streaming multiprocessor SM0 meets the third condition.
The TB dispatch module 114 further determines whether any SM of the plurality of SMs SM0˜SM1 meets a fourth condition according to the number of type III warps in the warp type classification table 106, the units of register space required when a type III warp is executed, and the register occupancy status table 112. Details of the fourth condition are discussed below.
For the streaming multiprocessor SM0, it can be known from the register occupancy status table 112 that the remaining number of acceptable warps of the streaming processor SP0 is 4, and the register of the streaming processor SP0 has 5 units of space available. Therefore, from the perspective of the remaining number of acceptable warps, the streaming processor SP0 can accept 4 more type III warps; whereas from the perspective of the register space, the streaming processor SP0 can only accept 2 type III warps (because a type III warp requires 2 units of register space). So, taken together, at most 2 type III warps can be dispatched to the streaming processor SP0. Similarly, at most 4 type III warps can be dispatched to the streaming processor SP1; at most 3 type III warps can be dispatched to the streaming processor SP2; and at most 4 type III warps can be dispatched to the streaming processor SP3. It can be seen from the above that the streaming multiprocessor SM0 can further accept at most 13 (i.e., 2+4+3+4) type III warps. Since the number of type III warps that the streaming multiprocessor SM0 can accept at most (i.e., 13) is greater than the number of type III warps of the thread block TB0 recorded in the warp type classification table 106 (i.e., 5), the streaming multiprocessor SM0 meets the fourth condition. Approaches for determining whether the streaming multiprocessor SM1 meets the second condition, the third condition and the fourth condition are similar, and hence are not repeated here for the sake of brevity.
In the present embodiment, the TB dispatch module 114, according to round-robin scheduling, sequentially determines whether the streaming multiprocessor SM0 meets the first condition, the second condition, the third condition and the fourth condition; if all conditions are met, the TB dispatch module 114 can directly dispatch the thread block TB0 to the streaming multiprocessor SM0. If the streaming multiprocessor SM0 does not meet all of the first condition, the second condition, the third condition and the fourth condition, the TB dispatch module 114 continues to determine whether the streaming multiprocessor SM1 meets the first condition, the second condition, the third condition and the fourth condition, until it finds a streaming multiprocessor that meets all of the first condition, the second condition, the third condition and the fourth condition, or until all the streaming multiprocessors have been checked. In certain embodiments, it is also feasible to find all the streaming multiprocessors that meet the first condition, the second condition, the third condition and the fourth condition, and then choose an appropriate streaming multiprocessor among them to accept the thread block TB0.
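The round-robin scan over the streaming multiprocessors can be pictured as follows; this is an illustrative sketch in which the condition check is passed in as a helper that evaluates the first through fourth conditions with the warp type classification table 106 and the register occupancy status table 112.

```python
# Illustrative round-robin scan by the TB dispatch module 114: the thread
# block is dispatched to the first SM that meets all four conditions, if any.
def select_sm_for_thread_block(sm_ids, meets_all_conditions):
    """meets_all_conditions(sm_id) is expected to evaluate the first through
    fourth conditions for one SM; the scan stops at the first SM that passes,
    or returns None once every SM has been checked."""
    for sm_id in sm_ids:                  # round-robin order: SM0, SM1, ...
        if meets_all_conditions(sm_id):
            return sm_id
    return None

# Example with a stand-in predicate under which only SM1 qualifies.
print(select_sm_for_thread_block(["SM0", "SM1"], lambda sm: sm == "SM1"))  # SM1
```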
Assuming that the TB dispatch module 114 determines to dispatch the thread block TB0 to the streaming multiprocessor SM0, it will notify the streaming multiprocessor SM0 and the warp dispatch module 116. The warp dispatch module 116 will dispatch the warps W0˜W14 of the thread block TB0 to the streaming processors SP0˜SP3 of the streaming multiprocessor SM0 according to the warp type classification table 106 and the register occupancy status table 112. Specifically, the warp dispatch module 116 can use round-robin scheduling, and dispatch the warps W0˜W14 to the streaming processors SP0˜SP3 of the streaming multiprocessor SM0 according to the warp type classification table 106 and the remaining available register space and the remaining number of acceptable warps of each of the SPs SP0˜SP3 of the streaming multiprocessor SM0.
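A simplified sketch of such a dispatch follows: warps are offered to the SPs in round-robin order, and an SP takes a warp only if it still has a free warp slot and enough free register units for that warp's type. The data layout and names are hypothetical, and the sketch omits handling that a complete dispatcher would need.

```python
# Illustrative round-robin dispatch of warps to the SPs of the chosen SM.
def dispatch_warps(warps, sps):
    """warps: list of (warp_id, register_units_needed) from the warp type
    classification table. sps: dict sp_id -> {"free_warp_slots": int,
    "free_register_units": int} derived from the register occupancy status
    table. Returns warp_id -> sp_id for every warp that could be placed."""
    placement = {}
    sp_ids = list(sps)
    cursor = 0
    for warp_id, units in warps:
        for _ in range(len(sp_ids)):      # try each SP at most once, round-robin
            sp_id = sp_ids[cursor]
            cursor = (cursor + 1) % len(sp_ids)
            sp = sps[sp_id]
            if sp["free_warp_slots"] >= 1 and sp["free_register_units"] >= units:
                sp["free_warp_slots"] -= 1
                sp["free_register_units"] -= units
                placement[warp_id] = sp_id
                break
    return placement

# Example: two SPs, three warps needing 3, 1 and 2 register units respectively.
sps = {"SP0": {"free_warp_slots": 4, "free_register_units": 5},
       "SP1": {"free_warp_slots": 2, "free_register_units": 4}}
print(dispatch_warps([("W0", 3), ("W1", 1), ("W2", 2)], sps))
```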
The warp dispatch module 116 informs the local dispatcher 122 of the streaming multiprocessor SM0 about the dispatching results, so that the local dispatcher 122 of the streaming multiprocessor SM0 accordingly assigns the warps W0˜W14 to the respective registers 124 of the streaming processors SP0˜SP3 to complete the dispatching of the thread block TB0. Specifically, the local dispatcher 122 may calculate the corresponding register base addresses based on the warp type classification table 106.
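The addressing scheme is not specified by the present disclosure; one illustrative possibility is to lay the warps assigned to one SP out back to back, each warp advancing the base address by the register space of its own type, so that differently sized warps pack without gaps.

```python
# Illustrative base-address calculation for the warps assigned to one SP:
# each warp's base is the running total of the register space of the warps
# placed before it, looked up from the warp type classification table.
def register_base_addresses(assigned_warps):
    """assigned_warps: list of (warp_id, register_bytes) in dispatch order.
    Returns warp_id -> base offset (in bytes) within the SP's register 124."""
    bases, offset = {}, 0
    for warp_id, register_bytes in assigned_warps:
        bases[warp_id] = offset
        offset += register_bytes
    return bases

# Example: a computation warp (192 bytes) followed by two memory warps (64 bytes each).
print(register_base_addresses([("W0", 192), ("W1", 64), ("W3", 64)]))
# {'W0': 0, 'W1': 192, 'W3': 256}
```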
The GPU and related methods of the present disclosure can allocate different sizes of register space in the streaming processor to different types of warps, according to the types of the warps, when dispatching the warps to the streaming processor, thereby improving the efficiency of register allocation and making the dispatching of warps more flexible and space-efficient.
The foregoing outlines features of several embodiments of the present application so that persons having ordinary skill in the art may better understand the various aspects of the present disclosure. Persons having ordinary skill in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Persons having ordinary skill in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
Number | Date | Country | Kind
---|---|---|---
202210533717.7 | May 2022 | CN | national