The disclosed embodiments relate to the field of heterogeneous computing systems employing different types of processing units (e.g., central processing units, graphics processing units, digital signal processors, or various types of accelerators) having a common memory address space (both physical and virtual). More specifically, the disclosed embodiments relate to reducing or avoiding cold translation lookaside buffer (TLB) misses in such computing systems when a task is offloaded from one processor type to another.
Heterogeneous computing systems typically employ different types of processing units. For example, a heterogeneous computing system may use both central processing units (CPUs) and graphics processing units (GPUs) that share a common memory address space (both physical memory address space and virtual memory address space). In general-purpose computing using GPUs (GPGPU computing), a GPU is utilized to perform some work or task traditionally executed by a CPU. The CPU hands off or offloads a task to a GPU, which in turn executes the task and provides the CPU with a result, data or other information, either directly or by storing the information where the CPU can retrieve it when needed.
While the CPUs and GPUs often share a common memory address space, it is common for these different types of processing units to have independent address translation mechanisms or hierarchies that may be optimized to the particular type of processing unit. That is, contemporary processing devices typically utilize a virtual addressing scheme to address memory space. Accordingly, a translation lookaside buffer (TLB) may be used to translate virtual addresses into physical addresses so that the processing unit can locate instructions to execute and/or data to process. In the event of a task hand-off, it is likely that the translation information needed to complete the offloaded task will be missing from the TLB of the other processor type, resulting in a cold (initial) TLB miss. To recover from a TLB miss, the task-receiving processor must look through pages of memory (a process commonly referred to as a "page walk") to acquire the translation information before the task processing can begin. Often, the processing delay or latency from a TLB miss can be measured in tens to hundreds of clock cycles.
A method is provided for avoiding cold TLB misses in a heterogeneous computing system having at least one central processing unit (CPU) and one or more graphics processing units (GPUs). The at least one CPU and the one or more GPUs share a common memory address space and have independent translation lookaside buffers (TLBs). The method for offloading a task from a particular CPU to a particular GPU includes sending the task and translation information to the particular GPU. The particular GPU receives the task and processes the translation information to load address translation data into the TLB associated with the one or more GPUs prior to executing the task.
A heterogeneous computer system includes at least one central processing unit (CPU) for executing a task or offloading the task, and a first translation lookaside buffer (TLB) coupled to the at least one CPU. Also included are one or more graphics processing units (GPUs) capable of executing the task and a second TLB coupled to the one or more GPUs. A common memory address space is coupled to the first and second TLBs and is shared by the at least one CPU and the one or more GPUs. When a task is offloaded from a particular CPU to a particular GPU, translation information is included in the task hand-off, from which the particular GPU loads address translation data into the second TLB prior to executing the task.
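For illustration only, the following minimal C++ sketch models the hand-off just summarized. All of the names (TaskHandoff, TlbEntry, GpuTlb, gpu_accept_handoff) are assumptions introduced here, not identifiers from the embodiments, and a real TLB is hardware rather than a vector:

```cpp
#include <cstdint>
#include <vector>

// One virtual-page-to-physical-frame mapping, i.e., one TLB entry.
struct TlbEntry {
    std::uint64_t virtual_page;
    std::uint64_t physical_frame;
};

// A task hand-off description: conventionally just a pointer to the task,
// here supplemented with translation information for the receiving GPU.
struct TaskHandoff {
    void (*task_entry)(void*);          // the task to execute
    void* task_args;                    // arguments, in the common memory
    std::vector<TlbEntry> translations; // translation information
};

// A trivially modeled second TLB (the TLB coupled to the GPUs).
struct GpuTlb {
    std::vector<TlbEntry> slots;
};

// GPU side of the hand-off: load the supplied translation data into the
// GPU TLB before executing the task, so early accesses hit rather than miss.
void gpu_accept_handoff(GpuTlb& tlb_gpu, const TaskHandoff& h) {
    for (const TlbEntry& e : h.translations)
        tlb_gpu.slots.push_back(e);     // pre-load (avoids cold TLB misses)
    h.task_entry(h.task_args);          // then execute the offloaded task
}
```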
The embodiments will hereinafter be described in conjunction with the following drawing figures, wherein like numerals denote like elements.
The following detailed description is merely exemplary in nature and is not intended to limit the disclosure or the application and uses of the disclosure. As used herein, the word "exemplary" means "serving as an example, instance, or illustration." Thus, any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. All of the embodiments described herein are exemplary embodiments provided to enable persons skilled in the art to make or use the disclosed embodiments and not to limit the scope of the disclosure, which is defined by the claims. Furthermore, there is no intention to be bound by any expressed or implied theory presented in the preceding technical field, background, or brief summary, or in the following detailed description, nor by any particular computer system.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Numerical ordinals such as "first," "second," "third," etc. simply denote different ones of a plurality and do not imply any order or sequence unless specifically defined by the claim language.
Additionally, the following description refers to elements or features being “connected” or “coupled” together. As used herein, “connected” may refer to one element/feature being directly joined to (or directly communicating with) another element/feature, and not necessarily mechanically. Likewise, “coupled” may refer to one element/feature being directly or indirectly joined to (or directly or indirectly communicating with) another element/feature, and not necessarily mechanically. However, it should be understood that, although two elements may be described below as being “connected,” similar elements may be “coupled,” and vice versa. Thus, although the block diagrams shown herein depict example arrangements of elements, additional intervening elements, devices, features, or components may be present in an actual embodiment.
Finally, for the sake of brevity, conventional techniques and components related to computer systems and other functional aspects of a computer system (and the individual operating components of the system) may not be described in detail herein. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent example functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in an embodiment.
Referring now to FIG. 1, a heterogeneous computer system 100 includes one or more central processing units (CPUs) 102 and one or more graphics processing units (GPUs) 104 that share a common memory 110, which contains a page table 112.
While the CPUs 102 and GPUs 104 both utilize the same common memory (address space) 110, each of these different types of processing units has an independent address translation mechanism that in some embodiments may be optimized to the particular type of processing unit (i.e., the CPUs or the GPUs). That is, fundamentally, the CPUs 102 and the GPUs 104 utilize a virtual addressing scheme to address the common memory 110. Accordingly, a translation lookaside buffer (TLB) is used to translate virtual addresses into physical addresses so that the processing unit can locate instructions to execute and/or data to process. As illustrated in FIG. 1, a TLBcpu 106 is coupled to the CPUs 102 and a TLBgpu 108 is coupled to the GPUs 104.
Thus, when a CPU 102 or GPU 104 attempts to access the common memory 110 (e.g., attempts to fetch data or an instruction located at a particular virtual memory address, or attempts to store data to a particular virtual memory address), the virtual memory address must be translated to a corresponding physical memory address. Accordingly, the TLB is searched first when translating a virtual memory address into a physical memory address in an attempt to provide a rapid translation. Typically, a TLB has a fixed number of slots that contain address translation data (entries), which map virtual memory addresses to physical memory addresses. TLBs are usually content-addressable memory, in which the search key is the virtual memory address and the search result is a physical memory address. In some embodiments, the TLBs are a single memory cache. In some embodiments, the TLBs are networked or organized in a hierarchy, as is known in the art. However the TLBs are realized, if the requested address is present in the TLB (i.e., "a TLB hit"), the search yields a match quickly and the physical memory address is returned. If the requested address is not in the TLB (i.e., "a TLB miss"), the translation proceeds by looking through the page table 112 in a process commonly referred to as a "page walk". After the physical memory address is determined, the virtual-to-physical mapping is loaded into the respective TLB 106 or 108, depending upon which processor type (CPU or GPU) requested the translation.
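The translation path just described can be rendered as a short software model. This is a hedged sketch only: the names (Tlb, PageTable, translate) are assumptions, and the hash map stands in for the content-addressable memory of a real TLB:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

using VirtPage  = std::uint64_t;
using PhysFrame = std::uint64_t;

// TLB modeled as a map: the search key is the virtual page and the
// search result is the physical frame, mimicking content-addressable memory.
struct Tlb {
    std::unordered_map<VirtPage, PhysFrame> entries;

    std::optional<PhysFrame> lookup(VirtPage vp) const {
        auto it = entries.find(vp);
        if (it == entries.end()) return std::nullopt;  // TLB miss
        return it->second;                             // TLB hit
    }
};

// The page table resident in the common memory; walk() stands in for the
// slow page walk.
struct PageTable {
    std::unordered_map<VirtPage, PhysFrame> mappings;
    PhysFrame walk(VirtPage vp) const { return mappings.at(vp); }
};

// Translate a virtual page: search the requester's TLB first; on a miss,
// walk the page table, then install the mapping in that requester's TLB.
PhysFrame translate(Tlb& tlb, const PageTable& pt, VirtPage vp) {
    if (auto hit = tlb.lookup(vp)) return *hit;  // fast path: TLB hit
    PhysFrame pf = pt.walk(vp);                  // slow path: page walk
    tlb.entries[vp] = pf;                        // load the mapping (TLB 106 or 108)
    return pf;
}
```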
In general-purpose computing using GPUs (GPGPU computing), a GPU is typically utilized to perform some work or task traditionally executed by a CPU (or vice versa). To do this, the CPU hands off or offloads a task to a GPU, which in turn executes the task and provides the CPU with a result, data or other information, either directly or by storing the information in the common memory 110 where the CPU can retrieve it when needed. In the event of a task hand-off, it is likely that the translation information needed to perform the offloaded task will be missing from the TLB of the other processor type, resulting in a cold (initial) TLB miss. As noted above, to recover from a TLB miss, the task-receiving processor is required to look through the page table 112 of memory 110 (commonly referred to as a "page walk") to acquire the translation information before the task processing can begin.
Referring now to FIG. 2, a task hand-off from a particular CPU 102 to a particular GPU (GPUy 104y) is illustrated. The GPUy 104y includes a dispatcher or scheduler 202 that accepts the hand-off and dispatches the task for execution. In a conventional hand-off, the task description sent to the GPUy 104y is merely a pointer to the task to be performed, so the GPUy 104y begins execution with no relevant entries in the TLBgpu 108 and incurs cold TLB misses (and the attendant page walks) before useful work can proceed.
Accordingly, some embodiments contemplate enhancing or supplementing the task hand-off description (pointer) with translation information from which the dispatcher or scheduler 202 of the GPUy 104y can load (or pre-load) the TLBgpu 108 with address translation data prior to beginning, or during, execution of the task. In some embodiments, the translation information is definite, that is, directly related to the address translation data loaded into the TLBgpu 108. Non-limiting examples of definite translation information include address translation data (TLB entries) from the TLBcpu 106 that may be loaded directly into the TLBgpu 108. Alternately, the TLBgpu 108 could be advised where to probe into the TLBcpu 106 to locate the needed address translation data. In some embodiments, the translation information is used to predict or derive the address translation data for the TLBgpu 108. Non-limiting examples of predictive translation information include compiler analysis, dynamic runtime analysis or hardware tracking that may be employed in any particular implementation. In some embodiments, translation information is included in the task hand-off from which the GPUy 104y can derive the address translation data; non-limiting examples of this type of translation information include patterns or encodings of future address accesses that could be parsed to derive the address translation data. Generally, any translation information from which the GPUy 104y can directly or indirectly load the TLBgpu 108 with address translation data to reduce or avoid the occurrence of cold TLB misses (and the subsequent page walks) is contemplated by the present disclosure.
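The following sketch enumerates the flavors of translation information described above and a scheduler-side pre-load routine. The type names, the hint encoding, and the walk helper are all assumptions made for illustration; nothing here is the disclosure's own API:

```cpp
#include <cstdint>
#include <vector>

struct TlbEntry { std::uint64_t virtual_page, physical_frame; };

// The flavors of translation information described above.
enum class TranslationInfoKind {
    DirectEntries,  // definite: TLB entries copied from TLBcpu 106
    ProbeHint,      // definite: where to probe TLBcpu 106 for the entries
    Predictive,     // indirect: compiler/runtime/hardware predictions
    Encoded         // indirect: patterns or encodings of future accesses
};

struct TranslationInfo {
    TranslationInfoKind kind;
    std::vector<TlbEntry> entries;     // used by DirectEntries
    std::vector<std::uint64_t> hints;  // probe slots, or predicted pages
};

// Dispatcher/scheduler-side pre-load of TLBgpu 108; `walk` is an assumed
// up-front page-walk helper used to resolve predicted or derived pages.
void preload_tlb_gpu(std::vector<TlbEntry>& tlb_gpu,
                     const std::vector<TlbEntry>& tlb_cpu,
                     const TranslationInfo& ti,
                     std::uint64_t (*walk)(std::uint64_t)) {
    switch (ti.kind) {
    case TranslationInfoKind::DirectEntries:
        // definite information: load the forwarded entries as-is
        tlb_gpu.insert(tlb_gpu.end(), ti.entries.begin(), ti.entries.end());
        break;
    case TranslationInfoKind::ProbeHint:
        // definite information: probe TLBcpu at the advised slots
        for (std::uint64_t slot : ti.hints)
            if (slot < tlb_cpu.size()) tlb_gpu.push_back(tlb_cpu[slot]);
        break;
    case TranslationInfoKind::Predictive:
    case TranslationInfoKind::Encoded:
        // indirect information: derive the pages the task will touch and
        // resolve them up front, trading one scheduled walk now for a
        // cold miss mid-task
        for (std::uint64_t vp : ti.hints)
            tlb_gpu.push_back(TlbEntry{vp, walk(vp)});
        break;
    }
}
```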
Referring now to FIG. 3, a method for offloading a task from a particular CPU to a particular GPU is illustrated. The offloading CPU prepares the task hand-off description, supplements it with translation information of one of the types described above, and sends the enhanced hand-off to the selected GPU, which may then load (or pre-load) its TLB before executing the task.
Referring now to FIG. 4, a method 400 performed by the task-receiving processor begins upon receipt of a task hand-off containing translation information. Decision 404 determines whether the translation information is directly associated with the address translation data, that is, whether it comprises TLB entries (or advice on where to probe for them) that can be used as-is. If so, the TLB entries representing the address translation data are loaded into the task-receiving processor's TLB (step 406).
A negative determination of decision 404 indicates that the translation information is not directly associated with the address translation data. Accordingly, decision 408 determines whether the task-receiving processor can (or should) obtain the address translation data from the translation information (step 410). Such would be the case if the task-receiving processor needed to predict or derive the address translation data based upon (or from) the translation information. As noted above, address translation data could be predicted from compiler analysis, dynamic runtime analysis or hardware tracking that may be employed in any particular implementation. Also, the address translation data could be obtained in step 410 by parsing patterns or encodings of future address accesses. Regardless of the manner employed to obtain the address translation data, the TLB entries representing the address translation data are loaded in step 406. However, decision 408 could determine that the address translation data could not (or should not) be obtained. Such would be the case if the translation information were discovered to be invalid, or if the required translation is no longer in the physical memory space (for example, having been moved to a secondary storage medium). In this case, decision 408 essentially ignores the translation information and the routine proceeds to begin the task (step 412), as sketched below.
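The decision structure of 404/408/410 can be condensed into a single routine. This is an illustrative rendering under assumed names (TranslationInfo, resolve_translation, derive_entries); the derivation itself is elided:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct TlbEntry { std::uint64_t virtual_page, physical_frame; };

struct TranslationInfo {
    bool directly_usable;            // outcome of decision 404
    bool obtainable;                 // outcome of decision 408
    std::vector<TlbEntry> entries;   // present when directly usable
};

// Stand-in for step 410: predict or derive entries from indirect
// translation information (the parsing/prediction is elided here).
static std::vector<TlbEntry> derive_entries(const TranslationInfo&) {
    return {};
}

// Returns the entries to load at step 406, or nothing when the translation
// information is ignored and the task simply starts cold (step 412).
std::optional<std::vector<TlbEntry>>
resolve_translation(const TranslationInfo& ti) {
    if (ti.directly_usable)          // decision 404: yes
        return ti.entries;           // -> step 406
    if (ti.obtainable)               // decision 408: yes
        return derive_entries(ti);   // step 410, then step 406
    return std::nullopt;             // decision 408: no -> step 412
}
```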
To begin processing an offloaded task, the first translation is requested, and decision 414 determines whether there has been a TLB miss. If step 412 was entered via step 406, a TLB miss should be avoided and a TLB hit returned. However, if step 412 was entered via a negative determination of decision 408, it is possible that a TLB miss occurred, in which case a conventional page walk is performed in step 418. The routine continues to execute the task (step 416) and after each step determines whether the task has been completed (decision 420). If the task is not yet complete, the routine loops back to perform the next step (step 422), which may involve another address translation. That is, during the execution of the offloaded task, several address translations may be needed, and in some cases a TLB miss will occur, necessitating a page walk (step 418). However, if execution of the task was entered via step 406, the page walks (and the associated latency) should be substantially reduced or eliminated for some task hand-offs. Increased efficiency and reduced power consumption are direct benefits afforded by the hand-off system and process of the present disclosure.
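The execution loop of steps 412 through 422 reduces to the following sketch. The Step type and the three hooks are placeholders assumed for illustration, standing in for whatever the hardware and runtime actually provide:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Step { std::uint64_t virtual_page; };  // each step touches one page

// Steps 412-422 as a loop: request a translation, page-walk on a miss,
// execute the step's work, and repeat until the task is complete.
void run_offloaded_task(const std::vector<Step>& task,
                        const std::function<bool(std::uint64_t)>& tlb_hit,
                        const std::function<void(std::uint64_t)>& page_walk,
                        const std::function<void(const Step&)>& execute) {
    for (const Step& s : task) {           // step 422: perform the next step
        if (!tlb_hit(s.virtual_page))      // decision 414: TLB miss?
            page_walk(s.virtual_page);     // step 418: recover via page walk
        execute(s);                        // step 416: execute the task step
    }                                      // decision 420: task complete
}
```

When the TLB was pre-loaded at step 406, the tlb_hit predicate returns true for the task's working set and the page_walk path is rarely, if ever, taken.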
When decision 420 determines that the task has been completed, the task results are sent to the offloading processor in step 424. This could be realized in one embodiment by responding to a query from the offloading processor to determine whether the task is complete. In another embodiment, the processor accepting the task hand-off could trigger an interrupt or send another signal to the offloading processor indicating that the task is complete. Once the task results are returned, the routine ends in step 426.
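Both completion styles just described, polling and interrupt-like signaling, fit in one small structure. The Completion type and its members are assumptions for illustration only:

```cpp
#include <atomic>
#include <functional>

// Completion signaling between the task-accepting and offloading processors.
struct Completion {
    std::atomic<bool> done{false};
    std::function<void()> on_complete;  // optional interrupt-style callback

    // Task-accepting side: mark the task finished (step 424) and, if an
    // interrupt-style handler was registered, fire it.
    void signal() {
        done.store(true, std::memory_order_release);
        if (on_complete) on_complete();
    }

    // Offloading side, polling style: answer "is the task complete yet?".
    bool poll() const { return done.load(std::memory_order_acquire); }
};
```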
A data structure representative of the computer system 100 and/or portions thereof included on a computer readable storage medium may be a database or other data structure which can be read by a program and used, directly or indirectly, to fabricate the hardware comprising the computer system 100. For example, the data structure may be a behavioral-level description or register-transfer level (RTL) description of the hardware functionality in a hardware description language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool, which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library. The netlist comprises a set of gates which also represent the functionality of the hardware comprising the computer system 100. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the computer system 100. Alternatively, the database on the computer readable storage medium may be the netlist (with or without the synthesis library) or the data set, as desired, or Graphic Data System (GDS) II data.
The methods illustrated in FIGS. 3 and 4 may be implemented in hardware, in software or firmware executed by the computer system 100, or in any combination thereof.
While exemplary embodiments have been presented in the foregoing detailed description, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing the exemplary embodiments, it being understood that various changes may be made in the function and arrangement of elements described in the exemplary embodiments without departing from the scope as set forth in the appended claims and their legal equivalents.