The present disclosure is generally related to data processing, and more particularly, to memory stack operations.
II. DESCRIPTION OF RELATED ART
Advances in technology have resulted in more powerful computing devices that may be used to perform various tasks. Computing devices typically include at least one processor. To improve the processing of instructions within computing devices, branch prediction is sometimes used by processors to anticipate branching within program code. A Return Address Stack (RAS) can be used by a microprocessor to predict return addresses for call/return type branches encountered during program execution. A RAS may be a special implementation of a stack (a last in first out (LIFO) data structure). Call instructions may cause a return address to be pushed onto the RAS. Return instructions may cause an address located at the top of the RAS to be popped off the stack and used to predict a return address. When the branch prediction is successful, the address popped off the top of the RAS may be used to redirect an instruction fetch unit of the processor. When the branch prediction is unsuccessful, a correct return address may be calculated in order to correctly redirect the instruction fetch unit.
Each thread of a multithreaded processor may use its own dedicated RAS and each RAS may have a fixed size. Each thread may use a different number of entries of their fixed-size RAS depending upon the instruction stream being executed by each thread. However, the use of multiple return address stacks adds cost to the system.
A particular embodiment may allow multithreaded access to a shared stack. Access to the shared stack may be dynamically allocated to multiple threads according to a current demand of each thread. Entries of the shared stack may be assigned and overwritten by the threads through the manipulation of thread pointers. Because thread pointers are moved instead of moving the stack address entries, processing and power demands may be reduced. The shared stack may include fewer entries than conventional, static multithreaded return address stacks because entries of the shared stack may be more efficiently distributed among the threads. Fewer stack entries may translate into smaller space demands on a die, as well as increased performance.
In a particular embodiment, an apparatus is disclosed that includes a shared stack. The apparatus further includes a controller operable to selectively allocate a first portion of the shared stack to a first thread and to selectively allocate a second portion of the shared stack to a second thread.
In another particular embodiment, a method of managing a stack shared by multiple threads of a processor includes allocating a first portion of a shared stack to a first thread and allocating a second portion of the shared stack to a second thread.
In another particular embodiment, a computer readable tangible medium stores instructions executable by a computer. The instructions include instructions that are executable by the computer to allocate a first portion of a shared stack to a first thread and to allocate a second portion of the shared stack to a second thread.
Particular advantages provided by at least one of the disclosed embodiments may result from the shared stack being configured for shared access by multiple threads. The shared stack may include fewer total entries than would be conventionally required by a system whose threads each access a separate stack. A smaller total number of entries may result in smaller space demands on a die, as well as reduced complexity and power demands. Reduced stack size and other advantages facilitated by one or more of the disclosed embodiments may further improve memory distribution and stack access efficiency.
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.
A system is disclosed that includes a memory stack that is shared by multiple threads of a multi-threaded processor. Entries of the shared stack may be dynamically allocated to a given thread by a stack controller. With such allocation, each thread's allocated portion of the shared stack is sized for efficient stack use and in consideration of fairness policies. The shared stack may include fewer stack entries than would be used by separate stacks that are each dedicated for a distinct thread.
Referring to
The multithreaded processor 160 may include one or more multithreaded processor cores 106 that may execute one or more threads. For example, the multi-threaded processor 160 may execute a first thread 108 (thread 0), a second thread 110 (thread 1), a third thread 112 (thread 2), and an nth thread 114 (thread (n−1)). The multithreaded processor 160 may send stack operation requests to the controller 102 for one or more of the threads 108-114. For example, the multithreaded processor 160 may send a push request 138 to the controller 102 identifying a thread generating the push request 138 and further identifying data to be pushed onto the requesting thread's portion of the shared stack 105. As another example, the multithreaded processor 160 may send a pop request 136 to the controller 102. The pop request 136 may identify a requesting thread of the multiple threads 108-114 and request a return of the data that was most recently added to the requesting thread's portion of the shared stack 105.
The controller 102 is operative to selectively allocate portions of the shared stack 105 to individual threads. For example, the controller 102 may selectively allocate a first portion 116 of the shared stack 105 to the first thread 108, a second portion 118 of the shared stack 105 to the second thread 110, a third portion 120 of the shared stack 105 to the third thread 112, and an nth portion 122 of the shared stack 105 to the nth thread 114. In a particular embodiment, the shared stack 105 may include a hardware stack. Alternatively, the shared stack 105 may be implemented in other memory, such as random access memory (RAM).
The controller 102 dynamically allocates the portions 116-122 of the shared stack 105 in response to thread requests subject to a fairness policy 140 and maintains data indicating allocation and usage of the allocated portions 116-122 via stack pointers 142. For example, the fairness policy 140 may indicate that the shared stack 105 is allocated amongst threads equally, in fixed-size portions for each thread, in fixed-size portions for each executing thread, in variable-sized portions for each thread, in variable-sized portions for each executing thread, in weighted average-sized portions with respect to a number of stack misses, or in some other fashion. The stack pointers 142 include pointers 162 corresponding to the first thread 108, pointers 164 corresponding to the second thread 110, pointers 166 corresponding to the third thread 112, and pointers 168 corresponding to the nth thread 114. For example, each of the thread pointers 162-168 may include a top entry pointer, a top-of-stack pointer, a bottom-of-stack pointer, a bottom entry pointer, and one or more additional status bits, as described with respect to
The first portion 116 may be allocated for stack operations corresponding to the first thread 108. The first portion 116 may include one or more stack entries (not shown) that may store one or more data elements 124 of the first thread 108. The second portion 118 may be allocated for stack operations of the second thread 110 and may include one or more stack entries that may store one or more data elements 126 of the second thread 110. The third portion 120 may be allocated for stack operations corresponding to the third thread 112 and may include one or more stack entries that may store one or more data elements 128 of the third thread 112. The nth portion 122 may be allocated for stack operation corresponding to the nth thread 114 and may include one or more stack entries that may store one or more data elements 130 of the nth thread 114.
During operation, one or more threads of the multithreaded processor 160 may generate a stack operation request, such as the push request 138. The controller 102 may receive the push request 138 and determine whether the requesting thread (e.g., the first thread 108) has sufficient space in its allocated portion (e.g., the first portion 116) to accommodate addition of data (e.g., to the data elements 124 of the first thread 108). If the first portion 116 does not include sufficient space, the controller 102 may determine whether to dynamically allocate additional space to the first thread 108 in accordance within the fairness policy 140. For example, when the other threads 110-114 are not using the remainder of the shared stack 105, the controller 102 may dynamically allocate additional memory to the first portion 116 to accommodate additional push requests corresponding to the first thread 108.
Alternatively, when multiple threads 110-114 are actively using the shared stack 105, the controller 102 may prevent the first thread 108 from being allocated more than a fair share of the shared stack 105, as determined by the fairness policy 140. Each portion of the shared stock may be implemented using a circular buffer. When the controller 102 determines that the push request 138 can be satisfied, the controller 102 initiates memory operations 150 that may include writing received data for the push request 138 to one of the data elements 124 of the first thread 108 within the first portion 116. The controller 102 may also update the first thread pointers 162 to indicate a change of a top entry of the stack when the first portion has been resized.
The controller 102 may also be configured to receive a pop request 136 from any of the threads, such as from the first thread 108. In response to the pop request 136, the controller 102 is configured to initiate memory operations 150 to read a most recently added data element from the data elements 124 of the first thread 108 and to return data 134 to the multithreaded processor 160 in response to the pop request 136. For example, when the shared stack 105 is implemented as a function return address stack (RAS), the return data 134 includes a function return address corresponding to the requested thread, such as the first thread 108. In addition, the controller 102 may update the first thread pointers 162 to indicate that the top-of-stack has been popped and to reduce an allocation of stack space usage within the first portion 116 in response to fulfilling the pop request 136.
In a particular embodiment where the controller 102 predicts the function return address via a first instruction, a mispredicted function return address from the shared stack 105 results in a corrected function return address being subsequently returned. For example, the multithreaded processor 160 may receive the pop request 136 in the form of an instruction to be executed (e.g., a return instruction). The return execution may take multiple cycles to complete. At the start of the execution of the return instruction, a return address may be popped from the shared stack 105. This popped return address may be a speculative address that is used to redirect instruction fetches. As the return instruction continues to execute (e.g., makes its way through a pipeline of the multithreaded processor 160), the multithreaded processor 160 may calculate the “actual” (e.g., non-speculative) return address. When the return instruction has completed execution, the speculative and the non-speculative return addresses may be compared. When the return addresses do not match (e.g., a branch mispredict has occurred), the “actual” return address may be used to fetch instructions. The “incorrectly” fetched instructions may be flushed.
As a result of the controller 102 implementing the fairness policy 140 to dynamically allocate portions of the shared stack 105 in response to a push operation request 138, the shared stack 105 may be used by multiple threads 108-114 and dynamically allocated, such that each thread 108-114 is provided a distinct portion of the shared stack 105 that operates as an individual stack for efficient utilization of the total stack entries within the shared stack 105. Allocation and de-allocation of portions 116-122 of the shared stack 105 between the individual threads 108-114 may be performed through operation of the controller 102 and by control of the stack pointers 142, such that data elements 124-130 need not be moved within the shared stack 105. Instead, the allocated portions 116-122 of the shared stack 105 may be “moved” around the data elements 124-130. As a result, the shared stack 105 may be efficiently used and may accommodate stack operations from the multiple threads 108-114 and may use fewer total stack entries than a system that includes separate dedicated stacks for each thread 108-114.
Although four threads and four portions are shown in
Referring to
The thread 112 (e.g., thread 2) may include a top entry pointer (TEP) 252 to which the thread 112 may write. The top entry pointer 252 may point to a top-most entry within the shared stack 105. The thread 112 may also include a top of stack pointer (TSP) 254. The top of stack pointer 254 may point to a top, or most recently written, entry within the shared stack 105 associated with the thread 112. The most recently written entry may include data that would be returned upon a stack pop operation. A bottom of stack pointer (BSP) 256 included within the thread 112 may point to an oldest valid entry within the shared stack 105 that is associated with the thread 112. The thread 112 may further include a bottom entry pointer (BEP) 258. The bottom entry pointer 258 may point to a bottom-most entry of the shared stack 105 to which the thread 112 may write. The thread 112 may further include a thread active status bit (TASB) 260. The thread active status bit 260 may indicate if the thread 112 is executing a process. If the thread 112 is not executing the process, the thread 112 may not be allocated an entry of the shared stack 105.
The thread 112 further includes a wrapped stack bit (WSB) 264. The wrapped stack bit 264 may indicate whether the shared stack 105 has wrapped in response to the top of stack pointer 254 and/or bottom of stack pointer 256 advancing to point to a common entry. For example, the wrapped stack bit 264 may be set to a logic one value to indicate that the top of stack pointer 254 has wrapped around to reach the bottom of stack pointer 256. Where the wrapped stack bit 264 is set to a logic one value, a subsequent push may cause both the bottom of stack pointer 256 and top of stack pointer 254 to increment. The push operation may additionally cause the oldest entry in the shared stack (pointed to by the bottom of stack pointer 256) to be overwritten. Where a pop operation from the shared stack 105 occurs while the wrapped stack bit 264 of the thread has the logic one value, the wrapped stack bit 264 may transition to a zero value.
The thread 112 also includes an empty/full bit (EFB) 262. The empty/full bit 262 may be configured to indicate whether an entry of the shared stack 105 associated with the thread 112 is used. For example, the empty/full bit 262 may include a logic one value when data of the thread 112 is present in the shared stack 105. Conversely, the empty/full bit 262 may include a logic zero value when no data of the thread 112 is stored within the shared stack 105. Where the empty/full bit 262 is a logic zero value and a push operation is made, the empty/full bit 262 may transition to a logic one value. The empty/full bit 262 may transition to a logic zero value if a pop operation occurs while the top of stack pointer 254 and the bottom of stack pointer 256 both point to the same entry and while the wrapped stack bit 264 equals the logic zero value.
Generally, threads may be allocated or “own” entries of the shared stack 105 that are bounded by their top entry pointer and their bottom entry pointer. A thread may write data to such “owned” entries. In addition, threads may be considered to “utilize” entries between their top of stack pointer and their bottom of stack pointer. Thus, entries of the shared stack 105 may be owned by a particular thread, owned and utilized by a particular thread, or neither owned nor utilized by a particular thread.
Similar to the thread 112, the thread 110 (i.e., thread 1) may include a top entry pointer 238 and a top of stack pointer 240. The thread 110 may further include a bottom of stack pointer 242 and a bottom entry pointer 244. A thread active status bit 246, an empty/full bit 248, and a wrapped stack bit 250 may also be included within the thread 110.
The thread 108 (i.e., thread 0) may include a top entry pointer 224 and a top of stack pointer 226. The thread 108 may further include a bottom of stack pointer 228 and a bottom entry pointer 230. A thread active status bit 232, an empty/full bit 234, and a wrapped stack bit 236 may also be included within the thread 108.
The shared stack 105 may include entries 210, 212, 214, 216, 218, 220, and 222 that store memory addresses of instructions. The entries 210, 212, 214, 216, 218, 220, and 222 may be configured to be shared by the threads 108, 110, and 112. During a single threaded operation, the thread 112 may become active (i.e., begin to execute a process). The top entry pointer 252, the top of stack pointer 254, the bottom of stack pointer 256, and the bottom entry pointer 258 may be set to point to the entry 222, as indicated by the dashed lines 266. The thread 112 may be said to “own” the entry 222 (e.g., the thread 112 may write data to the entry 222).
The empty/full bit 262 and the wrapped stack bit 264 bit may be set to indicate an empty condition (e.g., a logic zero value), and the thread active status bit 260 may be set to indicate an active condition (e.g., a logic one value). As calls are made by the process executing on the thread 112, an address may be pushed on the shared stack 105 into the entry 222 indicated by the top of stack pointer 254. The address may indicate a return address.
Where a subsequent call is made by the process executing on the thread 112, another address may be pushed onto the shared stack 105 into the entry 220 above the entry 222 indicated by the top of stack pointer 254. The push may cause the top entry pointer 252 and the top of stack pointer 254 to dynamically increment to point to the newly written entry 220. The incrementing may thus allocate to the thread 112 one or more additional entries of the shared stack 105. Where the process executing on the thread 112 indicates a return, a pop operation from the shared stack 105 may occur to retrieve a return predicted address. The pop operation may cause the top of stack pointer 254 of the thread 112 to decrement, but the top entry pointer 252 may not decrement.
A branch mispredict may be caused by an inaccurate return address being popped from the shared stack 105. In response, all data stored in the stack 105 and associated with the mispredicting thread may be discarded and stack construction for the thread may be restarted. For example, when the thread 112 mispredicts, the reconstruction for the thread 112 may be initiated by setting the top of stack pointer 254 equal to the bottom of stack pointer 256, setting the state of the empty/full bit 262 to indicate an empty state, and setting the wrapped stack bit 264 to indicate no wrapping. Pop requests may be avoided while the empty/full bit 262 equals a zero value, which may save processing power.
The top entry pointer 252 may point to the entry 216, as indicated by the dashed line 270. As indicated by the dashed line 268, the top of stack pointer 254 may point to the entry 220. Both the bottom of stack pointer 256 and the bottom entry pointer 258 point to the entry 222, as indicated by the dashed lines 266. As such, the thread 112 may be allocated, or “own,” the entries 216, 218, 220, and 222, as bounded by the line 270 (logically pointing from the top entry pointer 252) and the line 266 (logically pointing from the bottom entry pointer 258).
Upon becoming active (e.g., actively executing a process), the thread 110 may anchor itself within the entry 272 of the shared stack 105. The thread 110 may push a data element into the shared stack 105 within the entry 272. In a particular embodiment, the entry 272 may have been previously unallocated to, or unclaimed by, another thread. In another particular embodiment, the entry used to anchor the thread may previously have been allocated to another thread. As indicated by the dashed line 276 of
The thread 110 may claim another entry and push an additional data element onto the claimed entry of the shared stack 105 into entry 274.
Referring to
The thread 112 may own the entries of the shared stack 105 from 214 indicated by the dashed line 270 (that logically points from the top entry pointer 252) to the entry 220 indicated by the dashed line 266 (that logically points from the bottom entry pointer 258).
Similarly, the thread 110 may own the entries of the shared stack 105 bounded by the entry 222 (indicated by the line 276 that logically points from the top entry pointer 238) and wrapping around the stack 105 to include the entries 210 and 212 (indicated by the line 286 that logically points from the bottom entry pointer 244). The thread 110 may be said to be wrapped around within the shared stack 105 because the bottom of stack pointer 242 is above the top of stack pointer 240. The thread 110 thus owns entries 210, 212, and 222. The portion of the shared stack 105 owned by the thread 110 is thus contiguous with the thread 112, and no free entries exist for the thread 110 to claim in response to a push.
Where the thread 110 performs a push onto the shared stack 105 and the top of stack pointer 240 is pointing at the same entry as the top entry pointer 238, the data may be pushed to one of two locations. For example, the thread 110 may increment both the top of stack pointer 240 and the top entry pointer 238, thereby claiming a new entry and more stack space to store the newly pushed data, as described in
Where the entry 220 above the entry 222 pointed to by the top entry pointer 238 is unclaimed, the thread 110 may claim the entry 220 by updating the top entry pointer 238. The thread 110 may further set the top of stack pointer 240 to point at the same entry 220.
Where the entry 220 above the entry 222 pointed to by the top entry pointer 238 is claimed by another thread, and the thread 110 already owns at least its fair share of the stack 105, a subsequent push may not allow the thread 110 to claim the additional stack entry 220. Instead, the top of stack pointer 240 may wrap around from the top entry pointer 238 to the bottom entry pointer 244.
Where the entry 220 above the entry 222 pointed to by the top entry pointer 238 is claimed by another thread, and the thread 110 does not yet own its fair share of the shared stack 105, a push may be allowed to claim an entry 220 from another thread 112. The entry 220 may be claimed by updating the bottom entry pointer 258 of the thread 112 and updating the top entry pointer 238 of the thread 110. Pointers of the thread 112 may also be updated to reflect the “loss” of the entry 220, as described with reference to
As illustrated in
Referring to
The thread 112 may own the entries of the shared stack 105 from 214 bounded by the dashed line 270 (that logically points from the top entry pointer 252) to the entry 218 bounded by the dashed line 280 (that logically points from the bottom entry pointer 258). The thread 112 may have lost the entry 220 to the thread 110. Thus, in the particular embodiment illustrated in
Similarly, the thread 110 may own the entries of the shared stack 105 bounded by the entry 220 (indicated by the line 276 that logically points from the top entry pointer 238) and wrapping around the stack 105 to include the entries 210 and 212 (indicated by the line 286 that logically points from the bottom entry pointer 244). The thread 110 thus owns entries 210, 212, 220, and 222. The thread 112 has lost the entry 220 to the thread 110.
According to a particular embodiment, a thread 112 that loses a valid entry 220 (e.g., an entry located between a bottom of stack pointer 256 and a top of stack pointer 254) may invalidate any entries older than the entry 220 that is lost. To this end, the thread 112 may set its bottom of stack pointer 256 to point towards the same entry as the newly incremented bottom entry pointer 258.
A first portion of a shared stack may be allocated to a first thread, at 702. For example, the first portion 116 of the shared stack 105 of
A second portion of the shared stack may be allocated to a second thread, at 704.
For example, the second portion 118 of the shared stack 105 of
A third portion of the shared stack may be allocated to a third thread, at 706.
For example, the third portion 120 of the shared stack 105 of
The shared stack may be used to predict a return address, at 708. For example, the shared stack 105 of
It should be noted that although the method 700 of
A shared stack may be formed within a portion of a memory, at 802. For example, the shared stack 105 of
A first portion and a second portion of the shared stack may be dynamically allocated, at 804. For example, the first portion 116 of the shared stack 105 of
The size of the first portion or the second portion, or both, may be adjusted, at 806. For example, the size of the first portion 116 may be dynamically adjusted according to a request of the first thread 108. Pointers 162 of the first thread 108 may point to entries of the first portion 116 to shrink or grow the first portion 116 to meet the request. Similarly, the size of the second portion 118 may be dynamically adjusted according to a request of the second thread 110. Pointers 164 of the second thread 110 may point to entries of the second portion 118 to shrink or grow the second portion 118 to meet the request.
Alternatively, a size of the first portion, the second portion, or both, may be fixed, at 808. For example, a size of the first portion 116 of
The size of the first portion allocated to the first thread may be limited according to a fairness policy, at 810. For example, the size of the first portion 116 allocated to the first thread 108 may be limited according to the fairness policy 140 of
The first thread and the second thread may be executed with the processor, at 812. For example, one of the multithreaded processor cores 106 may execute the first thread 108 and the second thread 110.
The first thread may be allowed to push or pop a return address with respect to an element of the first portion of the shared stack, at 814. For example, the first thread 108 may be allowed to perform a pop 136 or a push 138 of a return address at one of the data elements 124 of the first portion 116.
The second thread may be allowed to perform a push or pop of a return address with respect to an element of the second portion of the shared stack, at 816. For example, the second thread 110 may be allowed to perform a pop 136 or a push 138 of a return address at the second data element 126 of the second portion 118.
A data element within the second portion may be overwritten using the first thread, at 818. For example, the second data element 126 of the second portion 118 may be overwritten using the first thread 108. To illustrate, the first thread 108 may execute a push 138 that causes an oldest entry in the second portion 118 to be overwritten.
Referring to
The memory 932 may be a computer readable tangible medium storing instructions 950 whose execution by a computer results in the allocation of a first portion of a shared stack to a first thread. For example, the instructions 950 may be executed by the multithreaded processor 910, causing the allocation of a first portion of the shared stack 964 to a first thread. The instructions 950 may further be executed causing the allocation of a second portion of the shared stack 964 to a second thread.
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithms described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The pointers of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or other forms of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed embodiments is provided to enable a person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.