This description relates to multi-threaded processors.
Techniques exist that are designed to increase an efficiency of one or more processors in the performance of one or more processes. For example, some such techniques are used when it is anticipated that a first processor may experience a period of latency during a first process, such as when the first processor is required to wait for retrieval of required data from a memory location. During such periods of latency, a second processor may be used to perform a second process, e.g., the second processor may be given access to a resource being used by the first processor during the first process. Additionally, or alternatively, the first processor and the second processor may implement the first and second processes substantially in parallel, without either processor necessarily waiting for a period of latency in the other processor. In the latter examples, then, it may occur that the first processor and the second processor both require use of, or access to, a shared resource (e.g., a memory), at substantially a same time. By interleaving operations of the first processor and second processor, and by providing fair access to the shared resource(s), both the first and second processes may be completed sooner, and more efficiently, than if the first and second processes were performed separately, e.g., in series.
In these and other examples, the processor(s) need not represent entirely separate physical processors that are accessing shared resources. For example, a single processor may switch between processes to achieve similar results. In a related example, a processor system implemented on a semiconductor chip may emulate a plurality of processors and/or perform a plurality of processes, by, e.g., duplicating certain execution elements for the processing. These execution elements may then be used to share various resources (e.g., memories, buffers, or interconnects), which themselves may be formed on or off the chip, in order to implement the first process and the second process.
According to one general aspect, priority information is set in a control register, the priority information being related to a first thread processor and a second thread processor. A first process is executed with the first thread processor and a second process is executed with the second thread processor. The first thread processor is prioritized in performing the first process relative to the second thread processor in performing the second process, based on the priority information as determined from the control register.
According to another general aspect, an apparatus includes a first thread processor that is operable to execute a first process, and a second thread processor that is operable to execute a second process. A control register is included that is operable to store priority information that is individually associated with at least one of the first thread processor and the second thread processor, the priority information identifying a restriction on a use of a shared hardware resource by the second thread processor during execution of at least one of the first process and the second process.
According to another general aspect, an apparatus includes a plurality of thread processors that are operable to perform a plurality of processes, a shared hardware resource used by the thread processors in performing the processes, a controller associated with the shared hardware resource and operable to receive contending requests for the shared hardware resource from the plurality of thread processors, and a control register associated with the shared hardware resource and operable to store priority information regarding use of the shared hardware resource by the plurality of thread processors. The controller is operable to receive the contending requests and access the control register to provide use of the shared hardware resource to a prioritized thread processor of the plurality of thread processors, based on the priority information.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
Thus, in the example of
Such elements or resources may include, for example, functional units that perform various activities associated with the processes. Although not specifically illustrated in
For example, an arithmetic-logic unit (ALU) may perform, or may enable the performance of, logic and arithmetic operations (e.g., to add, subtract, multiply, or divide). For example, an ALU may add the contents of a register, for storage of the results in another register. A floating-point unit and/or fixed-point unit also may be used for corresponding types of mathematical operations.
In implementing the first process and the second process, some of the elements or resources of the first thread processor 102 and the second thread processor 104 may be shared between the two, while others may be duplicated for partial or exclusive access thereof by the first thread processor 102 and the second thread processor 104. For example, shared resources (e.g., a shared memory) may be used by both of the first thread processor 102 and the second thread processor 104, while duplicated elements (e.g., instruction pointers for pointing to the next instruction(s) to be fetched) may be devoted entirely to their respective thread processor(s).
For example, in concurrent multi-threaded processors, a thread processor may have its own program counter, general register file, and general execution unit (e.g., ALU). Meanwhile, other execution units, such as, for example, floating point or application specific execution units (e.g., digital signal processing (DSP) units) may either be shared or may not be shared between the thread processors.
In the example of
In implementing the first and second processes, the first thread processor 102 and the second thread processor 104 may be required to contend for use of the shared hardware resource 106. For example, the first thread processor 102 and the second thread processor 104 may both attempt to access the shared hardware resource 106 at substantially a same time (e.g., within a certain number of processor cycles of one another). In such cases, requests for the shared hardware resource 106 from the first thread processor 102 and the second thread processor 104 may be received at a controller 108 for the shared hardware resource 106. For example, where the shared hardware resource 106 includes a cache, the controller 108 may include a cache controller that is typically used to provide cache access to one or more processors, and that is implemented to communicate with a control register 110.
As described in more detail below, e.g., with respect to
For example, based on instructions of the program 114, the priority information 112 may be programmed at a particular time to provide the first thread processor 102 with priority access to the shared hardware resource 106, during a designated job of the first process. Later, the priority information 112 may be re-set, such that the second thread processor 104 is provided with priority access to the shared hardware resource 106, during a later-designated job of the second process. In other words, the priority information 112 may be determined and/or altered dynamically, including during a run time of the first process and/or the second process of the program 114. In this way, for example, a desired one of the first thread processor 102 and the second thread processor 104 may be provided with a desired type and/or extent of prioritization, during particular times and/or situations that may be desired by a programmer of the program 114.
The priority information 112 may be set within the control register 110 by setting pre-designated bit patterns (also referred to as “priority bits”) within designated fields of the control register 110. In the example of
For example, the priority identifier field 116 may include one or more bits, where a value of the bit(s) as either a “1” or a “0” may indicate that either the first thread processor 102 or the second thread processor 104 currently has priority with respect to the shared hardware resource 106. Where more than two thread processors are used, an appropriate number of bits may be selected to indicate a thread processor that currently has priority with respect to access of the shared hardware resource 106.
The priority level field 118 also may include one or more bits, where the bits indicate a type or extent of priority, as just mentioned. For example, a bit pattern corresponding to a “fair” priority level may indicate that, notwithstanding the designation in the priority identifier field 116, no priority should be granted to either the first thread processor 102 or the second thread processor 104. That is, the priority level field 118 may indicate, for example, that access to the shared hardware resource 106 should be provided fairly to the first thread processor 102 and the second thread processor 104, e.g., on a first-come, first-served basis. In this case, if access requests from the first thread processor 102 and the second thread processor 104 for access to the shared hardware resource 106 are placed into buffers or queues (not shown in
Another priority level that may be indicated by a bit pattern within the priority level field 118 may be used to designate that when both of the first thread processor 102 and the second thread processor 104 attempt to access the shared hardware resource 106 at substantially a same time, the higher-priority thread processor (as designated in the priority identifier field 116) will be allowed the access, while the other thread processor waits for access. For example, access requests from the first thread processor 102 and the second thread processor 104 for access to the shared hardware resource 106 that are placed into a buffer or queue (not specifically shown in
Another level of priority that may be designated in the priority level field 118 refers to a restriction or limit placed on the access of the shared hardware resource 106 by the first thread processor 102 and/or the second thread processor 104. For example, where the shared hardware resource 106 includes a set-associative cache, the lower-priority one of the first thread processor 102 and the second thread processor 104 (as designated in the priority identifier field 116) may be restricted from re-filling a designated portion of the cache, after a cache miss by that lower-priority thread processor. In another example, a cache may simply be partitioned so as to provide the higher-priority thread processor with a greater level of access. Further examples of how priority may be assigned with respect to a cache memory are provided below, e.g., with respect to
Finally in
When a plurality of shared hardware resources are used within a multi-thread processing system, the priority information 112 within the control register 110 may indicate whether, when, and to what extent each of the shared hardware resources should provide priority access to one of a plurality of thread processors attempting to access the shared hardware resource at a given time. According to the program 114, such priority indications may change over time as individual jobs of the processes of the program 114 are executed by the plurality of thread processors. Accordingly, high-priority jobs may be designated dynamically, and may be performed as quickly as if only a single thread processor were being used.
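By way of illustration only, the following sketch (in C) models one possible layout of the priority information 112 within the control register 110, using the priority identifier field 116, the priority level field 118, and the halt field 120 described herein. The bit positions, field widths, and encodings are assumptions made for the sake of the sketch, and are not intended as a definitive register map.

    #include <stdint.h>

    /* Illustrative layout for the control register 110. The field names,
       widths, and bit positions are assumptions for this sketch only. */

    #define PRIO_ID_SHIFT     0u   /* priority identifier field 116 */
    #define PRIO_ID_MASK      (0x1u << PRIO_ID_SHIFT)
    #define PRIO_LEVEL_SHIFT  1u   /* priority level field 118 */
    #define PRIO_LEVEL_MASK   (0x3u << PRIO_LEVEL_SHIFT)
    #define HALT_SHIFT        3u   /* halt field 120: one halt bit per thread processor */
    #define HALT_MASK         (0x3u << HALT_SHIFT)

    /* Example encodings for the priority level field 118. */
    enum prio_level {
        PRIO_FAIR     = 0,  /* first-come, first-served; field 116 ignored */
        PRIO_STRICT   = 1,  /* contending requests won by the prioritized thread */
        PRIO_RESTRICT = 2   /* lower-priority thread limited, e.g., in cache re-fill */
    };

    /* Returns a copy of the register value with the priority identifier
       and priority level fields updated. */
    static inline uint32_t set_priority(uint32_t reg, unsigned thread_id,
                                        enum prio_level level)
    {
        reg = (reg & ~PRIO_ID_MASK)    | ((thread_id & 0x1u) << PRIO_ID_SHIFT);
        reg = (reg & ~PRIO_LEVEL_MASK) | (((uint32_t)level & 0x3u) << PRIO_LEVEL_SHIFT);
        return reg;
    }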
Once the priority information 112 is initially set and the first thread processor 102 and the second thread processor 104 are otherwise initialized/enabled (along with the shared hardware resource 106), then a first process may be executed with the first thread processor 102 (204), while a second process may be executed with the second thread processor 104 (206). In this regard, and consistent with the terminology used above with respect to
For example, once the priority information 112 is set and the processes are executed, the first thread processor 102 may be prioritized in executing a job of the first process (208). For example, as discussed above, the first thread processor 102 may gain priority access to the shared hardware resource 106 when contending with the second thread processor 104, as determined by the controller 108 from the priority information 112 in the control register 110. For example, where the shared hardware resource 106 includes a cache, and the first thread processor 102 and the second thread processor 104 execute overlapping requests for access thereto, then the controller 108 may check the priority information 112 to determine that the first thread processor 102 should be provided access to the cache. Similarly, where the shared hardware resource 106 includes a buffer or queue, and the first thread processor 102 and the second thread processor 104 both have access requests in the buffer/queue, then the controller 108 of the buffer/queue may check the priority information 112 to move the access request(s) of the first thread processor 102 ahead of the access request(s) of the second thread processor 104.
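Continuing the above sketch (and reusing its assumed field encodings), the following hypothetical routine illustrates how a controller such as the controller 108 might arbitrate between two contending requests: unless a “fair” priority level is in effect, the request from the prioritized thread processor wins, and otherwise the older request is served first.

    /* Hypothetical arbitration step for a controller such as the
       controller 108. Each request is assumed to carry the identifier
       of the issuing thread processor and its arrival time. */

    struct request {
        unsigned thread_id;  /* which thread processor issued the request */
        uint64_t arrival;    /* e.g., cycle count at which the request arrived */
    };

    static const struct request *arbitrate(const struct request *a,
                                           const struct request *b,
                                           uint32_t control_reg)
    {
        unsigned level  = (control_reg & PRIO_LEVEL_MASK) >> PRIO_LEVEL_SHIFT;
        unsigned winner = (control_reg & PRIO_ID_MASK) >> PRIO_ID_SHIFT;

        if (level != PRIO_FAIR) {
            if (a->thread_id == winner) return a;
            if (b->thread_id == winner) return b;
        }
        /* Fair access (or neither request is from the prioritized
           thread processor): first-come, first-served. */
        return (a->arrival <= b->arrival) ? a : b;
    }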
Put another way, the second thread processor 104 may be seen as being restricted in executing a job of the second process (210). For example, the second thread processor 104 may be seen as being partially restricted, in time and/or extent, from accessing the shared hardware resource 106 in any of the cache/buffer/queue examples just given. Further, a full restriction of the second thread processor 104 may be seen to occur when the halt field 120 is set to a halt position, in which case the second thread processor 104 will stop the execution of the second process until the halt field 120 is re-set from the halt position.
Once the job(s) of the first process and/or the second process are finished, then the priority information in the control register(s) may be re-set (212). For example, the program 114 may program the control register 110 to give a certain type or extent of priority to the first thread processor 102 during a first job of the first process, and, after the first job is complete, may dynamically re-program the control register 110 to give a different type or extent of priority to the first thread processor 102. Further, although not explicitly illustrated in
Thus, the execution of the first and second processes may continue (204, 206) with the new prioritization/restriction settings in place (208, 210), and with the priority information 112 being re-set (212), as appropriate (e.g., as mandated by the program 114). This cycle(s) may continue until the first process is finished (214) and the second process is finished (216).
For example, the chip 302 includes an instruction cache 304, a data cache 306, a translation look-aside buffer (TLB) 308, and one or more buffers and/or queues (310). These examples of the shared hardware resource 106, by themselves, are generally known to include certain functions and purposes. For example, the instruction cache 304 and the data cache 306 are generally used to provide program instructions and program data, respectively, to one or both of the first thread processor 102 and the second thread processor 104. Such separation of instructions and data is generally implemented to account for differences in how and/or when these two types of information are accessed. Meanwhile, the translation look-aside buffer 308 is used as part of a virtual memory system, in which a virtual memory address is presented to the translation look-aside buffer 308 and a corresponding cache (e.g., the instruction cache 304 or the data cache 306), so that cache access and virtual-to-physical address translation may proceed in parallel. Also, the buffer/queue 310 refers generally to one or more buffers and/or queues that may be used to store commands or requests, either to the instruction cache 304, the data cache 306, the translation look-aside buffer 308, or to any number of other elements that may be included on (or in association with) the chip 302.
Additionally, a system interface 312 allows the various on-chip components to communicate with various off-chip components, usually over one or more busses represented by a bus 314. For example, a memory controller 316 may be in communication with the bus 314, so as to provide access to a main memory 318. Thus, as is known, the instruction cache 304 and/or the data cache 306 may be used as temporary storage for portions of information stored in the main memory 318, so that an access time of the first thread processor 102 and the second thread processor 104 in obtaining such information may be improved. In this regard, it should be understood that a plurality of levels of caches may be provided, so that most-frequently accessed information may be accessed most quickly from a first level of access, while less-frequently accessed information may be stored at a second cache level (which may be located off of the chip 302). In this way, access to stored information may be optimized, and a need to access the main memory 318 is minimized. However, such multi-level caches, among various other elements, are not illustrated in the example of
The instruction cache 304, the data cache 306, the translation look-aside buffer 308, and the buffer/queue 310 include, respectively, controllers 320, 322, 324, and 326. As described above, such controllers may be used, for example, when one of the instruction cache 304, the data cache 306, the translation look-aside buffer 308, or the buffer/queue 310 receives substantially simultaneous or overlapping access or use requests from both the first thread processor 102 and the second thread processor 104. For example, if the instruction cache 304 receives such competing requests from the first thread processor 102 and the second thread processor 104, then the controller 320 may access appropriate fields of the control register 110 to determine a current state of the priority information 112 contained therein (as seen in
It should be understood that similar comments may apply to the system interface 312, the main memory 318, and other elements associated with the chip 302 that may or may not be illustrated in
Multiple techniques may be used to allow the elements associated with the chip 302 to determine the priority information 112 from the control register 110. For example, the controller 320 may receive a first request for access to the instruction cache 304 from the first thread processor 102, and a second request for access from the second thread processor 104, and may thus need to access the control register 110 to determine the relevant priority information 112. In this case, the controller 320 may analyze the first request and the second request to determine a thread processor identifier associated with each request (so as to be able to correspond the requests with the appropriate thread processors), access corresponding priority fields within the control register 110, and then allow access to the access request associated with the higher-priority thread processor.
In the example of
Similarly, the controller 322 is associated with a replicated control field 330, while the controller 324 is associated with a replicated control field 332, and the controller 326 is associated with a replicated control field 334. In this way, portions of the control register 110 that are relevant to the various shared hardware resources of the chip 302 are replicated in association with the corresponding ones of the shared hardware resources, so that priority decisions may be made quickly and reliably throughout the chip 302.
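One possible software model of this replication is sketched below: a write to the master control register 110 is broadcast to shadow copies held alongside each controller. The structure and routine names are purely illustrative, and in hardware only the fields relevant to each shared hardware resource would need to be copied.

    #define NUM_CONTROLLERS 4  /* e.g., the controllers 320, 322, 324, and 326 */

    struct chip_state {
        uint32_t control_reg;               /* master control register 110 */
        uint32_t replica[NUM_CONTROLLERS];  /* replicated control fields   */
    };

    static void write_control_register(struct chip_state *chip, uint32_t value)
    {
        chip->control_reg = value;
        for (int i = 0; i < NUM_CONTROLLERS; i++)
            chip->replica[i] = value;  /* broadcast to each local copy */
    }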
In other implementations, specific fields within the control register 110 may be wired directly to corresponding ones of the controller 320, the controller 322, the controller 324, and the controller 326, in which case no replication of control fields may be required. Such a direct wiring is illustrated in
Regarding the halt field 120, it should be understood that whichever of the first thread processor 102 or the second thread processor 104 currently is assigned a higher priority may be enabled to set the halt field 120 for the other thread processor to a halt setting, so that the other thread processor may be halted. As such, both of the first thread processor 102 and the second thread processor 104 should be understood to be wired to, or otherwise in communication with, the control register 110. In this way, for example, an action of the first thread processor 102 in setting the halt field 120 for the second thread processor 104 to a halt setting (again, assuming for the example that the first thread processor 102 has the priority/authorization to do so) is automatically and quickly propagated to the second thread processor 104, and the second thread processor 104 will be halted until the first thread processor 102 re-sets the halt field 120 for the second thread processor 104 to remove the halt setting (e.g., by switching a halt bit from “1” back to “0,” or vice-versa).
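A minimal software analogy for this halt protocol follows; in hardware the halt field 120 would be wired directly to the thread processors, so the atomic flag and busy-wait here merely stand in for that signaling, and the routine names are assumptions.

    #include <stdatomic.h>

    static _Atomic uint32_t halt_field;  /* halt field 120: one bit per thread processor */

    /* Called by the higher-priority thread processor (if authorized). */
    static void halt_thread(unsigned victim_id)
    {
        atomic_fetch_or(&halt_field, 1u << victim_id);
    }

    static void restart_thread(unsigned victim_id)
    {
        atomic_fetch_and(&halt_field, ~(1u << victim_id));
    }

    /* Called by a modeled thread processor before each operation. */
    static void wait_if_halted(unsigned self_id)
    {
        while (atomic_load(&halt_field) & (1u << self_id))
            ;  /* stall until the halt bit is re-set */
    }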
In general terms, then, for example, the first thread processor 102 may issue a request for data from the data cache 306, by sending a memory address to the data cache 306. The data cache 306 may then attempt to match the memory address against the addresses it currently holds, and, if there is a match (also referred to as a “hit”), then data within the data cache 306 associated with the memory address is read from the data cache 306. On the other hand, if there is not a match (also referred to as a “miss”), then the first thread processor 102 and/or the data cache 306 must request the data at that memory address from the main memory 318. However, as already mentioned, it may be inefficient to obtain only the requested data from the main memory 318, and, instead, the requested data is retrieved from the main memory 318 together with a block of related data, all of which may then be stored in the data cache 306.
In attempting to match the requested memory address within the data cache 306, it should be understood that trying to match the requested memory address to any possible address within the data cache 306 may be relatively time-consuming, and may at least partially offset the advantage of using the data cache 306 in the first place. Therefore, in
In this way, a requested memory address is first limited to the index 410b, and only the four lines 412, 414, 416, and 418 then need to be checked for a match with the memory address (using the entirety of the memory address). If a match is found, then the corresponding data is read from the corresponding line (where the data may occupy a relatively small area of the line). If, however, a match is not found (i.e., a “miss” occurs), then an entire line (e.g., the line 414) may typically be replaced by obtaining from the main memory 318 both the requested data (i.e., data from the main memory 318 at the provided memory address) and an associated quantity of data from the main memory 318 that is related to the requested data and sufficient to fill the line.
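The lookup just described may be modeled as follows; the line size, number of sets, and address arithmetic are assumptions chosen only to make the sketch concrete.

    #define NUM_WAYS  4
    #define NUM_SETS  256      /* assumed number of indexes (sets) */
    #define LINE_SIZE 64       /* assumed line size, in bytes */

    struct cache_line {
        int      valid;
        uint32_t tag;
        /* data block omitted for brevity */
    };

    static struct cache_line cache[NUM_SETS][NUM_WAYS];

    /* Returns 1 on a hit (setting *way_out), 0 on a miss. */
    static int lookup(uint32_t addr, unsigned *way_out)
    {
        unsigned set = (addr / LINE_SIZE) % NUM_SETS;  /* index bits */
        uint32_t tag = addr / (LINE_SIZE * NUM_SETS);  /* remaining high-order bits */

        for (unsigned w = 0; w < NUM_WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *way_out = w;
                return 1;  /* hit: read the data from this line */
            }
        }
        return 0;  /* miss: an entire line must be re-filled from main memory */
    }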
In
Thus, if the first thread processor 102 requests data from the data cache 306, and it is determined from (designated bits of) the requested memory address that the requested data should be contained in the index 410b, then only the lines 412, 414, 416, and 418 need be checked for the full memory address/data. If the requested data is present (a “hit”), then the requested data is retrieved. If not (a “miss”), then the requested data is obtained together with additional, related data from the main memory 318, and is used to fill the line 412.
On the other hand, in a similar scenario with the second thread processor 104, the line 412 may not be re-filled after such a cache miss, since the line 412 is included in the way-1 402 that is reserved for re-fill by the first thread processor 102. Therefore, the second thread processor 104 would re-fill one of the remaining lines 414, 416, or 418 from the corresponding ways 404, 406, or 408.
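Extending the cache sketch above, a hypothetical victim-selection routine for such a partitioned re-fill is given below. The mask values are assumptions: the description requires only that way-1 402 be reserved for re-fill by the first thread processor 102, and the sketch further assumes that the higher-priority thread processor may re-fill any way.

    /* Ways each thread processor is permitted to re-fill, as a bit mask
       over ways 0-3 (corresponding to the ways 402, 404, 406, and 408). */
    static const unsigned refill_mask[2] = {
        0xFu,  /* first thread processor 102: any way (assumption) */
        0xEu   /* second thread processor 104: ways 1-3 only */
    };

    static unsigned choose_victim_way(unsigned thread_id, unsigned set)
    {
        /* Prefer an allowed, invalid way; otherwise take the first allowed
           way. A real replacement policy (e.g., LRU) would be masked the
           same way. */
        for (unsigned w = 0; w < NUM_WAYS; w++)
            if ((refill_mask[thread_id] & (1u << w)) && !cache[set][w].valid)
                return w;
        for (unsigned w = 0; w < NUM_WAYS; w++)
            if (refill_mask[thread_id] & (1u << w))
                return w;
        return 0;  /* unreachable while each mask has at least one bit set */
    }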
By partitioning the data cache 306 in this manner, or a related manner, a higher-priority thread processor may be more likely to have required data within the data cache 306. For example, and by comparison, in a case where both the first thread processor 102 and the second thread processor 104 are sharing fair and equal access to the data cache 306, it may be the case that the first thread processor 102 has access to the data cache 306 for a period of time, during which various cache hits and misses may occur. For each miss, as described, corresponding data (generally related to the first process of the first thread processor 102) is read from the main memory 318 and used to re-fill one or more cache lines. Later, the second thread processor 104 may gain access to the data cache 306, and, may experience a number of cache misses (since the data cache 306 has just been filled with data pertinent to the first process of the first thread processor 102), thereby causing the data cache 306 to re-fill with data related to the second process of the second thread processor 104. As the first thread processor 102 and the second thread processor 104 alternate access, then, both the first thread processor 102 and the second thread processor 104 may experience inordinate delays as their respective data is retrieved from the main memory 318.
Using the partitioning scheme of
It should be understood that the example of
Priority information may then be set (508). For example, an initial programming of the priority information 112 within the control register 110 may occur, and may be propagated to the respective shared hardware resources using direct wiring and/or the replicated control fields of
For example, in one implementation, a first job of the first process may occur (510), while priority bits may be set within the halt field 120 so as to indicate that the second thread processor 104 should be halted during this first job (512). In this way, the first thread processor 102 may complete the first job of the first process very quickly, as if the first thread processor 102 were the only thread processor present in a processing system.
Then, a second job of the first process may be executed (514), while the first thread processor 102 is provided with priority access to any available shared hardware resources (516). In other words, the priority information 112 may be re-set as described above with respect to
Once the second job of the first process is completed, the priority information 112 may be re-set again, such that the first thread processor 102 is halted during a first job of the second process (518), while the second thread processor 104 executes the first job of the second process (520). Then, the second thread processor 104 may be provided with priority access to any shared hardware resources (522) while the second thread processor 104 executes a second job of the second process (524).
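By way of illustration, a program such as the program 114 might drive the sequence of operations (510)-(524) roughly as sketched below, reusing the helper routines from the earlier sketches. The run_job() routine is a hypothetical stand-in for the actual work of each job, and writing each register value from zero is a simplification.

    extern void run_job(unsigned thread_id, unsigned job);  /* hypothetical */

    void example_schedule(struct chip_state *chip)
    {
        /* (510)/(512): halt the second thread processor during the
           first job of the first process. */
        halt_thread(1);
        run_job(0, 1);
        restart_thread(1);

        /* (514)/(516): give the first thread processor priority access
           during the second job of the first process. */
        write_control_register(chip, set_priority(0, 0, PRIO_STRICT));
        run_job(0, 2);

        /* (518)/(520): halt the first thread processor during the first
           job of the second process. */
        halt_thread(0);
        run_job(1, 1);
        restart_thread(0);

        /* (522)/(524): give the second thread processor priority access
           during the second job of the second process. */
        write_control_register(chip, set_priority(0, 1, PRIO_STRICT));
        run_job(1, 2);
    }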
Thus, it should be understood that the priority information 112, or any portion thereof, may be set or re-set at virtually any point of the first or second process, according to the program 114. Also, portions of the priority information 112 may be set statically, and maintained through most or all of the first or second process.
It should be understood that the terms “first job” or “second job” in
In this case, it may first be determined whether the process(es) require a halt of one of the thread processors (604), e.g., the second thread processor 104. If so, then a halt bit corresponding to the second thread processor 104 may be set within the halt field 120 (606), in which case the second thread processor 104 will be caused to cease operations. If a restart is not determined (608), then the second thread processor 104 remains halted until a restart is, in fact, permitted (e.g., as set by the program 114). Then, the second thread processor 104 may be restarted (610), and the execution of the process(es) may continue with, if necessary, a re-setting of the priority information 112 within the control register 110 (602).
In this example, once the priority information 112 is re-set, then no halt may be required (604). Instead, it may be determined whether requests from the first thread processor 102 and the second thread processor 104 have both been placed into a buffer and/or queue (612), such as the buffer/queue 310. If not, then it may somewhat similarly be determined whether substantially simultaneous requests have been received at a given shared hardware resource(s) (614). If so, and/or if requests from the first thread processor 102 and the second thread processor 104 have been placed into the buffer/queue 310, then the control register 110 may be checked for relevant priority information 112 (616).
More specifically, as discussed above with respect to
Once the access is finished (620), then it is permissible to allow access to the other, lower-priority thread processor (622), e.g., the second thread processor 104. In the case of a cache miss by the second thread processor 104 (624), the second thread processor 104 will operate to re-fill a cache line of the cache from the main memory 318, but, as described with respect to
Finally, the priority information may be re-set and provided to the shared hardware resources, and the process(es) may continue (602) accordingly. In this way, at least some priority information may be set and re-set dynamically, so that the flow 600 may occur differently at different times (e.g., for different jobs) of the first and second processes.
It should be understood that the flow 600 is not intended necessarily to represent a literal or temporal sequence of events, since, for example, some of the operations may occur in parallel, and some of the operations may occur in a different order than that described and illustrated. Further, other operations also may be included, since, for example, a re-setting of the priority information (602) may cause a priority level in the priority level field 118 to indicate “fair” priority, in which case neither the first thread processor 102 nor the second thread processor 104 may be able to receive priority access to shared hardware resources and/or set a halt bit for the other thread processor in the halt field 120. Similar comments are also applicable to the flowcharts 200 and 500 of
Although the above description is provided using the included terminology and examples, it should be understood that other terminology and examples also may be applicable. For example, other terminologies and examples for/of the systems of
Similarly, although the examples are provided in terms of a single chip having the first thread processor 102 and the second thread processor 104, it should be understood that more than two thread processors may be used. Additionally, or alternatively, two or more physical processors, perhaps on more than one chip, may be used to implement the techniques described herein.
Also, although the examples of
Thus, as described, prioritized access to shared hardware resources may be provided to a thread processor, to one degree or another. In some implementations, complete prioritization is provided simply by halting an operation of another thread processor(s) for some determined time. In this way, the prioritized thread processor may operate quickly and reliably, and may provide results that are comparable to a case of a single (not multi-threaded) processor.
While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the embodiments of the invention.