The present disclosure generally relates to processing devices, and more particularly relates to multiple program executions within a processing device.
The market for portable devices, for example, mobile phones, smart watches, tablets, etc., is expanding with many more features and applications. As the number of applications on these devices increases, there also is an increasing demand to run multiple applications concurrently. More features and applications call for microprocessors to have high performance, but with low power consumption. Multithreading can contribute to high performance in this new realm of application. Keeping the power consumption for the microprocessor and related cores and integrated circuit chips near a minimum, given a set of performance requirements, is desirable, especially in portable device products.
Multithreading is the ability to pursue two or more threads of control in parallel within a microprocessor pipeline. Multithreading is motivated by low utilization of the hardware resources in a microprocessor. In comparison, a multicore design is fairly wasteful of hardware resources. Multithreading can, in general, provide the same performance as multicore without duplicating resources.
Multithreading can be used in an effort to increase the utilization of microprocessor hardware and improve system performance. Multithreading is a process by which two or more independent programs, each called a “thread,” interleave execution in the same processor. Doing so is not a simple problem. Each program or thread has its own register file, and context switching to another program or thread requires saving and restoring of data from a register file to a memory. This process can consume much time and power. These and other problems confront attempts in the art to provide efficient multithreading processors and methods.
In some embodiments of the present disclosure, certain auxiliary registers of a microprocessor are pre-programmed such that thread switching is performed in hardware without any software intervention or external intervention.
Example embodiments of the present disclosure include configurations that may include structures and processes within a microprocessor. For example, a configuration may include allocating a set of mailbox registers to each thread of a plurality of threads for execution in the microprocessor, including, in a field of a mailbox register in the set of mailbox registers, an identifier of a next thread of the plurality of threads to be executed in the microprocessor upon thread switching, and switching execution of the thread to execution of the next thread based upon a thread switch condition indicated in the mailbox register and the identifier of the next thread.
Example embodiments of the present disclosure include configurations that may include structures and processes within a microprocessor. For example, a configuration may include a set of mailbox registers allocated to each thread of a plurality of threads for execution at the microprocessor, and one or more auxiliary registers allocated to one or more of the plurality of threads. A mailbox register in the set of mailbox registers allocated to each thread comprises an identifier of a next thread of the plurality of threads to which that thread switches based on satisfying a thread switch condition indicated in the mailbox register. The one or more auxiliary registers configure at least one of: a number of threads in the plurality of threads for execution at the microprocessor, a priority for thread switching, storing a program counter (PC) of each thread, or storing states of registers of each thread.
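The mailbox-driven switching described above can be sketched in a few lines of Python. This is only an illustrative software model, not the disclosed hardware: the class name `MailboxRegister`, the function `select_thread`, and the string labels used for switch conditions are hypothetical and chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class MailboxRegister:
    # Identifier of the thread to run after a switch (hypothetical field name).
    next_thread: int
    # Label of the programmed switch condition (hypothetical encoding).
    switch_condition: str


def select_thread(current: int, mailbox: MailboxRegister, event: str) -> int:
    """Return the thread that should run after `event` is observed."""
    if event == mailbox.switch_condition:
        # The condition programmed in the mailbox is met: switch threads.
        return mailbox.next_thread
    # Otherwise the current thread keeps running.
    return current


# One mailbox register per thread; thread 0 hands over to thread 1 and back.
mailboxes = {
    0: MailboxRegister(next_thread=1, switch_condition="l2_miss"),
    1: MailboxRegister(next_thread=0, switch_condition="timer"),
}
```

In this model, no software scheduler is consulted at switch time; the decision depends only on the pre-programmed mailbox contents and the observed event.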
The figures depict embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles, or benefits touted, of the disclosure described herein.
Embodiments of the present disclosure relate to different types of multithreading that can be employed at a microprocessor. Coarse-grain multithreading refers to a multithreading approach in which a thread switches on Level-2 (L2) or Level-3 (L3) cache misses, i.e., on very long latency instruction(s). Fine-grain multithreading refers to a multithreading approach in which there is a dedicated cycle for each thread, which may reduce or eliminate the load-to-use latency penalty for load instructions. Simultaneous multithreading (SMT) refers to a multithreading approach in which each thread can be in any pipeline stage at any time, which may be suitable for an out-of-order superscalar microprocessor.
Coarse grain multithreading has been used frequently as an approach for context switch program execution. A context switch is a software-controlled operation in which the register file is saved into a memory and restored when returning to the original program. Coarse grain multithreading follows the same approach as the context switch, except that hardware of a microprocessor is responsible for saving and restoring the register file. Coarse grain multithreading is particularly useful when an operation takes hundreds of cycles to complete (e.g., a very long latency operation). In this case, the processor can be better utilized by executing other programs (threads). Hardware-based thread switching can be used in the case of single thread execution as well as for fine grain multithreading or SMT. The stalled thread can be switched with another active thread. The time needed for saving and restoring the register file to and from the memory reduces the effectiveness of the second thread execution, especially when the register file is large (e.g., contains 32 entries or more).
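The difference between fine-grain and coarse-grain multithreading can be illustrated with a small scheduling model. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the coarse-grain model assumes that the next thread is chosen in round-robin order, which the description above does not specify.

```python
from typing import List, Set


def fine_grain_schedule(num_threads: int, num_cycles: int) -> List[int]:
    """Each thread owns a dedicated cycle, so threads rotate every cycle."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def coarse_grain_schedule(
    num_threads: int, miss_cycles: Set[int], num_cycles: int
) -> List[int]:
    """A thread runs until a long-latency miss occurs, then the core switches.

    `miss_cycles` holds the cycles in which the running thread misses.
    Round-robin selection of the next thread is an assumption of this sketch.
    """
    current = 0
    schedule = []
    for cycle in range(num_cycles):
        schedule.append(current)
        if cycle in miss_cycles:
            current = (current + 1) % num_threads
    return schedule
```

For example, with two threads, the fine-grain model alternates threads every cycle, whereas the coarse-grain model keeps thread 0 running until the cycle in which it misses.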
Various microprocessors have been designed in an attempt to increase on-chip parallelism through superscalar techniques, which are directed to increasing instruction level parallelism (ILP), as well as through multithreading techniques, which are directed to exploiting thread level parallelism (TLP). A superscalar architecture attempts to simultaneously execute more than one instruction by fetching multiple instructions and simultaneously dispatching them to multiple (sometimes identical) functional units of the processor. A typical multithreading operating system (OS) allows multiple processes and threads of the processes to utilize a processor one at a time, usually providing exclusive ownership of the microprocessor to a particular thread for a time slice. In many cases, a process executing on a microprocessor may stall for a number of cycles while waiting for some external resource (for example, a load from a random access memory (RAM)), thus lowering efficiency of the processor. In accordance with embodiments of the present disclosure, SMT allows multiple threads to execute different instructions from different processes at the same microprocessor, using functional units that another executing thread or threads left unused.
For certain embodiments of the present disclosure, multithreading can be controlled through auxiliary (AUX) registers. AUX registers also may be referred to as special purpose registers that are accessible by instructions. Disclosed embodiments of the present disclosure include methods for pre-programming of threads and for switching threads on hardware conditions. In one or more embodiments, a thread can switch to another thread based on a software instruction, an interrupt, and/or by configuring AUX registers to switch execution of the thread to the other thread. In some embodiments, the performance monitor can be expanded to include a thread identifier (ID) for each thread switching event.
Described embodiments include a method and apparatus for efficient control of multithreading on a single core microprocessor, such as the microprocessor 200 illustrated in
Embodiments of the present disclosure support multithreading on a single core microprocessor for different applications, for example, SMT and coarse grain multithreading, supporting any multicore customer with any multi-context application, and employing a total of 16 threads on quad-thread SMT. A single core microprocessor (e.g., the microprocessor 200 illustrated in
For some embodiments of the present disclosure, out-of-order implementation can be adapted to multithreading and implemented at the microprocessor 200. In this way, the microprocessor 200 occupies a smaller area and achieves more efficient power consumption without sacrificing running performance. As discussed above, because of out-of-order instruction execution at the microprocessor 200, the functional units with small area may be replicated, such as decode units 206, Early ALUs 208, LAQs 212 and Late ALUs 214. On the other hand, the functional units of the microprocessor 200 with large and expensive resources whose idle time can be utilized effectively, such as instruction cache 216, data cache 218, BPU 220 and FPU 222, can be shared among multiple threads. In addition, infrequently used functional resources that may execute out-of-order instructions, such as DIV 222, IMUL 224, APEX 226, may also be shared among multiple threads executed at the microprocessor 200. In an illustrative embodiment, an example of utilization of a large resource is the instruction cache 216: if the decode unit 206 can consume one instruction per clock cycle, the instruction cache 216 shared among four threads can fetch four instructions per clock cycle. If the decode unit 206 can consume two instructions per clock cycle, then the instruction cache 216 can fetch eight instructions per clock cycle.
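The fetch bandwidth figures in the illustrative embodiment above follow from a simple product of the number of threads and the per-thread decode width. The helper below is a hypothetical sketch of that arithmetic only; the function name is not part of the disclosure.

```python
def required_fetch_width(num_threads: int, decode_width: int) -> int:
    """Instructions per clock cycle that a shared instruction cache supplies
    when each of `num_threads` decode units consumes `decode_width`
    instructions per clock cycle."""
    return num_threads * decode_width
```

With four threads, a decode width of one gives four instructions per clock cycle, and a decode width of two gives eight, matching the example in the text.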
A microprocessor architecture presented in this disclosure such as the microprocessor 200 includes a very flexible and robust AUX register set (not shown in
As illustrated in
A field 306 of the thread configuration register 300 may comprise SMT[6:5] bits, which indicate a number of simultaneous threads in the core only with the MT bit set to 1. In some embodiments, a scratch memory (not shown) can be implemented in the microprocessor 200 to save and restore the register file 210 for thread switching. In this case, the save and restore mechanism for thread switching can be performed in hardware without assistance of software programming. SMT[6:5]=00 indicates support for a single thread, wherein one register file (e.g., the register file 210 of the microprocessor 200 shown in
A field 404 of the thread enable register 400 shown in
The field 406 of the thread enable register 400 illustrated in
For some embodiments, the thread PC register 500 can be modified using an AUX interface of the microprocessor. In an embodiment, an architecture PC per thread can be implemented in a commit stage. A commit queue may be configured to update a specific thread PC register 500 using a thread ID. In one or more embodiments, the architecture PCs associated with all supported threads can also be accessible from the AUX interface.
For some embodiments, when a thread is inactive, the current processor states are written into a thread-processor-state register 600 that is associated with the inactive thread. When a thread is active, a thread-processor-state register 600 associated with the active thread can be restored to current processor states. In one or more embodiments, a thread-processor-state register 600 can be modified using an AUX interface of a microprocessor.
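As a rough software model of the per-thread PC bookkeeping described above, the following sketch keeps one architecture PC per thread, updates it from a commit step keyed by thread ID, and exposes it through read and write accessors standing in for the AUX interface. The class and method names are hypothetical.

```python
class ThreadPCFile:
    """One architecture PC per supported thread (illustrative model only)."""

    def __init__(self, num_threads: int) -> None:
        self.pc = [0] * num_threads

    def commit(self, thread_id: int, next_pc: int) -> None:
        # The commit stage updates only the PC of the committing thread.
        self.pc[thread_id] = next_pc

    def aux_read(self, thread_id: int) -> int:
        # Stand-in for reading a thread PC register through the AUX interface.
        return self.pc[thread_id]

    def aux_write(self, thread_id: int, value: int) -> None:
        # Stand-in for modifying a thread PC register through the AUX interface.
        self.pc[thread_id] = value
```

Keeping the PCs indexed by thread ID means that a commit for one thread never disturbs the resume point of another.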
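The save and restore behavior of the thread-processor-state registers can be sketched as follows. This is a simplified model under stated assumptions: the processor state is represented as a dictionary, and the key `"flags"` used in the usage example is purely hypothetical, since the exact state contents are not enumerated here.

```python
class ProcessorStateBank:
    """Per-thread copies of the processor state (illustrative model only)."""

    def __init__(self, num_threads: int) -> None:
        self.saved = [dict() for _ in range(num_threads)]
        self.current = {}

    def deactivate(self, thread_id: int) -> None:
        # Thread becomes inactive: write the current states into its register.
        self.saved[thread_id] = dict(self.current)

    def activate(self, thread_id: int) -> None:
        # Thread becomes active: restore its register into the current states.
        self.current = dict(self.saved[thread_id])
```

A typical sequence is to call `deactivate` for the thread being switched out, followed by `activate` for the thread being switched in.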
For some embodiments, the processor states stored in a thread-processor-state register 600 shown in
When a programmed event for thread switching is triggered, the thread mailbox FIFO register may interrupt processor execution of the current thread, leading to thread switching. The current thread being switched can be recycled by pushing an entry of the thread comprising the mailbox register 700 back into the thread mailbox FIFO register, as discussed in more detail below in relation to
Embodiments of the present disclosure support three models for thread switching. In one embodiment, a switched thread may send (or transmit) an interrupt signal to a control processor to configure an identifier of a next thread in the mailbox register 700. For example, the control processor may provide, in response to the received interrupt signal, the identifier of the next thread in the field 702 of the mailbox register 700. The control processor can write to the thread mailbox FIFO register and other AUX registers from an external source. In another embodiment, as discussed, a mailbox register 700 of the current thread can be pushed back (or recycled) into the thread mailbox FIFO register. In yet another embodiment, the thread mailbox FIFO register can be configured to trigger writing the current thread that is being switched into an external memory dedicated for thread switching. The mailbox register 700 of the current thread that is being switched can be further utilized to obtain identification of the next thread from the field 702, and the next thread can be loaded from the external memory for future execution based on the identification of the next thread.
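The three thread switching models described above can be contrasted in a short software sketch. All names here are hypothetical, a mailbox entry is modeled as a dictionary with `"thread"` and `"next_thread"` keys, and the external memory is modeled as a dictionary keyed by thread identifier. How the next thread's entry is brought back from external memory is an assumption of this sketch.

```python
from collections import deque
from enum import Enum


class SwitchModel(Enum):
    CONTROL_PROCESSOR = 1  # next thread supplied by a control processor
    RECYCLE = 2            # current entry pushed back into the FIFO
    EXTERNAL_MEMORY = 3    # current entry written to dedicated memory


def switch_out(entry, fifo, model, control_processor=None, external_memory=None):
    """Handle the mailbox entry of the thread being switched out.

    Returns the identifier of the next thread.
    """
    if model is SwitchModel.CONTROL_PROCESSOR:
        # The interrupt is modeled as a callback that returns the next thread.
        entry["next_thread"] = control_processor(entry["thread"])
    elif model is SwitchModel.RECYCLE:
        fifo.append(entry)
    elif model is SwitchModel.EXTERNAL_MEMORY:
        external_memory[entry["thread"]] = entry
        # Assumption: the next thread's entry is loaded back for execution.
        loaded = external_memory.pop(entry["next_thread"], None)
        if loaded is not None:
            fifo.append(loaded)
    return entry["next_thread"]
```

In all three cases the identifier of the next thread is ultimately taken from the mailbox entry of the thread being switched out; the models differ only in who fills in that identifier and where the switched-out entry goes.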
Specifically, the field 702 of the mailbox register 700 shown in
A mailbox entry corresponding to one of the threads may be written into a location of the thread mailbox FIFO register 802 based on a write pointer 804. In an embodiment, a mailbox entry 806 being part of the instruction can be written into the thread mailbox FIFO register 802 at a location indicated by the write pointer 804. In another embodiment, a mailbox entry 808 originating from an external source may be written into the thread mailbox FIFO register 802 at a location indicated by the write pointer 804. An external source may be an external memory dedicated for thread switching, an external control processor that configures a next thread based on an interrupt signal of a current thread, etc. In yet another embodiment, a recycled mailbox entry 810 may be pushed back and written into the thread mailbox FIFO register 802 at a location indicated by the write pointer 804. In some embodiments, upon detection of a condition to write a mailbox entry into the thread mailbox FIFO register 802 and writing the mailbox entry as in any of the aforementioned embodiments, the write pointer 804 may be incremented to point to a next mailbox entry location of the thread mailbox FIFO register 802. In an embodiment, a FULL condition indicating that the thread mailbox FIFO register 802 is full can be detected when the write pointer 804 is incremented to be the same as a read pointer 812 that points to a mailbox entry to be read next from the thread mailbox FIFO register 802. In this case, another write operation to the thread mailbox FIFO register 802 would cause an exception, since the mailbox entry location to which the write pointer 804 points is not empty.
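The write pointer and FULL condition behavior described above can be modeled as a small circular buffer. The sketch below is illustrative only: the class and exception names are hypothetical, wraparound of the pointers at the end of the buffer is an assumption, and an explicit `full` flag is used because equal read and write pointers would otherwise also describe an empty FIFO.

```python
class MailboxFifoFull(Exception):
    """Raised when a write targets a location that is not empty."""


class ThreadMailboxFifo:
    def __init__(self, depth: int) -> None:
        self.entries = [None] * depth
        self.write_ptr = 0
        self.read_ptr = 0
        self.full = False

    def write(self, entry) -> None:
        if self.full:
            raise MailboxFifoFull("write pointer points to a non-empty entry")
        self.entries[self.write_ptr] = entry
        # Increment the write pointer (wraparound is an assumption).
        self.write_ptr = (self.write_ptr + 1) % len(self.entries)
        # FULL is detected when the incremented write pointer meets the read pointer.
        self.full = self.write_ptr == self.read_ptr

    def read(self):
        entry = self.entries[self.read_ptr]
        if entry is None:
            raise IndexError("no mailbox entry at the read pointer")
        self.entries[self.read_ptr] = None
        self.read_ptr = (self.read_ptr + 1) % len(self.entries)
        self.full = False
        return entry
```

The same `write` method serves all three sources of entries named in the text (an instruction, an external source, or a recycled entry), since each of them writes at the location indicated by the write pointer.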
In some embodiments, the read pointer 812 points to a mailbox (thread) entry of the thread mailbox FIFO register 802 that may correspond to one thread of one or more threads running in a microprocessor. As discussed above, a condition to switch from that current thread to a next thread can be indicated in the SC field 704, and the next thread can be identified in the NT field 702 of the mailbox entry to which the read pointer 812 points. Upon detection of the condition to switch from the current thread to the next thread, the read pointer 812 may be incremented to point to a mailbox entry associated with the next thread. In an embodiment, a mailbox entry corresponding to the thread being switched can be removed from the thread mailbox FIFO register 802, if the R field 706 indicates that thread recycling is not enabled. In another embodiment, the mailbox entry 810 corresponding to the thread being switched can be recycled and pushed back into the thread mailbox FIFO register 802 based on a Recycle indication 814 set by an appropriate value of the R field 706, as illustrated in
In some embodiments, when SMT is employed in the microprocessor 200 shown in
Embodiments of the present disclosure support thread switching in a multithreading environment. As discussed herein, efficient thread switching may be achieved without software involvement. The thread switching presented in this disclosure may be programmable through internal programming or an external control processor. The thread switching presented herein can be flexible through implementation of several different priority schemes.
The thread switching presented in this disclosure is more efficient since it is implemented in hardware rather than in software. For example, typical software-based thread switching utilizes polling, which wastes power. In contrast, a multithread processor with mailbox registers and other AUX registers presented in this disclosure provides an efficient method of thread switching implemented in hardware. The implementation presented herein, based on mailbox registers organized in a FIFO manner, is flexible and can be programmed by internal or external sources. The hardware-based method of thread switching presented in this disclosure also reduces software complexity for implementation of multithread microprocessors.
Additional Considerations
The foregoing description of the embodiments of the disclosure has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the disclosure in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover: a, b, c, a-b, a-c, b-c, and a-b-c.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the disclosure may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Embodiments of the disclosure may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the disclosure be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.
Number | Date | Country |
---|---|---|
20170351518 A1 | Dec 2017 | US |