The present invention relates to a dual execution mode processor, and more particularly, to techniques for switching between two (thread and lane) modes of execution.
Typical parallel programs consist of alternating serial and parallel regions. Existing approaches to running parallel programs rely on a “discontinuity” of the instruction stream. For example, execution goes from single-threaded to multi-threaded in conventional CPUs, and from the main CPU to a separate accelerator in CPU+GPU systems. There are notable limitations to this approach, such as the overhead of the discontinuity, the large granularity of the regions, and the necessity of “communication” (even with shared memory) between regions.
Therefore, improved techniques for executing parallel programs, and for switching between serial and parallel regions, would be desirable.
The present invention provides techniques for switching between two (thread and lane) modes of execution in a dual execution mode processor. In one aspect of the invention, a method for executing a single instruction stream having alternating serial regions and parallel regions in a same processor is provided. The method includes the steps of: creating a processor architecture having, for each architected thread of the single instruction stream, one set of thread registers, and N sets of lane registers across N lanes; executing instructions in the serial regions of the single instruction stream in a thread mode against the thread registers; executing instructions in the parallel regions of the single instruction stream in a lane mode against the lane registers; and transitioning execution of the single instruction stream from the thread mode to the lane mode or from the lane mode to the thread mode.
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.
Provided herein are techniques for implementing dual (thread and lane) execution modes in the same processor executing a single stream of instructions (using a unified processor instruction set architecture instruction stream that can alternate between serial and parallel regions). Accordingly, one processor herein executes one stream of instructions but operates in two modes, and what an instruction does depends on the mode. Specifically, the present techniques accomplish vectorization in space by replicating the same instruction across multiple (architected) lanes, with a different set of registers for each lane. At any given time, a lane can be in one of two states: enabled or disabled. Enabled lanes perform operations; disabled lanes do not. It is noted that the terms “enabled” and “disabled,” as used herein, refer to the architected lanes. Techniques are then provided herein for switching between the two modes of execution.
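By way of illustration only, the dual-mode behavior described above may be modeled with the following sketch (a software model for exposition; the register names, lane count, and helper functions are illustrative assumptions, not part of the claimed hardware):

```python
# One instruction stream, two execution modes: in thread mode an
# operation runs once against the thread registers; in lane mode the
# same operation is replicated across N architected lanes, each with
# its own register set, and only enabled lanes perform it.

N_LANES = 4  # number of architected lanes (example value)

thread_regs = {"r0": 0, "r1": 0}
lane_regs = [{"r0": 0, "r1": 0} for _ in range(N_LANES)]
lane_enabled = [True] * N_LANES   # enabled/disabled state per architected lane

def execute(op, mode):
    """Apply one instruction; its effect depends on the current mode."""
    if mode == "thread":
        op(thread_regs)                # one operation, thread registers
    else:  # lane mode
        for i in range(N_LANES):       # same op, once per architected lane
            if lane_enabled[i]:        # disabled lanes do not perform operations
                op(lane_regs[i])

# Example: the same "add 1 to r0" instruction in both modes.
inc_r0 = lambda regs: regs.update(r0=regs["r0"] + 1)
execute(inc_r0, "thread")   # r0 incremented once, in the thread registers
lane_enabled[2] = False     # disable one architected lane
execute(inc_r0, "lane")     # r0 incremented only in the enabled lanes
```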
Use of a unified processor instruction stream that can alternate between serial and parallel regions results in inexpensive transitions between regions and inexpensive region-to-region data exchange, and therefore regions can be as small as a single instruction. Use of the present techniques leads to efficient execution of the parallel regions of a program. In particular, programs that run well on the older models (multiple CPUs, CPU+GPUs) also run well here. Furthermore, there are other programs that do not run efficiently on the older models but run well in accordance with the present techniques.
As will be described in detail below, the present techniques involve executing an instruction stream that consists of alternating serial and parallel regions. By way of example only,
According to the present techniques, the instructions in the serial regions of the instruction stream 100 are executed in what is termed herein as “thread mode,” and the instructions in the parallel regions of the instruction stream 100 are executed in what is termed herein as “lane mode.” Specifically, provided herein is a unique processor architecture which includes, for each architected thread of instructions, one set of thread registers and N sets of lane registers. Accordingly, the instructions in the serial regions of the instruction stream 100 are executed in thread mode against the thread registers. The instructions in the parallel regions of the instruction stream 100 are executed in lane mode against the lane registers. The thread and lane registers will be described in detail below.
Methodology 200 of
The processor executes a single stream of instructions. As shown in step 202 of
Serial regions of the instruction stream 100 are processed, as per step 204, in thread mode using the thread registers, while parallel regions of the instruction stream 100 are processed, as per step 206, in lane mode using the lane registers. As shown in
In step 208, the manipulated data is stored in memory (storage), and the process is repeated beginning at step 202 with the next branch instruction. As shown in
A more detailed description of the thread and lane registers is now provided. As described above, each architected thread in the processor has one set of thread registers. According to an exemplary embodiment, the set of thread registers contains at least one of the following component registers: general purpose registers (GPR), floating point registers (FPR), vector registers (VR), status registers (SR), condition registers (CR), and auxiliary registers (AR). As provided above, the present techniques involve a single processor executing a single stream of instructions, wherein the processor can operate in either thread or lane mode. When operating in thread mode, the instruction stream will be executed by the processor against this set of thread registers.
As shown in
By contrast, each architected thread in the processor has N sets of lane registers. According to an exemplary embodiment, each set of lane registers contains at least one of the following component registers: general purpose registers (GPR), floating point registers (FPR), vector registers (VR), status registers (SR), condition registers (CR), and auxiliary registers (AR). As provided above, the present techniques involve a single processor executing a single stream of instructions, wherein the processor can operate in either thread or lane mode. When operating in lane mode, the instruction stream will be executed by the processor against each set of lane registers.
In one exemplary embodiment, the thread registers contain the same combination of component registers as at least one set of the lane registers. Alternatively, according to another exemplary embodiment, the thread registers contain a different combination of component registers from one or more sets of the lane registers. When the components of the set of thread registers are the same as the components of one set of the lane registers, there is a one-to-one correspondence between thread registers and lane registers. In this case, the semantics of an instruction in lane mode can be obtained from the semantics of the instruction in thread mode by substituting the corresponding lane register for the corresponding thread register. When the components of the set of thread registers are different from the components of one set of the lane registers, the correspondence between them is not exactly one-to-one. That requires different definitions for the semantics of instructions in thread and lane mode.
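By way of illustration only, the register-substitution rule described above may be sketched as follows (the GPR file sizes and the resolve helper are illustrative assumptions, not part of the claimed architecture):

```python
# When the thread register set and a lane register set have the same
# components, an instruction's lane-mode semantics follow from its
# thread-mode semantics by substituting the lane's register for the
# corresponding thread register, as modeled here.

N_LANES = 2
thread_gpr = [0] * 32                          # thread-mode GPR file
lane_gpr = [[0] * 32 for _ in range(N_LANES)]  # one GPR file per lane

def resolve(reg_index, mode, lane=None):
    """Map an architected GPR index to its backing storage for the mode."""
    if mode == "thread":
        return thread_gpr, reg_index
    return lane_gpr[lane], reg_index           # same index, lane's file

# An "add immediate 5 to GPR 3" instruction in thread mode touches the
# thread file...
regs, i = resolve(3, "thread")
regs[i] += 5
# ...and in lane mode touches lane 1's file, with identical semantics.
regs, i = resolve(3, "lane", lane=1)
regs[i] += 5
```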
As shown in
In one exemplary embodiment, the same combination of lane registers is present in each set. In that case, each architected thread in the processor has N identical sets of lane registers. However, single-instance auxiliary registers may also be part of the architected state. For example, as shown in
A transition from thread mode to lane mode entails performing the same operation multiple times, once for each (architected) lane in the processor. For the execution of an instruction in lane mode, the processor can be designed with multiple physical lanes, e.g., multiple hardware resources to support simultaneous execution of the instructions. Thus, in transitioning from thread mode to lane mode, the processor goes from performing one operation (serially) per instruction against one register set at a time—see above—to performing the same operation multiple times per instruction (in parallel) against multiple sets of registers on multiple (architected) lanes. A distinction is made herein between physical lanes in the processor and architected lanes. For instance, as known in the art, a multi-lane vector processor has multiple physical lanes which enable parallel data processing. Architected lanes, on the other hand, are virtual lanes constructed to run on the physical lanes of the processor. The number of physical lanes is decided based on hardware constraints such as area, power consumption, etc. Each physical lane is a hardware unit capable of executing the operation defined by an instruction. When a processor has multiple physical lanes, they can execute in parallel, with multiple operations performed at the same time. Architected lanes are a construct to provide virtualization: multiple architected lanes can be multiplexed on top of the existing physical lanes. This virtualization is implemented at the hardware level—each instruction generates multiple operations. Let a processor have N architected lanes and L physical lanes. In the case of a one-to-one mapping of architected to physical lanes (L=N), the processor will operate as follows:
Thus, in accordance with the present techniques, if for example in lane mode there are 8 sets of lane registers (N=8) and thus 8 architected lanes, then in order to process the instruction stream in lane mode with a processor having 4 physical lanes the process will have to be repeated multiple times. To give a simple example, at least two iterations of the process would be required to process the instruction stream in lane mode across 8 architected lanes for a processor having 4 physical lanes. Assuming all 4 (physical) lanes are being used, then exactly two iterations would be required. However, it may be the case that only a portion of the processor is devoted to the present computation, thus requiring more iterations. For instance, if two (physical) lanes of the processor are devoted to the computation then four iterations would be needed to process the instruction stream in lane mode across 8 architected lanes.
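The iteration counts in the examples above follow a simple relationship, sketched below (the function name is an illustrative assumption):

```python
import math

# Number of hardware passes needed to execute one lane-mode instruction,
# given N architected lanes multiplexed over the physical lanes devoted
# to the present computation.

def iterations(architected_lanes, physical_lanes_in_use):
    return math.ceil(architected_lanes / physical_lanes_in_use)

assert iterations(8, 8) == 1   # one-to-one mapping (L = N): a single pass
assert iterations(8, 4) == 2   # 4 physical lanes in use: two iterations
assert iterations(8, 2) == 4   # 2 physical lanes in use: four iterations
```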
Branch instructions control the evolution of the instruction stream. Namely, the default condition is to execute the instruction at the next sequential memory address. Only branch instructions can change that flow. According to the present techniques, branches always have the same semantics, independent of thread/lane mode. Conditional branches always test a thread condition register. In one exemplary embodiment, branches to an address contained in a register always use a thread register. As will be described in detail below, execution preferably begins in thread mode and explicit instructions are used to transition from thread to lane mode.
As described generally above, data moves back and forth between the registers and storage. For example, data is fetched from the storage and loaded into the (thread and/or lane) registers where it is manipulated by the instruction stream. The manipulated data can then be stored back in memory. See description of
According to an exemplary embodiment of the present techniques, these storage access instructions are (thread or lane) mode dependent. For example, when the stream of instructions is being executed in thread mode, load and store operations are always applied to the thread registers. The thread registers are thus used as the data source, the data target, and the address source. As provided above, in thread mode operations are performed (serially) one at a time. Thus, each load/store operation is executed unconditionally in thread mode and causes one memory operation.
By contrast, when the instructions are being executed in lane mode, the load and store operations are always applied to lane registers. Accordingly, the lane registers are used as the data source, the data target, and the address source. In lane mode operations are performed (in parallel) on N (architected) lanes. Thus, each load/store operation is executed once per lane, with up to N memory operations/instructions. However, as highlighted above, operations in lane mode are conditional on the state (enabled/disabled) of each (architected) lane. For instance, when performing load/store operations in lane mode across multiple lanes, only those lanes that are enabled can be used. Thus, load/store execution in lane mode is contingent upon whether a given lane is enabled or not.
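By way of illustration only, the mode-dependent store behavior may be modeled as follows (register fields, addresses, and the lane count are illustrative assumptions, not part of the claimed hardware):

```python
# In thread mode a store is one unconditional memory operation against
# the thread registers; in lane mode the store executes once per
# architected lane, but only in the lanes that are enabled.

memory = {}
thread_regs = {"addr": 100, "data": 7}
lane_regs = [{"addr": 200 + 8 * i, "data": i} for i in range(4)]
lane_enabled = [True, True, False, True]

def store(mode):
    """Execute one store instruction in the given mode."""
    if mode == "thread":
        memory[thread_regs["addr"]] = thread_regs["data"]  # one memory op
    else:
        for i, regs in enumerate(lane_regs):
            if lane_enabled[i]:          # disabled lanes skip the operation
                memory[regs["addr"]] = regs["data"]

store("thread")   # unconditional: exactly one memory operation
store("lane")     # conditional per lane: three memory operations here,
                  # since lane 2 is disabled
```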
Similarly, according to an exemplary embodiment of the present techniques, arithmetic and logic instructions are also (thread or lane) mode dependent. For example, when the stream of instructions is being executed in thread mode, arithmetic and logic instructions are always applied to the thread registers. The thread registers are thus used as the data source and the data target. As provided above, in thread mode operations are performed (serially) one at a time. Thus, each arithmetic/logic instruction is executed unconditionally in thread mode and causes one operation.
By contrast, when the instructions are being executed in lane mode, the arithmetic and logic instructions are always applied to lane registers. Accordingly, the lane registers are used as the data source and the data target. In lane mode, operations are performed (in parallel) on N (architected) lanes. Thus, each arithmetic/logic instruction is executed once per lane, with up to N operations/instructions. However, as highlighted above, operations in lane mode are conditional on the state (enabled/disabled) of each lane. For instance, when executing arithmetic/logic instructions in lane mode across multiple lanes, only those lanes that are enabled can be used. Thus, arithmetic/logic instruction execution in lane mode is contingent upon whether a given lane is enabled or not.
It is notable that when operating in lane mode, according to one exemplary embodiment of the present techniques the instructions are executed in lockstep across all of the lanes. Namely, the (same) instruction which is dispatched to each of the lanes is executed at the same time, in parallel across each of the lanes. Alternatively, according to another exemplary embodiment of the present techniques the instructions dispatched in lane mode are executed asynchronously (i.e., not at the same time) across the lanes. By way of example only, execution of instructions at one or more of the lanes might be contingent upon completion of an operation at one or more other of the lanes.
Independent of the way instructions are executed, instruction execution can follow global program order or local program order. With global program order, the effects of all previous dependent instructions on all of the lanes are visible to the current executing instruction in each lane. With local program order, only the effects of previous dependent instructions on the same lane are guaranteed to be visible in each lane.
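The difference between the two orderings may be illustrated with the following sketch, in which instruction 1 writes a per-lane value and instruction 2 reads a neighboring lane's value (a software model for exposition only; it is not a hardware definition of either ordering):

```python
# Global program order: instruction 1 completes on all lanes before
# instruction 2 runs anywhere, so every lane sees every lane's write.
# Local program order: each lane only guarantees visibility of its own
# prior instructions, so a cross-lane read may see a stale value.

N = 4

def run(order):
    x = [0] * N                      # per-lane values written by instruction 1
    out = [0] * N                    # per-lane values read by instruction 2
    if order == "global":
        for i in range(N):           # instruction 1 on all lanes first...
            x[i] = i + 1
        for i in range(N):           # ...then instruction 2 reads a neighbor
            out[i] = x[(i + 1) % N]
    else:  # local order: each lane runs both instructions independently
        for i in range(N):
            x[i] = i + 1             # own write is visible to own read,
            out[i] = x[(i + 1) % N]  # but the neighbor may not have run yet
    return out

global_out = run("global")   # every read sees the neighbor's new value
local_out = run("local")     # wrap-around reads see stale (zero) values
```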
Transitioning between thread and lane mode execution will be described in detail below. In general, however, bridge instructions inserted within the instruction stream can be used to explicitly control the execution mode of the stream. Namely, bridge instructions can encode when serial regions or parallel regions of the instruction stream exist and should thus be executed in thread or lane mode, respectively. According to an exemplary embodiment, bridge instructions in the instruction stream can be executed and have the same semantics in either (thread or lane) mode. The transitioning from thread mode to lane mode, and vice versa, can involve copying thread registers to lane registers and vice versa.
In step 504, instructions in the serial regions of the instruction stream are executed in thread mode against the thread registers. Thread mode execution can involve dispatching the thread mode instructions once to the thread registers. As provided above, in thread mode instructions are preferably always applied to the thread registers and each instruction is executed unconditionally and causes one operation.
In step 506, instructions in the parallel regions of the instruction stream are executed in lane mode against the lane registers. Lane mode execution can involve dispatching the same instruction multiple times, i.e., dispatching the lane mode instructions N times, once for each of the N lanes. As provided above, in lane mode instructions are preferably always applied to the lane registers and each instruction is executed once per lane, with up to N operations/instructions. Execution of lane mode instructions is however contingent upon the state of the lane (enabled/disabled). Thus, as shown in
As provided above, transitioning execution of the instruction stream from thread mode to lane mode, or vice versa can include copying the thread registers to the lane registers or vice versa. This can be performed in step 508 in response to a bridge instruction encoded in the instruction stream signifying a transition from a serial region to a parallel region of the stream, or vice versa.
Given the above description of dual (thread and lane) execution modes for an instruction stream, techniques are now provided for transitioning the processor core from thread to lane mode (and vice versa) and for enabling data transfer between the two modes. As provided above, in thread mode there is one set of registers and in lane mode there are multiple sets of registers (one set per lane). There is, however, only a single set of instructions. Thus the challenge becomes how to transition from having one set of registers to having one set of registers per lane. Provided below are techniques for instructing the processor that, following a thread-to-lane or lane-to-thread mode transition, an instruction has a different meaning.
In general, these thread-to-lane/lane-to-thread mode transitioning techniques involve the following actions: prepare and transfer the necessary state from thread to lane resources, change the processor to lane mode, execute multi-lane computation, prepare and transfer the necessary state from lane to thread resources, and change the processor to thread mode. A detailed description of the present thread-to-lane/lane-to-thread mode transitioning techniques is now provided by way of reference to methodology 600 of
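By way of illustration only, the sequence of actions listed above may be sketched as follows (the Core class and its reduction step are illustrative assumptions, not the claimed hardware):

```python
# Sketch of the transition sequence: prepare/transfer state from thread
# to lane resources, change to lane mode, execute the multi-lane
# computation, transfer state back, and change to thread mode.

class Core:
    def __init__(self, n_lanes):
        self.mode = "thread"                  # execution begins in thread mode
        self.thread_regs = {"r0": 0}
        self.lane_regs = [{"r0": 0} for _ in range(n_lanes)]

    def to_lane_mode(self):
        for regs in self.lane_regs:           # transfer thread state to lanes
            regs.update(self.thread_regs)
        self.mode = "lane"                    # flag: subsequent ops are lane ops

    def to_thread_mode(self, reduce=sum):
        # transfer/combine lane results back into the thread registers
        self.thread_regs["r0"] = reduce(r["r0"] for r in self.lane_regs)
        self.mode = "thread"

core = Core(n_lanes=4)
core.thread_regs["r0"] = 1
core.to_lane_mode()                           # serial region ends
for regs in core.lane_regs:                   # multi-lane computation
    regs["r0"] += 1
core.to_thread_mode()                         # parallel region ends
```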
According to an exemplary embodiment, execution of the instruction stream starts out in thread mode. See step 602. Thread mode execution will continue until, as per step 604, either a request is made to transition into lane mode execution (i.e., a lane mode request) or the computation is finished. If the computation is finished, then the process is ended.
On the other hand, if a lane mode request is encountered, then in step 606 a transition is made from thread to lane mode. As provided above, bridge instructions encoded in the instruction stream can signify when serial regions or parallel regions of the instruction stream exist and should thus be executed in thread or lane mode, respectively. Thus the transition from thread to lane mode, or vice versa can be in response to a bridge instruction encoded in the instruction stream.
According to the present techniques, the initiation of a transition from thread mode to lane mode is a fairly straightforward process. Namely, the thread-to-lane mode transition occurs (voluntarily) based on encountering an explicit instruction such as a lane mode request. By comparison, as will be described in detail below, transitioning from lane mode to thread mode can occur either voluntarily (i.e., in response to encountering an explicit instruction such as a thread mode request) or involuntarily, when an exception occurs during one of the instructions in lane mode—see below.
A detailed description of the process of switching the processor core from thread mode to lane mode (as per step 606) is provided in conjunction with the description of
In transitioning the core from thread to lane mode a special instruction can be invoked that will, e.g., set a special flag/register in the processor core such that all subsequent instructions are executed in lane mode. See for example step 706 of
In step 702, the state of the processor core is transferred from thread mode to lane mode. According to an exemplary embodiment, step 702 involves, but is not limited to, i) transferring content from the thread registers to the lane registers (see above), ii) initializing one or more of the lane registers, iii) allocating a memory stack for each lane and setting the lane stack registers correspondingly, and/or iv) setting the table of contents (TOC) pointer of each lane to the thread TOC (such that the process can continue in lane mode where the thread mode execution ended).
In step 704, all of the (architected) lanes are marked as enabled. It is notable that lanes can be subsequently enabled or disabled using special instructions. A description of enabled/disabled lanes was provided above. Lanes are enabled/disabled to implement control flow divergence. Control flow divergence happens when the instruction stream contains instructions that should not be executed in some of the lanes. Those lanes must then be disabled. At a later point in the execution, control flow reconverges (that is, instructions should again be executed in lanes that were disabled) and disabled lanes are enabled again.
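By way of illustration only, divergence and reconvergence via the enabled/disabled lane states may be modeled as follows (the values and mask handling are illustrative assumptions):

```python
# Control flow divergence: lanes whose data fails a branch condition
# are disabled for the conditional region, so the replicated instruction
# has no effect there; at reconvergence the previous mask is restored
# and the disabled lanes are enabled again.

N_LANES = 4
lane_regs = [{"x": v} for v in (3, -1, 5, -2)]
lane_enabled = [True] * N_LANES

# Divergent region: execute "x = x * 2" only where x > 0.
saved_mask = lane_enabled[:]                        # remember current mask
lane_enabled = [regs["x"] > 0 for regs in lane_regs]
for i in range(N_LANES):
    if lane_enabled[i]:                             # disabled lanes skip this
        lane_regs[i]["x"] *= 2
lane_enabled = saved_mask                           # reconvergence: re-enable
```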
Finally in step 706, a special instruction is invoked to change the processor mode to lane mode. According to an exemplary embodiment, the special instruction sets a special flag/register within the processor core such that all following instructions are executed in lane mode.
It is notable that, in accordance with the present techniques, memory is shared between thread and lane computations. Instructions in lane mode access the same memory address space as instructions in thread mode, and vice versa.
As provided above, the transition of the processor core from lane mode to thread mode can be slightly more complicated. Specifically, when the processor core is operating in lane mode, a switch to thread mode can occur either (voluntarily) in response to an explicit switching instruction such as a thread mode request, or (involuntarily) when an instruction causing an exception occurs (i.e., thus making thread mode a default state). The first case (case A: Explicit instructions) is described in conjunction with the description of methodology 800A
In step 802A, the state of the processor core is transferred from lane mode to thread mode. According to an exemplary embodiment, step 802A involves, but is not limited to, i) saving the lane registers to memory (see, for example, step 208 of
In step 804A, a special instruction is invoked to change the processor mode to thread mode. According to an exemplary embodiment, the special instruction sets a special flag/register within the processor core such that all following instructions are executed in thread mode. In step 806A, the state used by the lanes is freed, and in step 808A the instruction stream is executed in thread mode. By “state” we mean possible CPU and memory resources (e.g., stack space) that were allocated by the compiler before starting lane mode.
Alternatively, the lane mode can also be interrupted and the core returned to the normal thread mode when an exception occurs during one of the instructions in lane mode. In that case, an exception handler will change the core to thread mode and a return from interrupt will restore the lane mode status. See for example
In this example, as per step 802B, during execution of the instruction stream in lane mode an instruction occurs causing an exception. A program counter (PC) marks or points to the current instruction (or alternatively the next instruction) being executed. When an exception occurs it is desirable to interrupt the lane mode execution and return the core to the normal (default) thread mode. An attempt will however be made to return to the desired lane mode once the exception has been handled. Thus, in step 804B, the necessary state is saved to subsequently resume lane mode. According to an exemplary embodiment, this includes saving the state of the lanes causing the exception and/or saving the state of the lane registers.
Next, in step 806B, instructions are invoked to switch from lane to thread mode. Once the core is transitioned back into the normal thread mode, the exception can be resolved (i.e., handled). See step 808B. Exceptions can be resolved using an exception handler as known in the art. Once the exception has been resolved, lane mode status can be restored. For instance, in step 810B, the lane mode state is restored and in step 812B the core is transitioned to lane mode. In step 814B, the computation is resumed from where it was left off in step 804B (see above) and the instructions at the lanes causing the exception are retried.
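By way of illustration only, the exception-driven sequence of steps 802B-814B may be modeled as follows (a software stand-in in which a divide-by-zero plays the role of the exception; the state layout and handler action are illustrative assumptions):

```python
import copy

# Involuntary lane-to-thread transition: an exception in lane mode saves
# the state needed to resume, switches to thread mode for the handler,
# then restores lane mode and retries the faulting instruction.

state = {"mode": "lane", "lane_regs": [{"r0": 10}, {"r0": 0}]}
log = []

def lane_op(regs):
    return 100 // regs["r0"]          # lane 1 will fault (r0 == 0)

try:
    for regs in state["lane_regs"]:   # lane-mode execution
        lane_op(regs)
except ZeroDivisionError:
    saved = copy.deepcopy(state)              # 804B: save state to resume
    state["mode"] = "thread"                  # 806B: switch to thread mode
    log.append("handled in thread mode")      # 808B: exception handler runs
    state["lane_regs"] = saved["lane_regs"]   # 810B: restore lane mode state
    state["lane_regs"][1]["r0"] = 1           # handler's fix for the fault
    state["mode"] = "lane"                    # 812B: transition back to lane mode
    results = [lane_op(r) for r in state["lane_regs"]]  # 814B: retry lanes
```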
Given the above description of the present thread-to-lane/lane-to-thread mode transitioning techniques, the following is a non-limiting example of how the registers are prepared for a thread to lane mode shift and vice versa:
Example: Assume a user wants to execute the function foo(A, B, . . . ) in lane mode, wherein L is the number of lanes, LGR[0..L][0..32] are the general purpose registers for each lane, and GPR[32] are the general purpose registers for thread mode. A compiler or the user must wrap the function foo into a single instruction multiple lane execution wrapper that will perform the following actions:
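By way of illustration only, such a wrapper might resemble the following sketch (a software stand-in for compiler-generated code; the register indices chosen for arguments and results, and the reduction into GPR[5], are illustrative assumptions rather than the specification's exact actions):

```python
# Hypothetical single instruction multiple lane wrapper for foo(A, B):
# transfer arguments into each lane's registers, run foo once per lane,
# then move the results back for thread-mode use.

L = 4                                   # number of lanes
GPR = [0] * 32                          # thread-mode general purpose registers
LGR = [[0] * 32 for _ in range(L)]      # per-lane general purpose registers

def foo(a, b):
    return a + b                        # the user function to run per lane

def lane_wrapper(A, B):
    # 1) transfer arguments from thread resources to each lane's registers
    for lane in range(L):
        LGR[lane][3], LGR[lane][4] = A[lane], B[lane]
    # 2) "enter lane mode": execute foo once per lane against its registers
    for lane in range(L):
        LGR[lane][5] = foo(LGR[lane][3], LGR[lane][4])
    # 3) "return to thread mode": combine lane results into a thread register
    GPR[5] = sum(LGR[lane][5] for lane in range(L))
    return [LGR[lane][5] for lane in range(L)]

out = lane_wrapper([1, 2, 3, 4], [10, 20, 30, 40])
```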
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Turning now to
Apparatus 900 includes a computer system 910 and removable media 950. Computer system 910 includes a processor device 920, a network interface 925, a memory 930, a media interface 935 and an optional display 940. Network interface 925 allows computer system 910 to connect to a network, while media interface 935 allows computer system 910 to interact with media, such as a hard drive or removable media 950.
Processor device 920 can be configured to implement the methods, steps, and functions disclosed herein. The memory 930 could be distributed or local and the processor device 920 could be distributed or singular. The memory 930 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from, or written to, an address in the addressable space accessed by processor device 920. With this definition, information on a network, accessible through network interface 925, is still within memory 930 because the processor device 920 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor device 920 generally contains its own addressable memory space. It should also be noted that some or all of computer system 910 can be incorporated into an application-specific or general-use integrated circuit.
Optional display 940 is any type of display suitable for interacting with a human user of apparatus 900. Generally, display 940 is a computer monitor or other similar display.
Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope of the invention.
This application is a continuation of U.S. application Ser. No. 14/552,145 filed on Nov. 24, 2014, the disclosure of which is incorporated by reference herein.
 | Number | Date | Country
---|---|---|---
Parent | 14552145 | Nov 2014 | US
Child | 14870367 | | US