I. Field of the Disclosure
The technology of the disclosure relates to processing of pipelined computer instructions in central processing unit (CPU)-based systems.
II. Background
The advent of “instruction pipelining” in modern computer architectures has yielded improved utilization of central processing unit (CPU) resources and faster execution times of computer applications. Instruction pipelining is a processing technique whereby a throughput of computer instructions being processed by a CPU may be increased by splitting the processing of each instruction into a series of steps. The instructions are executed in an “execution pipeline” composed of multiple stages, with each stage carrying out one of the steps for each of a series of instructions. As a result, in each CPU clock cycle, steps for multiple instructions can be evaluated in parallel. A CPU may employ multiple execution pipelines to further boost performance.
Some computer architectures that implement instruction pipelining may permit processor optimizations, such as speculative data reads and out-of-order execution of program instructions. While providing further CPU performance improvement, these optimizations may lead to unintended and/or undesirable program behavior if, for example, the executing program depends on data being accessed or instructions being executed in a specified order. Additionally, an executing instruction may effect a change in a state of the CPU that must be successfully completed before subsequent instructions are allowed to execute. For example, a change in a state of the CPU may include changes that affect how the subsequent instructions access resources, such as a change in processor mode or a modification of a page table.
To ensure proper program execution, a “synchronization barrier” may be used in software to ensure that a prior operation (i.e., a data access or instruction execution) completes before code execution is permitted to continue. A synchronization barrier may be expressly provided by an instruction, such as the ARM architecture ISB (instruction synchronization barrier) instruction, or may be implemented as part of another instruction or operation. A computer's architecture may provide that specific operations requiring a synchronization barrier may have the synchronization automatically handled by the computer's hardware, while other operations require software to expressly include a synchronization barrier. Note, however, that for scenarios in which a software synchronization barrier is present, the software synchronization barrier may prove redundant if another synchronization operation occurs immediately prior to execution of the software synchronization barrier.
Embodiments of the disclosure include eliminating redundant synchronization barriers from execution pipelines in instruction processing circuits, and related processor systems, methods, and computer-readable media. For some operations, a computer's architecture may require that a software synchronization barrier be employed, even though a synchronization operation may also occur immediately prior to execution of the software synchronization barrier. By tracking the occurrence of synchronization events, unnecessary software synchronization barriers may be identified and eliminated, thus improving performance of a central processing unit (CPU).
In this regard, in one embodiment, a method for eliminating redundant synchronization barriers in an instruction stream is provided. The method comprises detecting a first synchronization event. The method further comprises detecting a next instruction in an instruction stream. The method additionally comprises determining whether the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event. The method also comprises eliminating the next instruction from the instruction stream, responsive to determining that the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event. In this manner, the average number of instructions executed during each clock cycle by the CPU may be increased by avoiding unnecessary synchronization operations.
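By way of non-limiting illustration, the method summarized above may be sketched in software as follows. This is a minimal sketch only, not the disclosed circuit: the `Instr` record, its `barrier_type` and `triggers` fields, and the string type labels are hypothetical names introduced for the example, and the sketch assumes that an eliminated barrier leaves the preceding synchronization event in effect.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    name: str
    # Set when the instruction is itself a software synchronization barrier.
    barrier_type: Optional[str] = None
    # Set when hardware performs a synchronization operation for the instruction.
    triggers: Optional[str] = None


def eliminate_redundant_barriers(stream: Iterable[Instr]) -> Iterator[Instr]:
    """Yield the stream with barriers dropped when a matching event just occurred."""
    last_event: Optional[str] = None  # type of the immediately preceding sync event
    for instr in stream:
        if instr.barrier_type is not None and instr.barrier_type == last_event:
            # Barrier of a type corresponding to the prior event: eliminate it.
            # Assumption: the prior event still counts as "immediately prior".
            continue
        yield instr
        # A surviving barrier causes its own event; any other instruction
        # either triggers a hardware event or resets the tracking state.
        last_event = instr.triggers or instr.barrier_type


stream = [
    Instr("INST_REQ_SYNC", triggers="instruction"),
    Instr("SYNC_BARRIER_INST", barrier_type="instruction"),
    Instr("ADD"),
]
optimized = [i.name for i in eliminate_redundant_barriers(stream)]
```

In this sketch, a barrier whose type differs from the preceding event (for example, a data barrier following an instruction synchronization event) is left in the stream.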
In another embodiment, an instruction processing circuit is provided. The instruction processing circuit comprises a synchronization event detection circuit and an optimization circuit. The synchronization event detection circuit is configured to detect a first synchronization event. The optimization circuit is configured to detect a next instruction in an instruction stream, and determine whether the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event. The optimization circuit is further configured to eliminate the next instruction from the instruction stream, responsive to determining that the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event.
In another embodiment, an instruction processing circuit is provided. The instruction processing circuit comprises a means for detecting a first synchronization event. The instruction processing circuit further comprises a means for detecting a next instruction in an instruction stream. The instruction processing circuit additionally comprises a means for determining whether the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event. The instruction processing circuit also comprises a means for eliminating the next instruction from the instruction stream, responsive to determining that the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event.
In another embodiment, a non-transitory computer-readable medium is provided, having stored thereon computer-executable instructions to cause a processor to implement a method. The method implemented by the computer-executable instructions comprises detecting a first synchronization event. The method implemented by the computer-executable instructions further comprises detecting a next instruction in an instruction stream. The method implemented by the computer-executable instructions additionally comprises determining whether the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event. The method implemented by the computer-executable instructions also comprises eliminating the next instruction from the instruction stream, responsive to determining that the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event.
With reference now to the drawing figures, several exemplary embodiments of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. It is also to be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these terms are only used to distinguish one element from another, and the elements thus distinguished are not to be limited by these terms. For example, a first instruction could be termed a second instruction, and, similarly, a second instruction could be termed a first instruction, without departing from the teachings of the disclosure.
In this regard,
With continuing reference to
After the instructions 30(0)-30(N) have been fetched and decoded, they are optionally issued to an instruction queue 32 as a buffer for storing the instructions 30(0)-30(N). The instructions 30(0)-30(N) are then issued to one of the execution pipelines 12(0)-12(Q) for execution. In some embodiments, the execution pipelines 12(0)-12(Q) may restrict the types of operations that may be carried out by instructions that execute within the execution pipelines 12(0)-12(Q). For example, pipeline P0 may not permit read access to the registers 16(0)-16(M); accordingly, an instruction that indicates an operation to read register R0 may only be issued to one of the execution pipelines P1 through PQ.
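As one hedged illustration of the issue restriction described above, the selection of permissible execution pipelines may be sketched as a simple capability check. The function name, the operation labels, and the contents of the restriction table are assumptions made for this example and do not appear in the disclosure.

```python
from typing import Dict, List, Set


def eligible_pipelines(required_ops: Set[str],
                       permitted_ops: Dict[str, Set[str]]) -> List[str]:
    """Return the pipelines whose permitted operations cover all required ones."""
    return [name for name, ops in permitted_ops.items() if required_ops <= ops]


# Hypothetical restriction table: pipeline P0 does not permit register reads.
PERMITTED = {
    "P0": {"alu"},
    "P1": {"alu", "reg_read"},
    "P2": {"alu", "reg_read"},
}

# An instruction that reads register R0 needs the "reg_read" operation.
candidates = eligible_pipelines({"reg_read"}, PERMITTED)
```

Under these assumed restrictions, the instruction reading register R0 may be issued only to P1 or P2, while an instruction needing only the "alu" operation may be issued to any of the three pipelines.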
The instruction processing circuit 14 may be any type of device or circuit, and may be implemented or performed with a processor, a digital signal processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. In some embodiments, the instruction processing circuit 14 is incorporated into the instruction decode circuit 28.
With continuing reference to
To provide an explanation of detecting and eliminating a redundant synchronization barrier in the processor-based system 10 in
As noted above, the computer architecture requires the INST_REQ_SYNC instruction 42 to be followed by a software synchronization barrier. Accordingly, a SYNC_BARRIER_INST instruction 46 is detected next in the detected instruction stream 40 by the instruction processing circuit 14. The SYNC_BARRIER_INST instruction 46 is a synchronization barrier instruction that causes a synchronization event 48 to occur. The synchronization event 48 triggered by the SYNC_BARRIER_INST instruction 46 is of the same type as the synchronization event 44. As used herein, the “type” of a synchronization event refers to a general categorization of the synchronization event as, for example, a data synchronization operation or an instruction synchronization operation. A synchronization event may be considered a “full” synchronization event if it ensures barrier operations for both read and write operations, and applies to both inner- and outer-cacheable memory systems and both shareable and non-shareable memory. Alternatively, the synchronization event may be more limited in scope in that it ensures barrier operations only in narrower circumstances than a full synchronization event. It is to be understood that a synchronization event may be considered of the same type as a preceding synchronization event if the synchronization event belongs to the same general categorization and is of a same or narrower scope as the preceding synchronization event. In this example, note that because no other instruction executes after the synchronization event 44 and before the synchronization event 48, the synchronization event 48, and the SYNC_BARRIER_INST instruction 46 that triggered it, are redundant and may be eliminated by the instruction processing circuit 14. A resulting optimized instruction stream 50 illustrates one exemplary result of the process described above.
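The “same type” determination described above may be illustrated, under stated assumptions, by the following sketch. Modeling the scope of a synchronization event as a flat set of covered aspects is a simplification chosen for this example; the `Category`, `SyncEvent`, `FULL_SCOPE`, and `is_same_type` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Category(Enum):
    DATA = "data"
    INSTRUCTION = "instruction"


@dataclass(frozen=True)
class SyncEvent:
    category: Category
    scope: FrozenSet[str]  # assumed model: the aspects the event covers


# A "full" event covers reads and writes, inner- and outer-cacheable
# memory systems, and shareable and non-shareable memory.
FULL_SCOPE = frozenset(
    {"read", "write", "inner", "outer", "shareable", "non-shareable"}
)


def is_same_type(event: SyncEvent, preceding: SyncEvent) -> bool:
    """True if `event` has the same category and a same or narrower scope."""
    return event.category == preceding.category and event.scope <= preceding.scope


full_data = SyncEvent(Category.DATA, FULL_SCOPE)
write_only_data = SyncEvent(Category.DATA, frozenset({"write", "inner", "shareable"}))
full_instr = SyncEvent(Category.INSTRUCTION, FULL_SCOPE)
```

With these definitions, a write-only data event following a full data event is of the same type and therefore a candidate for elimination, whereas a full data event following a write-only data event is not, because its scope is broader than that of the preceding event.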
The resulting optimized instruction stream 50 includes an INST_REQ_SYNC instruction 52 corresponding to the INST_REQ_SYNC instruction 42. Like the INST_REQ_SYNC instruction 42, the INST_REQ_SYNC instruction 52 is an instruction indicating an operation to be followed by a software synchronization barrier, and for which the computer hardware is also permitted to perform a synchronization operation. Accordingly, in this example, the computer hardware carries out a synchronization operation in response to the INST_REQ_SYNC instruction 52, resulting in a synchronization event 54. In some embodiments, the synchronization event 54 may be a data synchronization operation, while some embodiments may provide that the synchronization event 54 is an instruction synchronization operation. As seen in
The instruction processing circuit 14 then detects a next instruction in an instruction stream (block 60). The instruction processing circuit 14 determines whether the next instruction comprises a synchronization barrier of a type corresponding to the first synchronization event (block 62). For example, the instruction processing circuit 14 determines whether the first synchronization event and the next instruction are both considered data synchronization operations, or whether both are instruction synchronization operations. If the next instruction does not comprise a synchronization barrier of a type corresponding to the first synchronization event, processing of the instruction stream continues at block 64 of
If a synchronization event is detected at block 68, a synchronization flag corresponding to a type of the synchronization event (e.g., data synchronization or instruction synchronization) is set (block 70). The synchronization flag indicates whether a synchronization event occurred immediately prior to execution of a next instruction. Some embodiments may provide that the synchronization flag indicates the occurrence of a data synchronization event, while in some embodiments the synchronization flag corresponds to an occurrence of an instruction synchronization event. Processing then resumes at block 72 of
The instruction processing circuit then detects a next instruction in an instruction stream, such as the instruction stream 18 (block 72). The instruction processing circuit determines whether a synchronization event, for example the synchronization event 44 of
If the instruction processing circuit determines at block 74 of
If, at decision block 78 of
It is to be understood that operations for detecting the detected instruction and the synchronization event may be carried out by, for example, the synchronization event detection circuit 34 of the instruction processing circuit 14 of
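The flag-based operations discussed above may be sketched, purely for illustration, as a small state machine. The class and method names are hypothetical. The sketch further assumes one flag per synchronization type and assumes that the flags are cleared whenever an instruction is allowed to remain in the stream; neither detail should be read as the disclosed implementation.

```python
from typing import Optional


class BarrierFilter:
    """Tracks whether a sync event of each type occurred immediately prior."""

    def __init__(self) -> None:
        # Assumption: one synchronization flag per event type.
        self.flags = {"data": False, "instruction": False}

    def on_sync_event(self, kind: str) -> None:
        """Record a synchronization event, whatever its source."""
        self.flags[kind] = True

    def on_instruction(self, name: str, barrier_kind: Optional[str] = None) -> bool:
        """Return True to keep the instruction, False to eliminate it."""
        if barrier_kind is not None and self.flags[barrier_kind]:
            # A matching event occurred immediately prior: barrier is redundant.
            return False
        # Assumption: once an instruction is kept, earlier events are no
        # longer "immediately prior", so all flags are cleared.
        for kind in self.flags:
            self.flags[kind] = False
        return True


f = BarrierFilter()
kept_first = f.on_instruction("INST_REQ_SYNC")
f.on_sync_event("instruction")           # hardware-performed synchronization
kept_barrier = f.on_instruction("SYNC_BARRIER_INST", "instruction")
kept_add = f.on_instruction("ADD")
kept_late_barrier = f.on_instruction("SYNC_BARRIER_INST", "instruction")
```

Because `on_sync_event` is independent of `on_instruction`, the same sketch accommodates synchronization events that do not originate from an instruction.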
To illustrate optimization of an exemplary instruction stream containing an instruction triggering a synchronization event and a redundant synchronization barrier,
As noted above, the ARM architecture requires the MCR instruction 86 to be followed by a software instruction synchronization operation. Thus, an ARM architecture ISB (“instruction synchronization barrier”) instruction 90 is detected next in the detected instruction stream 84. The ISB instruction 90 is a synchronization barrier instruction that causes a synchronization event 92 to occur. The synchronization event 92 triggered by the ISB instruction 90 is of the same type as the synchronization event 88 (i.e., an instruction synchronization operation having the same or narrower scope). Note that because no other instruction executes after the synchronization event 88 and before the synchronization event 92, the synchronization event 92, and the ISB instruction 90 that triggered it, are redundant and may be eliminated by the instruction processing circuit 14.
A resulting optimized instruction stream 94 illustrates one exemplary result. The resulting optimized instruction stream 94 includes an MCR instruction 96 corresponding to the MCR instruction 86. In response to execution of the MCR instruction 96, the computer hardware carries out an instruction synchronization operation, resulting in a synchronization event 98. However, the ISB instruction 90 has been replaced in this instance by an NOP instruction 100 in the resulting optimized instruction stream 94. Thus, there is no redundant synchronization event immediately following the synchronization event 98, resulting in improved CPU performance and instruction throughput.
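The MCR/ISB example above may be approximated by the following sketch, which operates on bare mnemonics. The `optimize` function, the `hw_sync_after` parameter (the set of mnemonics for which hardware is assumed to perform an instruction synchronization operation), and the `replace_with_nop` switch are hypothetical and introduced only for illustration; whether a given MCR instruction actually results in a hardware synchronization depends on the architecture and implementation.

```python
from typing import Iterable, List, Set


def optimize(stream: Iterable[str],
             hw_sync_after: Set[str],
             replace_with_nop: bool = True) -> List[str]:
    """Replace (or drop) an ISB that immediately follows an instruction sync event."""
    out: List[str] = []
    event_pending = False  # True if an instruction sync event just occurred
    for mnemonic in stream:
        if mnemonic == "ISB" and event_pending:
            # Redundant barrier: substitute an NOP, or remove it outright.
            if replace_with_nop:
                out.append("NOP")
            continue
        out.append(mnemonic)
        # An event is pending after a kept ISB or after an instruction for
        # which hardware is assumed to synchronize.
        event_pending = mnemonic == "ISB" or mnemonic in hw_sync_after
    return out


detected = ["MCR", "ISB", "ADD"]
with_nop = optimize(detected, hw_sync_after={"MCR"})
removed = optimize(detected, hw_sync_after={"MCR"}, replace_with_nop=False)
```

Replacing the barrier with an NOP keeps the length and addresses of the stream unchanged, while removing it outright shortens the stream; which is preferable is a design choice this sketch does not settle.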
As noted above with respect to
Resulting optimized instruction stream examples 104 illustrate exemplary sequences of instructions into which the instructions in the detected instruction stream 102 may be processed by the instruction processing circuit 14 of
As discussed above, a synchronization event preceding a software synchronization barrier may result from operations unrelated to instruction execution, such as an interrupt or an exception return. In this regard,
A resulting optimized instruction stream 114 illustrates one exemplary result. As the resulting optimized instruction stream 114 is being processed, a synchronization event 116 occurs in response to an operation such as an interrupt or an exception return. However, the ISB instruction 110 has been replaced in this instance by an NOP instruction 118 in the resulting optimized instruction stream 114. Thus, there is no redundant synchronization event immediately following the synchronization event 116, resulting in improved CPU performance and instruction throughput.
Eliminating redundant synchronization barriers from execution pipelines in instruction processing circuits, and related processor systems, methods, and computer-readable media according to embodiments disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a portable computer, a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, and a portable digital video player.
In this regard,
Other master and slave devices can be connected to the system bus 128. As illustrated in
The CPU(s) 122 may also be configured to access the display controller(s) 140 over the system bus 128 to control information sent to one or more displays 146. The display controller(s) 140 sends information to the display(s) 146 to be displayed via one or more video processors 148, which process the information to be displayed into a format suitable for the display(s) 146. The display(s) 146 can include any type of display, including but not limited to a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the embodiments disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer-readable medium and executed by a processor or other processing device, or combinations of both. The arbiters, master devices, and slave devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a processor, a DSP, an Application Specific Integrated Circuit (ASIC), an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The embodiments disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary embodiments herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary embodiments may be combined. It is to be understood that the operational steps illustrated in the flow chart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.