The present disclosure relates to computer technology, and more specifically to processor technology.
In a processor, an instruction pipeline is usually used to improve the efficiency with which the processor executes instructions. An instruction pipeline divides the processing of an instruction into multiple steps, each of which is completed by a dedicated circuit unit. The instruction pipeline is usually divided into stages including instruction fetching (IF), decoding (ID), execution (EXE), writing back (WB) and other stages. In a processor core where instructions are issued sequentially, partially executed out of order, and written back out of order, one instruction is fetched in each cycle in the IF stage. In the EXE stage, the instruction enters the execution unit that corresponds to its function, and each such execution unit has its own execution delay. An execution unit executes an instruction in each cycle, and the final calculation or execution result is written back to the register file out of order. Improving write-back efficiency in the instruction pipeline is therefore critical to improving the overall performance of the processor.
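As a rough, purely illustrative sketch (not part of the disclosure), the following Python snippet models how instructions fetched one per cycle but executed by units with different delays can reach the write-back stage out of their fetch order; the instruction names, delay values, and stage accounting are hypothetical.

```python
# Hypothetical model of a pipelined core: one instruction is fetched per
# cycle (IF), decoded (ID), spends `exec_delay` cycles in EXE, then writes
# back (WB). Different delays make WB order differ from fetch order.
from collections import namedtuple

Instr = namedtuple("Instr", ["name", "exec_delay"])

def writeback_order(instrs):
    finish = []
    for fetch_cycle, ins in enumerate(instrs):
        wb_cycle = fetch_cycle + 1 + 1 + ins.exec_delay  # after IF and ID
        finish.append((wb_cycle, ins.name))
    return sorted(finish)  # ordered by write-back cycle, not fetch order

program = [Instr("long_op", exec_delay=3), Instr("short_op", exec_delay=1)]
print(writeback_order(program))  # the later-fetched short_op writes back first
```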
An apparatus and a method for writing back an instruction execution result, a processing apparatus, an electronic device and a medium are provided in the present disclosure.
An apparatus for writing back an instruction execution result is provided. The apparatus includes: a first writing port, coupled between a first execution unit with a first execution delay and a register file, and configured to receive a first execution result from the first execution unit, and to write the first execution result back to a first register unit in the register file based on a first writing address; and a second writing port, coupled between a second execution unit with a second execution delay different from the first execution delay and the register file, and configured to receive a second execution result from the second execution unit, and to write the second execution result back to a second register unit in the register file based on a second writing address; in which the first writing port is not coupled to the second execution unit, and the second writing port is not coupled to the first execution unit.
A processing apparatus is provided. The apparatus includes: a register file, including a plurality of register units; a plurality of execution units, configured to execute instructions respectively and output execution results with execution delays; and a plurality of writing ports, in which each writing port is configured to be coupled between an execution unit with a corresponding execution delay and the plurality of register units according to the execution delays of the plurality of execution units, to receive an execution result from the execution unit with the corresponding execution delay, and to write the execution result back to any one of the plurality of register units corresponding to a writing address.
A method for writing back an instruction execution result is provided. The method includes: receiving, via a first writing port, a first execution result from a first execution unit with a first execution delay, and writing the first execution result back to a first register unit in a register file based on a first writing address; and receiving, via a second writing port, a second execution result from a second execution unit with a second execution delay different from the first execution delay, and writing the second execution result back to a second register unit in the register file based on a second writing address; in which the first writing port is not coupled to the second execution unit, and the second writing port is not coupled to the first execution unit.
It should be understood that the content described in this section is not intended to identify key or important features of embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following specification.
The drawings are intended to provide a better understanding of the solution and do not constitute a limitation of the disclosure.
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, which include various details of the embodiments to facilitate understanding and should be considered as merely exemplary. Therefore, those skilled in the art should realize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Similarly, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following descriptions.
The technical solution according to the embodiments of the present disclosure solves the problems caused by providing a cache unit for the write port in the instruction pipeline and adding a feedback path. It can reduce pipeline blockage, significantly improve pipeline efficiency, reduce resource overhead, ease timing pressure, and increase the processor frequency, thereby improving the performance of the entire processor.
Referring to
In the solution shown in
Referring to
In the solution shown in
In the solution shown in
Compared with the solution shown in
For a specific processor, the differences in execution delay among execution units with different functions in the pipeline are pronounced. In view of these execution delay characteristics, allocating resources reasonably and improving the write-back efficiency of the pipelines for the different functions are critical to improving the overall performance of the processor.
At least in order to solve the above-mentioned problems, the embodiments of the present disclosure provide a solution to improve the write-back efficiency in the instruction pipeline. According to an embodiment of the present disclosure, multiple writing ports are provided for one register file, and the multiple writing ports respectively correspond to execution units with respective execution delays. The multiple writing ports are divided and allocated to the execution units with the corresponding execution delays, and each writing port can access any register unit in the register file based on the writing address. Through this configuration of writing ports and channels, performance improvements are obtained.
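Purely as an illustration (not the claimed hardware), a minimal Python sketch of this allocation might key each writing port by execution delay, so results are routed to their own port without arbitration; the port names, delay values, and register-file size below are assumptions made for the sketch.

```python
# Hypothetical allocation: each write port is keyed by an execution delay,
# so every execution unit always writes back through its own port and no
# arbitration between ports is needed.
write_port_for_delay = {
    1: "write_port_A",   # e.g. results produced with a 1-cycle delay
    3: "write_port_B",   # e.g. results produced with a 3-cycle delay
}

def route_result(exec_delay, write_addr, regfile_size=32):
    port = write_port_for_delay[exec_delay]
    # Any register unit in the file is reachable from any port.
    assert 0 <= write_addr < regfile_size
    return port, write_addr

print(route_result(exec_delay=3, write_addr=17))  # ('write_port_B', 17)
```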
According to the embodiments of the present disclosure, it is possible to reduce pipeline blockage caused by write-back and significantly improve pipeline efficiency, thereby improving the performance of the entire processor. Moreover, the embodiments of the present disclosure avoid the cache queue required to ensure correctness in the traditional solution and reduce the resource overhead; especially in a scenario where the registers are wide, the use of hardware resources can be greatly reduced. Therefore, the embodiments of the present disclosure are beneficial to reducing the processor area and power consumption. In addition, the embodiments of the present disclosure eliminate the feedback path related to the cache queue, thereby reducing timing pressure and increasing the frequency of the processor.
Hereinafter, various exemplary embodiments of the present disclosure will be described in detail with reference to various embodiments in conjunction with the accompanying drawings.
The fetching unit 202 is configured to fetch instructions. The decoding unit 204 is configured to decode the fetched instructions. Decoded instructions of different types are respectively input to corresponding execution units with different functions.
The first execution unit 206 is configured to receive the decoded first type of instruction and execute the instruction. The first execution unit 206 is configured to output the first execution result with a first execution delay. The second execution unit 208 is configured to receive the decoded second type of instruction and execute the instruction. The second execution unit 208 is configured to output the second execution result with the second execution delay. The first execution delay is different from the second execution delay. Those skilled in the art can understand that the “first execution delay is different from the second execution delay” in the present disclosure means that there is a significant difference between the first execution delay and the second execution delay, so that the write-back of the first execution result may occur in the same cycle as the write-back of the second execution result.
It should be understood that the instruction fetching unit 202, the decoding unit 204, the first execution unit 206, and the second execution unit 208 are circuit units corresponding to the instruction fetching, decoding, and execution stages of the instruction pipeline. These circuit units are known in the art, and detailed descriptions are omitted here.
Referring to
The first writing port 210 is coupled between the first execution unit 206 having the first execution delay and the register file 214. The first writing port 210 is configured to receive the first execution result from the first execution unit 206 and write the first execution result back to the first register unit in the register file 214 based on the first writing address. The first writing port 210 is coupled to all register units 216 in the register file 214 via a first path 218. The first path 218 includes a first wiring set, and the wirings of the first wiring set are respectively coupled between the first writing port 210 and the register units 216 in the register file 214. Therefore, any register unit 216 in the register file 214 can be directly accessed via the first writing port 210. In this way, the instruction execution result output with the first execution delay can be directly written back to any register unit 216 in the register file 214 via the first writing port 210, thereby improving the write-back efficiency.
The second writing port 212 is coupled between the second execution unit 208 having the second execution delay and the register file 214. The second writing port 212 is configured to receive the second execution result from the second execution unit 208 and write the second execution result back to the second register unit in the register file 214 based on the second writing address. The second writing port 212 is coupled to all the register units 216 in the register file 214 via a second path 220. The second path 220 includes a second wiring set, and the wirings of the second wiring set are respectively coupled between the second writing port 212 and the register units 216 in the register file 214. Therefore, any register unit 216 in the register file 214 can be directly accessed via the second writing port 212. In this way, the instruction execution result output with the second execution delay can be directly written back to any register unit 216 in the register file 214 via the second writing port 212, thereby improving the write-back efficiency.
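To make the coupling concrete, here is a minimal Python model (an illustrative assumption, not the claimed circuit) in which each writing port is bound to exactly one execution unit yet can address every register unit in the file; the class names, register-file size, and sample values are invented for the sketch.

```python
# Hypothetical model: each write port is dedicated to a single execution
# unit (one execution delay), yet its write path reaches every register
# unit, so it can write back to any address in the register file.
class RegisterFile:
    def __init__(self, num_units):
        self.units = [0] * num_units              # register units

class WritePort:
    def __init__(self, regfile, source_unit):
        self.regfile = regfile                    # wired to all register units
        self.source_unit = source_unit            # the only execution unit coupled here

    def write_back(self, result, write_addr):
        # Any register unit is directly reachable through this port.
        self.regfile.units[write_addr] = result

regfile = RegisterFile(num_units=32)
first_port = WritePort(regfile, source_unit="first execution unit (delay d1)")
second_port = WritePort(regfile, source_unit="second execution unit (delay d2)")
first_port.write_back(result=42, write_addr=5)    # any address is allowed
second_port.write_back(result=7, write_addr=30)
print(regfile.units[5], regfile.units[30])        # 42 7
```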
In some embodiments, the processing apparatus 200 may include multiple execution units, and the apparatus 230 may include multiple writing ports. The multiple execution units are configured to execute instructions respectively, and to output execution results with respective execution delays. The multiple writing ports respectively correspond to the execution delays of the multiple execution units, and are respectively allocated to the execution units with the corresponding execution delays. Each writing port is coupled, according to the execution delay of the corresponding execution unit, between the execution unit with that execution delay and a plurality of register units. Each writing port is configured to receive an execution result from the execution unit with the corresponding execution delay, and to write the execution result back to any one of the plurality of register units corresponding to the writing address.
According to the embodiments of the present disclosure, in a processor core where instructions are issued sequentially, partially executed out of order, and written back out of order, one instruction is fetched in each cycle in the instruction fetching stage; in the execution stage, the instruction is executed by an execution unit with a certain execution delay, and the instruction execution result is written back to the register file via the writing port corresponding to that execution delay. Since one instruction is executed in each cycle, and each writing port is assigned to an execution unit with a corresponding execution delay, no writing port receives multiple instruction execution results at the same time, so there is no conflict at any writing port. In this way, it is possible to avoid the arbitration logic or multiplexers, buffer queues, and feedback paths that are necessary in traditional solutions. As a result, pipeline blockage is reduced and pipeline efficiency is significantly improved, thereby improving the performance of the entire processor; cache space is avoided and resource overhead is reduced, especially when the register bit width is large, where the reduction of hardware resources is greater, with a positive effect on processor area and power consumption; and the feedback path related to the cache space is eliminated, so timing pressure is reduced and the processor frequency can be increased.
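The no-conflict property can be illustrated with a small Python check (a sketch under the assumptions of one issue per cycle and one port per delay, with made-up delay values; not taken from the disclosure): because write-back time equals issue cycle plus delay, two results mapped to the same per-delay port can never collide.

```python
# Hypothetical check: with one instruction issued per cycle and one write
# port per execution delay, two results never reach the same port in the
# same cycle, so no arbitration, buffering, or feedback path is needed.
def writeback_schedule(issue_delays):
    # issue_delays[i] = execution delay of the instruction issued in cycle i
    schedule = {}                                 # (port, cycle) -> issue cycle
    for issue_cycle, delay in enumerate(issue_delays):
        port = delay                              # the port dedicated to this delay
        wb_cycle = issue_cycle + delay
        assert (port, wb_cycle) not in schedule, "write port conflict"
        schedule[(port, wb_cycle)] = issue_cycle
    return schedule

# Mixed delays: same-delay results never collide (their write-back cycles
# differ), and different-delay results go to different ports.
print(writeback_schedule([3, 1, 3, 1, 2]))
```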
In addition, regarding the conflict problem of writing to the same register unit at the same time, processor structures known in the art that issue sequentially, execute partially out of order, and write back out of order all have hardware structures that detect and avoid such conflicts, and this conflict problem can be avoided in the decoding stage. In the embodiments of the present disclosure, the processing logic of the solutions known in the art can be directly applied, so the correctness problem of multiple writing ports simultaneously writing back to the register unit of the same address is avoided without an additional processing structure.
The third execution unit 302 is configured to receive the decoded third type of instruction and execute the instruction. The third execution unit 302 is configured to output the third execution result with a third execution delay. The third execution delay is different from the first execution delay of the first execution unit 206 and the second execution delay of the second execution unit 208. Similarly, “the third execution delay is different from the first execution delay and the second execution delay” means that there is a significant difference among the third execution delay, the first execution delay and the second execution delay, so that the write-back of the third execution result may occur in the same cycle as the write-back of the first or the second execution result. In some embodiments, the third execution unit 302 is an execution unit that is frequently used or whose feedback path to the fetching unit 202 has a serious impact on timing.
The third writing port 304 is coupled between the third execution unit 302 and the register file 214. The third writing port 304 is configured to receive the third execution result from the third execution unit 302 and write the third execution result back to the third register unit in the register file 214 based on the third writing address. The third writing port 304 is coupled to all the register units 216 in the register file 214 via a third path 306. The third path 306 includes a third wiring set, and the wirings of the third wiring set are respectively coupled between the third writing port 304 and the register units 216 in the register file 214. In this way, any register unit 216 in the register file 214 can be directly accessed via the third writing port 304.
According to the embodiments of the present disclosure, for an execution unit that is frequently used or whose feedback path to the fetching unit has a serious impact on timing, a corresponding writing port is additionally allocated. In this way, the instruction execution result output by the third execution unit 302 can be directly written back to any register unit 216 in the register file 214 via the third writing port 304, thereby improving the write-back efficiency without affecting the timing of the processor, and improving processor performance.
In some embodiments, the processing apparatus 300 includes multiple execution units, and the execution delays of the multiple execution units are different from each other. In this case, multiple writing ports may be provided, and each writing port is coupled to a corresponding one of the multiple execution units. In addition, each writing port can access any register unit in the register file.
The fourth execution unit 402 is configured to receive the decoded fourth type of instruction and execute the instruction. The fourth execution unit 402 is configured to output the fourth execution result with a fourth execution delay. In some embodiments, the fourth execution delay and the first execution delay of the first execution unit 206 are substantially equal to a certain delay value. In some embodiments, the fourth execution delay is substantially equal to the first execution delay, and the difference between the fourth execution delay and the first execution delay is within a predetermined range. The size of the predetermined range may be equal to a predetermined fraction of a cycle, such that the write-back of the fourth execution result does not occur in the same cycle as the write-back of the first execution result. In some embodiments, the first execution unit 206 is an execution unit for integer operations, and the fourth execution unit 402 is an execution unit for floating point operations.
In some embodiments, the first writing port 210 is shared by the first execution unit 206 and the fourth execution unit 402. The first writing port 210 is further coupled between the fourth execution unit 402 with the fourth execution delay and the register file 214. The first writing port 210 is further configured to receive the fourth execution result from the fourth execution unit 402, and to write the fourth execution result back to the fourth register unit in the register file 214 based on the fourth writing address.
According to an embodiment of the present disclosure, one writing port is shared by multiple execution units with substantially the same execution delay. Since the multiple execution units sharing the same writing port execute instructions in a pipeline, their results do not arrive at the shared writing port in the same cycle, so no write-back correctness problem is caused. In this way, the instruction execution result output by the fourth execution unit 402 can be directly written back to any register unit 216 in the register file 214 via the first writing port 210, thereby improving the write-back efficiency, while the impact of this configuration on resource consumption and efficiency is small, thereby improving instruction processing performance.
In some embodiments, the processing apparatus 400 includes a plurality of execution units, and at least two execution delays of the respective execution delays of the plurality of execution units are substantially equal to a specific delay value. In this case, one writing port among the plurality of writing ports is allocated to at least two execution units with the at least two execution delays among the plurality of execution units.
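As an illustrative sketch only (assuming a common delay value and one instruction issued per cycle; the numeric delay and unit labels are not specified in the disclosure and are invented here), the Python snippet below shows why two execution units with substantially equal execution delay can share one writing port without conflicts.

```python
# Hypothetical check: two execution units with (substantially) equal delay
# can share one write port, because only one instruction is issued per
# cycle, so their results never arrive at the shared port in the same cycle.
def shared_port_schedule(issued_units, common_delay=4):
    # issued_units[i] = which of the sharing units ("int" or "fp") received
    # the instruction issued in cycle i; common_delay is an assumed value.
    busy_cycles = set()
    for issue_cycle, unit in enumerate(issued_units):
        wb_cycle = issue_cycle + common_delay     # same latency for both units
        assert wb_cycle not in busy_cycles, "shared write port conflict"
        busy_cycles.add(wb_cycle)
    return sorted(busy_cycles)

print(shared_port_schedule(["int", "fp", "fp", "int"]))  # [4, 5, 6, 7]
```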
The fifth execution unit 502 is configured to receive the decoded fifth type of instruction and execute the instruction. The fifth execution unit 502 is configured to output the fifth execution result with a fifth execution delay. The fifth execution delay is different from the first execution delay of the first execution unit 206 and the second execution delay of the second execution unit 208. In some embodiments, the fifth execution unit 502 is an execution unit that is used less frequently.
The cache unit 504 is coupled to the fifth execution unit 502. The cache unit 504 is configured to receive the fifth execution result from the fifth execution unit 502. In some embodiments, the cache unit 504 is a cache queue. The cache unit 504 is configured to temporarily store the fifth execution result. In some embodiments, the depth of the cache unit 504 is small. In some embodiments, there is no need to set a feedback path from the cache unit 504 to the fetching unit 202.
The multiplexer 506 is coupled between the cache unit 504 and the second writing port 212. The multiplexer 506 is configured to receive the fifth execution result from the cache unit 504, and to transmit the fifth execution result to the second writing port 212 based on a selection signal, so that the second writing port 212 writes the fifth execution result back to the fifth register unit in the register file 214 based on the fifth writing address. In some embodiments, the multiplexer 506 is configured to transmit the fifth execution result temporarily stored in the cache unit 504 to the second writing port 212 when the second writing port 212 is in an idle state, so as to write back the fifth execution result. In some embodiments, the multiplexer 506 may include arbitration logic.
According to the embodiments of the present disclosure, for an execution unit that is used infrequently and whose result does not need a feedback path to the front of the pipeline, a cache unit with a small depth is provided and another writing port is reused. In this way, the instruction execution result output by the fifth execution unit 502 can be written back to any register unit 216 in the register file 214 via the second writing port 212. Without increasing the wiring between writing ports and register units, the write-back efficiency may be ensured and the consumption of processor resources may be reduced, thereby improving processor performance.
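The following Python sketch (a hypothetical model, not the claimed circuit) captures this reuse: a shallow queue buffers results from the infrequently used unit, and a mux-like step drains one entry through the shared second writing port only when that port is idle; the class name, queue depth, and sample values are assumptions.

```python
# Hypothetical model of the buffered path: a shallow queue holds results
# from the rarely used unit, and a mux-like step forwards one entry through
# the shared second write port only in cycles when that port is idle.
from collections import deque

class BufferedWriteback:
    def __init__(self, regfile, depth=2):
        self.regfile = regfile                    # list standing in for register units
        self.queue = deque()                      # shallow cache queue
        self.depth = depth                        # overflow/stall handling omitted

    def push(self, result, write_addr):
        self.queue.append((result, write_addr))   # result from the rarely used unit

    def drain_if_idle(self, port_busy):
        # Multiplexer behavior: forward a buffered result only when the
        # shared write port has no result of its own this cycle.
        if not port_busy and self.queue:
            result, addr = self.queue.popleft()
            self.regfile[addr] = result           # written back via the shared port

regfile = [0] * 32
buffered = BufferedWriteback(regfile)
buffered.push(result=99, write_addr=12)
buffered.drain_if_idle(port_busy=True)            # port occupied: result stays queued
buffered.drain_if_idle(port_busy=False)           # port idle: result is written back
print(regfile[12])                                # 99
```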
According to the embodiments of the present disclosure, any configuration in the embodiments described above with reference to
In block 602, a first execution result is received via a first writing port from a first execution unit with a first execution delay, and the first execution result is written back to a first register unit in a register file based on a first writing address.
In block 604, a second execution result is received via a second writing port from a second execution unit with a second execution delay different from the first execution delay, and the second execution result is written back to a second register unit in the register file based on a second writing address. The first writing port is not coupled to the second execution unit, and the second writing port is not coupled to the first execution unit.
In some embodiments, the method 600 may further include: receiving, via a third writing port, a third execution result from a third execution unit with a third execution delay different from the first execution delay and the second execution delay; and writing the third execution result back to a third register unit in the register file based on a third writing address. The third writing port is not coupled to the first execution unit and the second execution unit.
In some embodiments, the method 600 may further include: receiving, via the first writing port, a fourth execution result from a fourth execution unit with a fourth execution delay, and writing the fourth execution result back to a fourth register unit in the register file based on a fourth writing address. The difference between the fourth execution delay and the first execution delay is within a predetermined range.
In some embodiments, the method 600 may further include: receiving, via a cache unit, a fifth execution result from a fifth execution unit with a fifth execution delay different from the first execution delay and the second execution delay; and receiving, via a multiplexer, the fifth execution result from the cache unit, and transmitting the
fifth execution result to the second writing port based on a selection signal, so as to cause the second writing port to write the fifth execution result back to a fifth register unit in the register file based on a fifth writing address. In this case, the second writing port receives the second execution result from the second execution unit via the multiplexer.
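As a purely illustrative end-to-end sketch of method 600 (all function and port names are hypothetical and not from the disclosure), each completed instruction's result arrives at the writing port dedicated to its producing unit's delay, and each port can write any addressed register unit.

```python
# Hypothetical walk-through of method 600: every execution result arrives at
# the write port dedicated to its producing unit's delay, and each port can
# write back to any register unit identified by the write address.
def method_600(regfile, writeback_events):
    # writeback_events: iterable of (port_name, result, write_addr) tuples,
    # one per completed instruction (blocks 602, 604 and the optional steps).
    for port_name, result, write_addr in writeback_events:
        # "Receive via port_name", then write back to the addressed unit.
        regfile[write_addr] = result
    return regfile

regs = [0] * 16
method_600(regs, [("first_port", 5, 2),    # block 602: delay-d1 result
                  ("second_port", 9, 7)])  # block 604: delay-d2 result
print(regs[2], regs[7])                    # 5 9
```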
According to an embodiment of the present disclosure, an electronic device and a computer-readable storage medium are also provided.
Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or required herein. In addition, the electronic device can also represent a programmable device, such as a field programmable gate array (FPGA), a programmable logic device (PLD), and the like.
As shown in
The memory 702 is a non-transitory computer-readable storage medium provided by this disclosure. The memory 702 stores instructions executable by at least one processor, so that the at least one processor executes the method for writing back an instruction execution result provided by the present disclosure. The non-transitory computer-readable storage medium of the present disclosure stores computer instructions, and the computer instructions are used to make a computer execute the method for writing back instruction execution results provided by the present disclosure.
The memory 702, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor 701 executes various functional applications and data processing of the server by running non-transitory software programs, instructions, and modules stored in the memory 702, that is, implements the method for writing back instruction execution results in the foregoing method embodiments.
The memory 702 may include a storage program area and a storage data area, where the storage program area may store an operating system and application programs required for at least one function. The storage data area may store data created according to the use of the electronic device for implementing the method. In addition, the memory 702 may include a high-speed random access memory, and a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 702 may optionally include a memory remotely disposed with respect to the processor 701, and these remote memories may be connected to the electronic device for implementing the method through a network. Examples of the above network include, but are not limited to, the Internet, an Intranet, a local area network, a mobile communication network, and combinations thereof.
The electronic device 700 may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703, and the output device 704 may be connected through a bus 705 or in other ways. In
Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, application specific integrated circuits, programmable devices, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor can be a dedicated or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and may transmit the data and instructions to the storage system, the at least one input device, and the at least one output device.
These computing programs (also known as programs, software, software applications, or code) include machine instructions of a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus, and/or device (for example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
In order to provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (e.g., a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD) monitor) for displaying information to the user, and a keyboard and pointing device (such as a mouse or trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user. For example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or haptic feedback), and the input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and technologies described herein can be implemented in a computing system that includes back-end components (for example, a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and technologies described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area network (LAN), wide area network (WAN), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relation is generated by computer programs running on the respective computers and having a client-server relation with each other. The server may be a cloud server, also known as a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the difficult management and poor business scalability of traditional physical hosting and VPS services.
According to the technical solution of the embodiment of the present disclosure, multiple writing ports are provided for the register file, and the multiple writing ports respectively correspond to the execution delays of the multiple execution units and are respectively allocated to the execution units with corresponding execution delays. In this way, while ensuring that no write conflicts due to execution delays occur, the efficiency of writing back the execution result of the instruction is improved.
It should be understood that the various forms of processes shown above can be used to reorder, add or delete steps. For example, the steps described in the disclosure could be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the disclosure is achieved, which is not limited herein.
The above specific embodiments do not constitute a limitation on the protection scope of the disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement and improvement made within the spirit and principle of this application shall be included in the protection scope of this application.
Foreign application priority data: Application No. 202010681665.9, filed Jul. 2020, CN, national.
The present application is based upon and claims priority to Chinese Patent Application No. 202010681665.9, filed on Jul. 15, 2020, the entire contents of which are incorporated herein by reference.