The present embodiments relate generally to the field of data processing systems. More specifically, the embodiments relate to history buffers and implementation of the history buffers in a central processing unit.
Central processing units (CPUs) may implement multi-threaded core technologies that utilize one or more execution lanes. Each execution lane utilizes a register file (RF) and a history buffer (HB) that contains architected register data. The HB is a component of an execution unit that preserves register contents when a register is a target of a newly dispatched instruction and the target register's contents require preservation, such as during a branch instruction.
Instructions are chronologically tagged, e.g. by the order in which they were fetched. Once the instructions are fetched and tagged, the instructions are then executed to generate results, which are also tagged. The RF may contain results from the most recently executed instructions, i.e. newer register data, and the HB may contain results from previously executed instructions, i.e. older register data. The older register data is displaced by newer register data from one or more entries in the RF to one or more entries of the HB. Because the HB has a limited number of entries, in some embodiments the HB may reach capacity, which can impact CPU performance.
There are physical limitations present with respect to configuration and use of the HB. Namely, each individual HB must contain one write port for each results bus. However, multiple write ports are expensive to implement in that the circuit area grows with each added write port. Accordingly, there is a need to balance the physical limitations of the circuit area with management of HBs and associated register data.
The embodiments described herein include a system, computer program product, and a method for processing instructions responsive to a split level history buffer in a central processing unit.
In one aspect, a computer system is provided with a central processing unit (CPU) having a history buffer split into multiple levels, including first, second, and third levels. The history buffer includes an associated history buffer (HB) controller with logic and/or program instructions for reading and writing data in the history buffer. Similarly, the CPU includes a register file and an associated register file (RF) controller with logic and/or program instructions for reading and writing data to the register file. The RF controller is configured to fetch a first instruction, tag the fetched first instruction, and allocate space for the first instruction in an entry of a register file. The RF controller further fetches a second instruction and tags the fetched second instruction. Thereafter, the RF controller evicts the first instruction from the entry of the register file, allocates space for the second instruction in the entry of the register file, and communicates with the HB controller to store the first instruction in the first level of the history buffer. In response to generation of a result for the first instruction, the HB controller moves the first instruction from the first level of the history buffer, stores the generated result in the second level of the history buffer, and invalidates the entry of the first instruction in the first level of the history buffer. Responsive to instruction completion and identification of pre-transactional memory data contained in the first instruction, the HB controller moves the first instruction from the second level to the third level of the history buffer, with the moved first instruction including pre-transactional memory data. In response to movement of the first instruction to the third level of the history buffer, the HB controller invalidates the entry of the first instruction in the second level of the history buffer.
In another aspect, a computer program product is provided for processing instructions responsive to a split history buffer of a central processing unit (CPU). The computer program product comprises a computer readable storage device having program code embodied therewith, the program code executable by a processing unit. The history buffer is configured with multiple levels, including first, second, and third levels. A register file is configured with a register file (RF) controller configured with logic and/or program instructions to read and write instructions to the register file. Similarly, the history buffer is configured with an associated controller, referred to as a history buffer (HB) controller, with logic and/or program instructions to read and write data to the history buffer. Program instructions are provided and managed by the RF controller to fetch a first instruction, tag the fetched first instruction, and allocate space for the first instruction in an entry of a register file, and to fetch a second instruction, tag the fetched second instruction, evict the first instruction from the entry of the register file, and allocate space for the second instruction in the entry of the register file. Program instructions are provided and managed by the HB controller to allocate space for the first instruction in the first level of the history buffer. In response to generation of a result for the first instruction, the HB controller logic will move the first instruction from the first level of the history buffer, and allocate space for the first instruction, including the generated result, in the second level of the history buffer. Responsive to movement of the first instruction to the second level of the history buffer, the HB controller logic invalidates the entry of the first instruction in the first level of the history buffer. 
Similarly, in response to instruction completion and identification of pre-transactional memory data contained in the first instruction, the HB controller logic moves the first instruction from the second level to the third level of the history buffer. Responsive to movement of the first instruction to the third level of the history buffer, the HB controller logic invalidates the entry of the first instruction in the second level of the history buffer.
In yet another aspect, a method is provided for processing instructions responsive to a split history buffer of a central processing unit (CPU). The history buffer is configured with multiple levels, including first, second, and third levels. A first instruction is fetched and tagged, and space for the first instruction is allocated in an entry of a register file. Similarly, a second instruction is fetched and tagged. The first instruction is evicted from the entry of the register file. In addition, space is allocated in the entry of the register file for the second instruction, and the first instruction is stored in the first level of the history buffer. Responsive to generation of a result for the first instruction, the first instruction is moved from the first level of the history buffer, which further includes storing the first instruction and the generated result in the second level of the history buffer and invalidating the entry of the first instruction in the first level of the history buffer. Similarly, responsive to instruction completion, the first instruction is moved from the second level to the third level of the history buffer, which further includes moving the first instruction including pre-transactional memory data. Finally, responsive to movement of the first instruction to the third level of the history buffer, the entry of the first instruction in the second level of the history buffer is invalidated.
These and other features and advantages will become apparent from the following detailed description of the presently preferred embodiment(s), taken in conjunction with the accompanying drawings.
The drawings referenced herein form a part of the specification. Features shown in the drawings are meant as illustrative of only some embodiments, and not of all embodiments, unless otherwise explicitly indicated.
It will be readily understood that the components of the present embodiment, as generally described and illustrated in the Figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the apparatus, system, and method, as presented in the Figures, is not intended to limit the scope of the embodiments, as claimed, but is merely representative of selected embodiments.
Reference throughout this specification to “a select embodiment,” “one embodiment,” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present embodiments. Thus, appearances of the phrases “a select embodiment,” “in one embodiment,” or “in an embodiment” in various places throughout this specification are not necessarily referring to the same embodiment.
The illustrated embodiments will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of devices, systems, and processes that are consistent with the embodiments as claimed herein.
The embodiments shown and described below provide efficient and cost-effective systems and methods for managing architected register data within central processing units. A split history buffer is implemented, including a first level history buffer (L1), a second level history buffer (L2), and a third level history buffer (L3). Each of the history buffers has a specific design characteristic and function that is cognizant of limited circuit design space.
Referring to
Instruction fetch unit (120) fetches one or more instructions from program memory (not shown), and transmits the one or more fetched instructions, each tagged with a unique multi-bit ITAG, i.e. a mechanism used to tag or identify instructions, to register file (140), e.g. storing the instructions as an entry in the register file (140). Each of the one or more fetched instructions is represented by a numeric string describing an operation for system (110) to execute. In one embodiment, instruction fetch unit (120) may utilize a program counter (not shown) to tag each of the one or more fetched instructions. For example, three instructions fetched from program memory may be tagged by three unique multi-bit ITAGs indicating an order in which the three instructions were fetched. In one embodiment, instruction fetch unit (120) may include a decoding component to partition the fetched instructions for subsequent execution. In a further embodiment, the instruction fetch unit (120) may support branch prediction.
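As a minimal sketch of the tagging described above, the following Python model hands out ITAGs from a running counter so that ITAG order reflects fetch order. The names FetchUnit and TaggedInstruction, the 9-bit ITAG width, and the example instruction strings are assumptions made purely for illustration; the embodiments state only that the ITAG is unique and multi-bit.

```python
from dataclasses import dataclass
from typing import List

ITAG_BITS = 9  # assumed width; the description only says "multi-bit"


@dataclass(frozen=True)
class TaggedInstruction:
    itag: int      # unique tag reflecting fetch order
    encoding: str  # stand-in for the numeric string describing the operation


class FetchUnit:
    """Hypothetical model of instruction fetch unit (120)."""

    def __init__(self, program: List[str]) -> None:
        self._program = program
        self._pc = 0         # program counter into the toy program
        self._next_itag = 0  # next ITAG to hand out

    def fetch(self) -> TaggedInstruction:
        tagged = TaggedInstruction(self._next_itag, self._program[self._pc])
        self._pc += 1
        # Wrap within the assumed ITAG width; wrap-aware ordering is not modeled.
        self._next_itag = (self._next_itag + 1) % (1 << ITAG_BITS)
        return tagged


unit = FetchUnit(["add r1, r2, r3", "mul r4, r1, r1", "sub r1, r4, r2"])
fetched = [unit.fetch() for _ in range(3)]
```

Here the three fetched instructions carry three distinct ITAGs in fetch order, mirroring the three-instruction example in the paragraph above.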
Register file (140) contains the one or more fetched instructions prior to dispatching each of the one or more fetched instructions to execution unit (150). In one embodiment, the register file (140) is an array of processor registers having one or more entries available to store the one or more fetched instructions. As shown, the register file (140) includes a register file controller (148), hereinafter referred to as RF controller, to implement logic and associated program instructions for writing entries into the register file array and reading the entries out of the register file array when evicting to the history buffer (130). It is understood that the register file (140) may have an older instruction entry. Every instruction evicts the ‘prior’ data. In an example with only two instructions and both instructions targeting the same register, the second instruction evicts ‘prior data’ written by the first instruction. However, in this same two instruction example, if the first and second instructions target different registers, e.g. first and second registers, then the second instruction will not evict the first instruction. Each of the first and second instructions will evict whatever prior data was in the respective register. Each entry of the register file (140) contains at least a fetched instruction and the ITAG that tags it. Entry data of an entry in the register file (140) may be evicted to the split history buffer (130) through logic associated with the RF controller (148), as shown and described in
The execution unit (150) generates a result for each of the one or more tagged instructions dispatched by the register file (140), e.g. dispatched by the RF controller (148). In one embodiment, the execution unit (150) generates a result for a tagged instruction by performing operations and calculations specified by operation code of the tagged instruction. Execution unit (150) includes functional unit (162) and functional unit (172), which correspond to reservation stations (160) and (170), respectively. In one embodiment, execution unit (150) and components therein are each connected, such that each component is configured to perform at least a portion of a desired operation during a clock cycle.
Reservation stations (160) and (170) enable the system (110) to process and execute instructions out of order. In one embodiment, reservation stations (160) and (170) facilitate parallel execution of instructions. For example, reservation stations (160) and (170) permit system (110) to fetch and re-use a data value once the data value has been computed by one or both of functional units (162) and (172). In one embodiment, the system (110) uses reservation stations (160) and (170) so that the system (110) does not have to wait for a data value to be stored in the split history buffer (130) and re-read. In one embodiment, reservation stations (160) and (170) are connected to functional units (162) and (172), respectively, for dynamic instruction scheduling. Furthermore, reservation stations (160) and (170) may enable the system (110) to have advanced capabilities for processing and executing one or more tagged instructions. Reservation stations (160) and (170) may contain necessary logic used to determine a manner to execute a tagged instruction once the tagged instruction is dispatched from register file (140).
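One way to picture how a reservation station lets a waiting instruction pick up a data value as soon as a functional unit computes it, without a round trip through the split history buffer (130), is the sketch below. It is an illustration in the style of generic tag-based operand capture and is not taken from the embodiments: the class ReservationStation, its method names, and the use of ITAGs as producer tags are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WaitingInstruction:
    itag: int
    op: str
    values: Dict[str, Optional[int]]  # operand name -> value, None if pending
    producers: Dict[str, int] = field(default_factory=dict)  # operand -> producer ITAG

    def ready(self) -> bool:
        return all(v is not None for v in self.values.values())


class ReservationStation:
    """Hypothetical stand-in for reservation stations (160) and (170)."""

    def __init__(self) -> None:
        self.waiting: List[WaitingInstruction] = []

    def dispatch(self, instr: WaitingInstruction) -> None:
        self.waiting.append(instr)

    def capture(self, producer_itag: int, value: int) -> None:
        """Copy a freshly computed result into every operand waiting on it."""
        for instr in self.waiting:
            for src, tag in list(instr.producers.items()):
                if tag == producer_itag:
                    instr.values[src] = value
                    del instr.producers[src]

    def issue_ready(self) -> List[WaitingInstruction]:
        """Remove and return instructions whose operands are all available."""
        ready = sorted((i for i in self.waiting if i.ready()), key=lambda i: i.itag)
        self.waiting = [i for i in self.waiting if not i.ready()]
        return ready


rs = ReservationStation()
# ITAG 1 needs the result of ITAG 0; ITAG 2 already has both operands.
rs.dispatch(WaitingInstruction(1, "mul", {"a": None, "b": 3}, {"a": 0}))
rs.dispatch(WaitingInstruction(2, "add", {"a": 5, "b": 6}))
first = [i.itag for i in rs.issue_ready()]  # the younger ITAG 2 issues first
rs.capture(producer_itag=0, value=7)        # result of ITAG 0 becomes available
second = rs.issue_ready()
```

In this toy run the younger instruction issues ahead of the older one, which is the out-of-order behavior the paragraph attributes to the reservation stations.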
Functional units (162) and (172) output result data for tagged instructions dispatched from register file (140). In one embodiment, functional unit (162) executes one tagged instruction to generate a result for that tagged instruction, and functional unit (172) executes another tagged instruction to generate another result for the other tagged instruction. In one embodiment, functional units (162) and (172) are components, e.g. adders, multipliers, etc., connected to reservation stations (160) and (170), respectively. For example, functional units (162) and (172) may be arithmetic logic units (ALUs) or floating point units (FPUs). In another embodiment, functional units (162) and (172) may generate a plurality of results in parallel, independently, and/or sequentially. Similarly, in one embodiment, additional functional units and associated reservation stations may be implemented in the system (110), and as such, the quantity shown and described herein should not be considered limiting.
As shown, the split history buffer (130) comprises three levels, including the L1 (132), L2 (134), and L3 (136). The levels (132)-(136) contain one or more entries storing data from register file (140). The split history buffer (130) is a history buffer that has been partitioned into three levels, e.g. L1 (132), L2 (134), and L3 (136), to effectively increase a number of entries in the split history buffer (130) containing one or more tagged instructions and additional information for the one or more tagged instructions. System (110) utilizes the levels (132)-(136) to store one or more tagged instructions and additional information for each of the one or more tagged instructions. All entry data evicted from register file (140) are stored in the split history buffer (130) prior to the system (110) performing a subsequent action, e.g. completion, flushing, restoration, etc. System (110) utilizes logic and other signals to ensure that L1 (132), L2 (134), and L3 (136) contain evicted entry data in a correct chronological order.
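The two-instruction eviction example given earlier for the register file (140) can be sketched as follows. This is a simplified model under stated assumptions: a flat list stands in for the split history buffer (130), the registers start out empty, and the names Entry, RegisterFile, and dispatch are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    lreg: int                           # logical register targeted
    itag: int                           # ITAG of the instruction owning the entry
    evictor_itag: Optional[int] = None  # set when a younger instruction displaces it


class RegisterFile:
    """Hypothetical model of register file (140) with its eviction path."""

    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}
        self.history: List[Entry] = []  # flat stand-in for history buffer (130)

    def dispatch(self, itag: int, target_lreg: int) -> Optional[Entry]:
        """Allocate the target register to `itag`, evicting any prior entry."""
        prior = self.entries.get(target_lreg)
        if prior is not None:
            prior.evictor_itag = itag   # remember which instruction evicted it
            self.history.append(prior)  # prior data moves to the history buffer
        self.entries[target_lreg] = Entry(target_lreg, itag)
        return prior


# Both instructions target register 3: the second evicts the first.
same = RegisterFile()
same.dispatch(itag=0, target_lreg=3)
evicted = same.dispatch(itag=1, target_lreg=3)

# The instructions target registers 3 and 4: the first is not evicted.
different = RegisterFile()
different.dispatch(itag=0, target_lreg=3)
untouched = different.dispatch(itag=1, target_lreg=4)
```

Recording the evictor's ITAG alongside the evicted entry is a modeling choice here; it anticipates the later discussion in which an entry is retained until its evictor completes.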
Each level in the split history buffer has associated characteristics and functionality. The L1 and L2, (132) and (134), respectively, support main line execution and performance profile. The L3 (136) is configured to support Transactional Memory (TM). It is understood in the art that TM is a shared-memory synchronization construct that allows process-threads to perform storage operations that appear to be atomic to other process-threads or applications. TM allows execution of lock-based critical sections of code without acquiring a lock. The L1 (132) includes all the write ports necessary to sink multiple writeback buses. The L1 (132) moves an entry to the L2 (134) only after the valid data has been written by the writeback buses. Responsive to movement of the L1 (132) entry to the L2 (134), the HB controller (138) invalidates the entry in the L1 (132). All writeback ITAG compares occur on the smaller number of L1 (132) entries. The L2 (134) is configured with fewer write ports than the L1 (132). In one embodiment, the L2 (134) is configured with one write port for each entry that can be moved from the L1 (132) to the L2 (134) in any given cycle. Similarly, in one embodiment, the L2 (134) is sized just large enough to support in-flight execution while not in a TM mode.
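The L1-to-L2 behavior described above can be sketched in a few lines. The model below is a sketch under stated assumptions rather than the controller logic of the embodiments: the class SplitHB, its method names, the cycle() abstraction, and the default of one L2 write port are hypothetical. It shows the writeback ITAG compare happening only against L1 entries, and an entry moving to L2 (with the L1 copy invalidated) only after its data has been written.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HBEntry:
    itag: int
    lreg: int
    data: Optional[int] = None  # None until a writeback bus delivers the result
    valid: bool = True


class SplitHB:
    """Hypothetical model of levels L1 (132) and L2 (134)."""

    def __init__(self, l2_write_ports: int = 1) -> None:
        self.l1: List[HBEntry] = []
        self.l2: List[HBEntry] = []
        self.l2_write_ports = l2_write_ports  # assumed cap on moves per cycle

    def allocate_l1(self, itag: int, lreg: int) -> None:
        self.l1.append(HBEntry(itag, lreg))

    def writeback(self, itag: int, value: int) -> None:
        """Writeback ITAG compare: only L1 entries are searched."""
        for entry in self.l1:
            if entry.valid and entry.itag == itag:
                entry.data = value

    def cycle(self) -> int:
        """Move written L1 entries to L2, limited by the L2 write ports."""
        moved = 0
        for entry in self.l1:
            if moved == self.l2_write_ports:
                break
            if entry.valid and entry.data is not None:
                self.l2.append(HBEntry(entry.itag, entry.lreg, entry.data))
                entry.valid = False  # invalidate the L1 copy after the move
                moved += 1
        self.l1 = [e for e in self.l1 if e.valid]  # reclaim invalidated slots
        return moved


hb = SplitHB(l2_write_ports=1)
hb.allocate_l1(itag=0, lreg=3)
hb.allocate_l1(itag=1, lreg=4)
hb.writeback(itag=1, value=42)  # ITAG 1 finishes first
moved_first = hb.cycle()        # only the written entry can move
hb.writeback(itag=0, value=7)
hb.cycle()
```

Because L2 entries always arrive with their data already written, this model needs no writeback path into L2, which is consistent with L2 having fewer write ports than L1.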
The L3 (136) is configured to contain all pre-TM states after instruction completion. Data can move from the L2 (134) to the L3 (136) when the core is executing TM code and the pre-TM states are already completed and removed from an associated completion table; the L3 (136) is idle in all other modes. The L3 (136) is physically configured to contain data for all architected logical registers (LREGs) for general purpose registers (GPRs) and vector and scalar registers (VSRs). It is understood that an associated transaction either passes or fails. If the transaction passes, then all pre-TM data in the L3 (136) can be discarded, and if the transaction fails, then valid L3 (136) entries can be read out and restored to the main register table. There is only one entry per LREG in the L3 (136). Since the L3 (136) only contains completed pre-TM data, the L3 (136) does not need write back, completion support, or flush support. Details of the functionality of the L3 (136) are shown and described in
Referring to
The L1 (132) is a first level history buffer with one or more entries containing evicted entry data. In one embodiment, evicted entry data are transmitted from the register file (140) to the L1 (132) responsive to an eviction operation, as shown and described in
The L2 (134) contains flush and complete compares. More specifically, the L1 moves an entry to the L2 after valid data has been written by a writeback bus (210). Referring to
The L2 level keeps the data until the evictor of the LREG is completed (212). Referring to
An entry in the L3 does not contain flush or complete logic, i.e. the L3 (136) is not speculative. The L3 (136) supports TM, and is limited to pre-TM data. There is only one entry per LREG in the L3 (136). Following step (212), an entry in the L2 with the pre_TM bit set, e.g. an entry for which the evictor ITAG and its own ITAG are completed, is identified (214) and moved to the L3 (216). At step (216) the pre-transactional memory data contained in the L2 entry is verified prior to movement to the L3. The LREG of the entry is used as an index address to write into the L3. After the entry is written, its pre_TM bit is set to 1 to indicate that this LREG is a pre-TM entry to be restored if the transaction fails (218). The entry moved to the L3 level remains in the L3 until the transaction end, e.g. Tend, is completed (220). Accordingly, entries are selectively moved from the L2 level to the L3 level, and for each moved entry the associated entry in the L2 level is invalidated.
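Steps (214) through (220), together with the pass or fail outcome of the transaction, can be sketched as below. This is an illustrative model only: the names L2Entry, L3, move_l2_to_l3, and end_transaction, the eight-register size, the single completed flag standing for completion of both the entry's ITAG and its evictor's ITAG, and the clearing of the L3 after a restore are all assumptions rather than details from the embodiments.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

NUM_LREGS = 8  # assumed small register count for the toy model


@dataclass
class L2Entry:
    lreg: int
    data: int
    pre_tm: bool     # entry holds pre-transactional data
    completed: bool  # own ITAG and evictor ITAG are both completed
    valid: bool = True


class L3:
    """Hypothetical model of L3 (136): one slot per LREG, indexed by LREG."""

    def __init__(self) -> None:
        self.data: List[Optional[int]] = [None] * NUM_LREGS
        self.pre_tm: List[bool] = [False] * NUM_LREGS

    def write(self, lreg: int, data: int) -> None:
        self.data[lreg] = data   # the LREG is the index address
        self.pre_tm[lreg] = True  # mark as restorable if the transaction fails

    def end_transaction(self, passed: bool, regs: Dict[int, int]) -> None:
        if not passed:
            # Failure: read out valid entries and restore them to the registers.
            for lreg in range(NUM_LREGS):
                if self.pre_tm[lreg]:
                    regs[lreg] = self.data[lreg]
        # Pass or fail, the pre-TM copies are dropped afterwards (assumption).
        self.data = [None] * NUM_LREGS
        self.pre_tm = [False] * NUM_LREGS


def move_l2_to_l3(l2: List[L2Entry], l3: L3) -> None:
    """Move completed pre-TM entries to L3 and invalidate them in L2."""
    for entry in l2:
        if entry.valid and entry.pre_tm and entry.completed:
            l3.write(entry.lreg, entry.data)
            entry.valid = False


regs = {2: 99, 5: 50}  # values written inside the transaction
l2 = [
    L2Entry(lreg=2, data=10, pre_tm=True, completed=True),   # moves to L3
    L2Entry(lreg=5, data=20, pre_tm=False, completed=True),  # not pre-TM data
    L2Entry(lreg=6, data=30, pre_tm=True, completed=False),  # not yet completed
]
l3 = L3()
move_l2_to_l3(l2, l3)
moved_lregs = [i for i, bit in enumerate(l3.pre_tm) if bit]
l3.end_transaction(passed=False, regs=regs)  # failed transaction: restore
```

Indexing the L3 by LREG is what makes the one-entry-per-LREG property structural in this sketch: a second write to the same LREG would land in the same slot rather than allocate a new entry.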
As shown in
Referring to
Computer system (700) includes communications fabric (702), which provides for communications between one or more processors (704), memory (706), persistent storage (708), communications unit (712), and one or more input/output (I/O) interfaces (714). Communications fabric (702) can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system. For example, communications fabric (702) can be implemented with one or more buses.
Memory (706) and persistent storage (708) are computer-readable storage media. In an embodiment, memory (706) includes random access memory (RAM) (716) and cache memory (718). In general, memory (706) can include any suitable volatile or non-volatile computer-readable storage media. Software is stored in persistent storage (708) for execution and/or access by one or more of the respective processors (704) via one or more memories of memory (706). In this document, the terms “computer program medium,” “computer usable medium,” and “computer readable medium” are used to generally refer to media such as memory (706) and persistent storage (708).
Persistent storage (708) may include, for example, a plurality of magnetic hard disk drives. Alternatively, or in addition to magnetic hard disk drives, persistent storage (708) can include one or more solid state hard drives, semiconductor storage devices, read-only memories (ROM), erasable programmable read-only memories (EPROM), flash memories, or any other computer-readable storage media that is capable of storing program instructions or digital information.
The media used by persistent storage (708) can also be removable. For example, a removable hard drive can be used for persistent storage (708). Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage (708).
Communications unit (712) provides for communications with other computer systems or devices via a network. In this exemplary embodiment, communications unit (712) includes network adapters or interfaces such as TCP/IP adapter cards, wireless Wi-Fi interface cards, or 3G or 4G wireless interface cards or other wired or wireless communication links. The network can comprise, for example, copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. Software and data used to practice embodiments can be downloaded to a computer system through communications unit (712) (e.g., via the Internet, a local area network or other wide area network). From communications unit (712), the software and data can be loaded onto persistent storage (708).
One or more I/O interfaces (714) allow for input and output of data with other devices that may be connected to computer system (700). For example, I/O interface (714) can provide a connection to one or more external devices (720) such as a keyboard, computer mouse, touch screen, virtual keyboard, touch pad, pointing device, or other human interface devices. External devices (720) can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards. I/O interface (714) also connects to display (722).
Display (722) provides a mechanism to display data to a user and can be, for example, a computer monitor. Display (722) can also be an incorporated display and may function as a touch screen, such as a built-in display of a tablet computer.
The present embodiments may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present embodiments.
The system shown and described above in
Indeed, executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices. Similarly, operational data may be identified and illustrated herein within the tool, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, as electronic signals on a system or network.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of agents, to provide a thorough understanding of the embodiments. One skilled in the relevant art will recognize, however, that the embodiments can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the embodiments.
Computer programs (also called computer control logic) are stored in memory (706) and/or persistent storage (708). Computer programs may also be received via communications unit (712). Such computer programs, when run, enable the computer system to perform the features of the present embodiments as discussed herein. In particular, the computer programs, when run, enable the processor(s) (704) to perform the features of the computer system. Accordingly, such computer programs represent controllers of the computer system.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present embodiments.
Aspects of the present embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to the various described embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present embodiments has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments. The embodiments were chosen and described in order to best explain the principles and the practical application of the embodiments, and to enable others of ordinary skill in the art to understand the embodiments with various modifications as are suited to the particular use contemplated. Accordingly, the implementation of the multi-level history buffer with different levels therein having specified functionality supports and enables reduced area on an associated substrate while supporting power consumption.
It will be appreciated that, although specific embodiments have been described herein for purposes of illustration, various modifications may be made without departing from the spirit and scope of the embodiments. In particular, in one embodiment the split history buffer may be implemented with a different quantity of levels. For example, the split history buffer may be configured with a first level, L1, similar to the L1 shown and described above, and a second level, L2, dedicated to pre-transactional memory data, or the split history buffer may be configured with four levels, including a first level, L1, a second level, L2, a third level, L3, and a fourth level, L4, with the L4 dedicated to pre-transactional memory data. In another embodiment, the third level, L3, of the history buffer dedicated to pre-transactional memory data may be in a different storage medium than the first and second levels, L1 and L2, respectively. For example, the pre-transactional memory data may be stored in the cache, scratch-pad memory, or in off-chip memory. The same movement algorithms would apply to control movement from the L2 level to the different storage medium of the L3. Accordingly, the scope of protection of these embodiments is limited only by the following claims and their equivalents.