The present invention relates to computer systems; more particularly, the present invention relates to central processing units (CPUs).
Vector processors are designed to have a specific data width. Recently, 256 bit (“b”) data width processors have been designed, replacing 128 b systems. In such processors, the execution data path may not match a maximum vector length (VL) (e.g., a 256 b path for a maximum VL of 512 b). Instructions, such as vector streaming single instruction, multiple data extension (VSSE) instructions, may contain multiple micro-operations (μops), each able to operate on the full data path width. For instance, a VSSE instruction may be decoded into two μops when fetched by a microprocessor, each μop being able to operate on 256 b of data.
However, not all VSSE operations may be performed on the full 512 b vector length. For example, various algorithms may be ported to VSSE-based code using a 128 b data length for compatibility and simplicity, which may cause the VSSE code to run slower than code using, for example, non-vector streaming single instruction, multiple data extension (SSE) instructions. In some applications, it may be undesirable for VSSE code to run slower than corresponding SSE versions of the code.
The invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:
A vector length (VL) tracker in a CPU is described. In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present invention. However, it will be apparent to one skilled in the art that embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring embodiments of the present invention.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some portions of the detailed descriptions that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
The instructions of the programming language(s) may be executed by one or more processing devices (e.g., processors, controllers, central processing units (CPUs)).
In one embodiment, main system memory 115 includes dynamic random access memory (DRAM); however, main system memory 115 may be implemented using other memory types. Additional devices may also be coupled to bus 105, such as multiple CPUs and/or multiple system memories. MCH 110 is coupled to an input/output control hub (ICH) 140 via a hub interface. ICH 140 provides an interface to input/output (I/O) devices within computer system 100.
Dispatch/execute unit 220 is an out-of-order unit that accepts a dataflow stream, schedules execution of the μops subject to data dependencies and resource availability, and temporarily stores the results of speculative executions. In other embodiments, the dispatch/execute unit 220 may be separate functional units, or include other functional units, such as a retire unit. Furthermore, in other embodiments, the dispatch/execute unit 220 may perform in-order operations in addition to or instead of out-of-order operations. Retire unit 230 is an in-order unit that commits (retires) the temporary, speculative results to permanent states. In some embodiments, the retire unit 230 may be incorporated with other functional units.
In the embodiment illustrated in
According to one embodiment, allocator 360 includes a vector length (VL) tracker 362 to track a VL value by determining a magnitude of the value, which may indicate the length of a vector (e.g., 256 b or lower, or higher than 256 b). In one embodiment, the VL value is used to set the vector length such that subsequent instructions will have a particular length corresponding to the value.
In another embodiment, setting a new VL value is performed via one or more μops that dynamically collect a new VL value by receiving the VL value from a register (e.g., VSSE arch register) during execution of the one or more μops. A μop that sets a VL value may be referred to as a “VL writer”. In yet another embodiment, a VL value may be determined from an immediate field within an instruction.
According to one embodiment, VL tracker 362 records whether the VL value is 256 b or lower, or higher than 256 b (e.g., greater than 32 B). If the VL value is 256 b or lower, a certain number of corresponding μops may be generated, whereas if the VL value is more than 256 b, another number of corresponding μops may be generated. For example, in one embodiment, if the VL value is 256 b or lower, one μop is generated; otherwise, two μops are generated. In some embodiments, if the VL writer is allocated at allocator 360 with a static (or unchanging) value, VL tracker 362 determines the number of μops that will be generated.
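The one-μop versus two-μop decision above can be illustrated with a minimal Python sketch. The function name `uops_for_vl` and the constant `DATA_PATH_BITS` are hypothetical names chosen for illustration only; the sketch assumes the example embodiment in which the data path is 256 b wide and the maximum VL is 512 b.

```python
# Illustrative sketch only; names are hypothetical and not part of the described hardware.
DATA_PATH_BITS = 256  # execution data path width used in the example embodiment


def uops_for_vl(vl_bits: int) -> int:
    """Return how many uops one vector instruction would generate for a given VL."""
    if vl_bits <= 0:
        raise ValueError("vector length must be positive")
    # One uop covers a VL of 256 b or lower; otherwise two uops are generated.
    return 1 if vl_bits <= DATA_PATH_BITS else 2
```

Under these assumptions, code ported with a 128 b data length would generate a single μop per instruction rather than two.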
In one embodiment, if the VL writer is allocated with a dynamic (or changing) value, tracker 362 goes into a pending state where tracker 362 predicts that the VL will be greater than 256 b. Consequently, a certain number of μops, such as two μops, are generated. After the VL writer is executed, the new VL value is broadcast to allocator 360 and tracker 362 goes into the state corresponding to that value (e.g., greater than 32 B), where it continues to operate until a new VL value is received.
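The pending-state behavior may be easier to follow as a small state machine. The following Python sketch is one possible software model, not the described circuit: the class and method names (`VLTracker`, `allocate_vl_writer`, `on_vl_broadcast`) are hypothetical, and the initial state of "greater than 256 b" is an assumption made only so the model has a starting point.

```python
# Illustrative software model only; all names and the initial state are assumptions.
from enum import Enum


class VLState(Enum):
    NARROW = "VL is 256 b or lower"
    WIDE = "VL is greater than 256 b"
    PENDING = "VL not yet known; predicted greater than 256 b"


class VLTracker:
    def __init__(self):
        self.state = VLState.WIDE  # assumed starting state

    @staticmethod
    def _classify(vl_bits):
        return VLState.NARROW if vl_bits <= 256 else VLState.WIDE

    def allocate_vl_writer(self, static_vl_bits=None):
        """A static value is classified at once; a dynamic one enters PENDING."""
        if static_vl_bits is None:
            self.state = VLState.PENDING
        else:
            self.state = self._classify(static_vl_bits)

    def on_vl_broadcast(self, vl_bits):
        """Called when an executed VL writer broadcasts its value to the allocator."""
        self.state = self._classify(vl_bits)

    def uops_per_instruction(self):
        # PENDING is treated like WIDE, so two uops are generated while waiting.
        return 1 if self.state is VLState.NARROW else 2
```

Predicting the wider length while pending is the conservative choice in this model: two μops can always cover the data, whereas a single μop could not if the VL turned out to exceed 256 b.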
In one embodiment, μop execution may occur in a different order than the program order from which the corresponding instructions originated. In such an embodiment, VL values and corresponding state information may not be received by the allocator 360 until the VL writer is actually retired by the retirement unit. In another embodiment, multiple VL writers may exist concurrently within a processor's pipeline.
In such an embodiment, VL tracker 362 may track an identification indicator (ID) of the last allocated VL writer, causing an updated VL value to be stored in the VL tracker in response to the last VL writer being executed. In one embodiment, the VL tracker 362 updates the VL if the stored ID matches the ID of a particular VL writer that has been executed and whose corresponding VL value has been communicated to the VL tracker.
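The ID-matching rule for multiple in-flight VL writers can be sketched as follows. This Python model is an illustration under assumptions: the class name `IdTrackingVLTracker`, its methods, and the use of `None` to represent a not-yet-known VL are all hypothetical.

```python
# Illustrative model only; names and the use of None for "pending" are assumptions.
class IdTrackingVLTracker:
    def __init__(self):
        self.last_writer_id = None  # ID of the most recently allocated VL writer
        self.vl_bits = None         # None models a VL that is not yet known

    def allocate_vl_writer(self, writer_id):
        """Remember only the most recently allocated VL writer."""
        self.last_writer_id = writer_id
        self.vl_bits = None

    def on_writer_executed(self, writer_id, vl_bits):
        """Accept the value only if it comes from the last allocated writer."""
        if writer_id != self.last_writer_id:
            return False  # value from an earlier writer; ignored
        self.vl_bits = vl_bits
        return True
```

In this model, a value reported by an earlier VL writer is ignored because a later writer has already been allocated and will supersede it.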
In some embodiments, VL tracker 362 may use the stored ID to handle branch mispredictions if, for example, the VL writer is in a branch that has been mispredicted. If the branch is mispredicted, tracker 362 determines whether the stored ID was available before the branch was generated (i.e., the ID is older than the branch). In one embodiment, if the ID is older, the VL value associated with the ID may be considered to be the correct value.
If the ID became available after the branch was generated (i.e., the ID is younger than the branch), the ID is discarded or otherwise not used. Once the ID is discarded, tracker 362 may return to the pending state described above, in which it may be presumed that VL will be greater than 256 b. Alternatively, tracker 362 may restore and use a previous VL value for subsequent VSSE tracking operations.
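The recovery choices after a branch misprediction can be summarized in a short Python sketch. It assumes, purely for illustration, that IDs are sequence numbers that increase in program order, so that a smaller number means older; the function name `recover_after_mispredict` and its parameters are hypothetical.

```python
# Illustrative sketch only; assumes IDs are increasing program-order sequence numbers.
def recover_after_mispredict(tracked_id, tracked_vl, branch_id, previous_vl=None):
    """Return the (id, vl) pair the tracker would hold after a misprediction.

    A returned vl of None models the pending state, in which the VL is
    presumed to be greater than 256 b.
    """
    if tracked_id is None or tracked_id < branch_id:
        # The VL writer is older than the branch: its value is still correct.
        return tracked_id, tracked_vl
    # The VL writer is younger than the branch: discard its ID, then either
    # go pending (previous_vl is None) or restore a previously saved VL value.
    return None, previous_vl
```

Passing `previous_vl` models the alternative in which a prior VL value is restored; omitting it models the return to the pending state.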
According to one embodiment, VL tracker 362 also handles narrow vectors, where all of the bits of a destination register that are higher in order than the vector length are to be zeroed. For narrow vectors, a problem may occur in which one μop updates the lower 256 b of the vector register while the higher 256 b is not affected. Therefore, if the VL value is changed back to 512 b and another vector μop is to read the full vector register, the validity of the higher bit values is uncertain since only the lower 256 b have been updated.
In one embodiment, VL tracker 362 maintains a zero bit for the higher 256 b to indicate that the higher 256 b are to be read as zero following narrow vectors. In this embodiment, the zero bit is stored in RAT 350. Thus, for every VSSE arch register, a bit is added in RAT 350 to record whether the upper 256 b are all zeroes. The bit is set whenever the VL tracker 362 state is greater than 32 B and cleared when in the opposite state.
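The effect of the zero bit on a full-width read can be sketched in Python. This models only the read side, with the zero bit supplied as an input; when the bit is set or cleared is left to the description above. The function name `read_full_register` and its parameters are hypothetical, and a Python integer stands in for a 512 b register.

```python
# Illustrative sketch only; models how a full-width read might honor the zero bit.
HALF_BITS = 256
LOWER_MASK = (1 << HALF_BITS) - 1  # selects the lower 256 b of a 512 b value


def read_full_register(physical_value: int, upper_is_zero: bool) -> int:
    """Return the 512 b value seen by a full-width read.

    physical_value may hold stale data in its upper 256 b after a narrow
    write; upper_is_zero models the per-register zero bit kept in the RAT.
    """
    if upper_is_zero:
        return physical_value & LOWER_MASK  # upper half is read as zero
    return physical_value
```

For instance, a register whose upper half still holds stale data would be read as though that half were zero whenever its zero bit is set, without a second μop having to clear it physically.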
Embodiments of the invention described above may improve performance of processing narrow vectors and may enable porting of software using SSE instructions to software using VSSE instructions that use the same vector length while maintaining substantially equivalent performance.
Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims which in themselves recite only those features regarded as essential to the invention.