Field
In one aspect, the following disclosure relates to microprocessor microarchitecture, and in a more particular aspect, to microprocessor memory access. More particularly, apparatus, systems, and methods relate to a better way of managing data regardless of whether it is in little endian or big endian format. Specifically, the apparatus, systems, and methods provide for a single load instruction and a single store instruction regardless of the data element size encoded in the data loaded by the single load instruction or stored by the single store instruction.
Related Art
In processor architecture and implementations of processor architectures, the term “endian” refers to an ordering of data in a computer's memory. In particular, “endian” refers to a relative order of storage in memory of component elements of a multi-component data element. In many implementations, each component is a byte, and a multi-component data element is a four-byte or larger data element. There are two common types of “endian”: big and little. Big endian means that the most significant part of a value being stored is stored in the lowest (smallest) memory address. Conversely, little endian means that the least significant part of a value being stored is stored in the lowest (smallest) memory address. For example, starting from an address A in a byte-addressed memory, a 4-byte value 0A0B0C0Dh (hexadecimal) in a big endian machine would be stored as 0A at address A, 0B at address A+1, 0C at address A+2, and 0D at address A+3. Conversely, a little endian machine would store value 0D at address A, and so on. While both instruction and data accesses must observe an endian convention between storage and retrieval of data, the layout of instructions in memory is more likely to be compiler-automated, while storage of data may be more directly controlled by a programmer, and thus observing an appropriate endian convention may be more of a concern with respect to data accesses.
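The byte layout described above can be illustrated with a short Python sketch using the standard `struct` module; the value and address offsets mirror the 0A0B0C0Dh example (byte index 0 stands in for address A):

```python
import struct

value = 0x0A0B0C0D

# Pack the same 32-bit value under both conventions.
big = struct.pack(">I", value)     # big endian: most significant byte first
little = struct.pack("<I", value)  # little endian: least significant byte first

# Byte at "address A" is index 0, address A+1 is index 1, and so on.
print([hex(b) for b in big])     # ['0xa', '0xb', '0xc', '0xd']
print([hex(b) for b in little])  # ['0xd', '0xc', '0xb', '0xa']
```
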
Both big and little endian machines are in common use. Network byte order is big endian. Some processor architectures are only big endian and some are only little endian. Some processor architectures allow either. MIPS reduced instruction set computing (RISC) architecture is an example of such an architecture.
Summary
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In one aspect, this disclosure relates to a system (e.g. implemented as a processor, a processor core in a multiprocessor system, a virtualized core executing on physical processing resources, and so on) that can operate according to big endian or little endian conventions and performs load operations from memory to register and store operations from register to memory with register contents that vary based on a current endian mode of the system. Such a system can support Single Instruction Multiple Data (SIMD) operations on the register, for differently sized elements, such as byte, half-word, word, and double-word sized elements in a register that is 128 or more bits wide. The system loads and stores data and/or instructions without sensitivity to the size of the elements being loaded or stored, but is sensitized to endian mode (a.k.a. “endianness”). Because the register content varies based on endian mode, at least some SIMD operations are sensitized both to the endian mode and to the element size of the operation.
Such a system may be controllable through instructions determined according to an instruction set architecture (ISA). An ISA, according to some aspects of the disclosure, includes load and store instructions that can function to load data from and store data to memory for the purpose of conducting SIMD operations on the data. These load and store instructions do not have variants that are sensitized to the element size of the SIMD operation to be performed on such data. An ISA, according to some aspects of the disclosure, includes arithmetic operations that operate on different element sizes within a fixed register size (e.g., a quad-word multiplication in a 128-bit register). These instructions specify a data element size, and an execution unit that ultimately performs the instruction uses an indication of endian mode to determine where, within the source registers, particular elements to be used in the execution of that instruction are found. ISAs, according to the disclosure, also may provide instructions that search for a first appearance of a specified byte value within a register. Execution units provided to execute such instructions also are sensitized to endian mode, so that a correct search order of the register can be identified and implemented.
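As one illustration of how such a byte-search instruction might be sensitized to endian mode, the following Python sketch models a 16-byte register and scans it in an order determined by the mode. The function name and the convention that the byte sequence is stored most-significant-byte-first are assumptions for illustration, not details taken from the disclosure:

```python
def find_first_byte(register: bytes, target: int, big_endian: bool) -> int:
    """Behavioral model of a byte-search instruction over a 16-byte register.

    Returns the element index of the first occurrence of `target`, or -1
    if absent. The physical scan order over the register differs by endian
    mode, since the mode determines where element 0 sits in the register.
    """
    assert len(register) == 16
    # Modeling choice: the byte sequence is held most-significant-first,
    # so BE mode scans left-to-right and LE mode scans right-to-left.
    order = range(16) if big_endian else range(15, -1, -1)
    for logical, physical in enumerate(order):
        if register[physical] == target:
            return logical
    return -1
```

The same register thus yields different search results in the two modes, which is why the execution unit for such an instruction must observe the endian mode.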
Some processor architectures are designed to support instructions that transfer data to/from memory, or perform operations on values in registers, but do not support instructions that perform operations on data resident in memory (i.e., instructions that perform operations may have only register sources and register destinations). Load and store instructions, respectively, either load data into registers from memory or store data to memory from registers. Architectures with these types of “load/store” and “register-register” instructions are called Reduced Instruction Set Computing (RISC) architectures.
A processor may include arithmetic units that can perform Single Instruction Multiple Data (SIMD) operations on different data widths using a register of a given size. For example, registers can be 64, 128, or 256 bits wide and so on. In an example, a processor could support SIMD operations on 8, 16, 32, or 64-bit data widths in a 128-bit register. For example, a word-sized (32-bit) multiply instruction would multiply four words of data in a 128-bit register with four words of data in another 128-bit register.
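The word-sized SIMD multiply described above can be modeled in Python as a behavioral sketch (not an implementation of any particular ISA): two 128-bit register values are split into four 32-bit lanes, and corresponding lanes are multiplied with each product truncated to 32 bits:

```python
def simd_mul_words(reg_a: int, reg_b: int) -> int:
    """Lane-wise 32-bit multiply of two 128-bit register values.

    Each register value is treated as four independent 32-bit word
    lanes; corresponding lanes are multiplied and truncated to 32 bits.
    """
    mask = (1 << 32) - 1
    result = 0
    for lane in range(4):
        a = (reg_a >> (32 * lane)) & mask
        b = (reg_b >> (32 * lane)) & mask
        result |= ((a * b) & mask) << (32 * lane)
    return result

# Lanes (1, 2, 3, 4) times lanes (10, 10, 10, 10) -> lanes (10, 20, 30, 40).
reg_a = 1 | (2 << 32) | (3 << 64) | (4 << 96)
reg_b = 10 | (10 << 32) | (10 << 64) | (10 << 96)
```
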
In general and as illustrated in
Thus, load and store instructions in a processor that supports both big and little endian modes have additional complexity because they must be able to differentiate between big and little endian, so that the LSB and MSB of a value are moved correctly between the memory and the register(s). Currently, this also applies to loads and stores that occur when performing SIMD operations. In particular, load instructions and store instructions that operate on SIMD data of different widths require different treatment between big and little endian operation. In other words, a given register width needs to have data mapped differently for different operand widths between big and little endian machines. In practice, this means that there need to be different instructions for SIMD loads of different widths. For example, there needs to be a different instruction for a load byte, a different instruction for a load word, a different instruction for a load double word, and so on. This also means that the operational code (opcode) portion of load and store instructions needs bits to specify what size of data is loaded or stored. Those of ordinary skill in this art will appreciate that opcode bits in some load and store instructions are scarce, and it is desirable to use as few opcode bits as possible.
As would be apparent from comparing the contents of the registers in LE and BE mode, in BE mode the register needs to be loaded based on the element size of the operation that is intended to be performed. As such, a processor architecture that supports BE operation would need to be able to map the contents of memory into registers, as in the example. Thus, such an architecture needs to have load instructions that specify the data element size intended to be loaded when intending to perform some kind of operation on subsets of the data in the register, such as, but not exclusively, an SIMD operation. For example, consider a load instruction that loads 128 bits of data containing data elements of word size, each 32 bits, so that this example load brings in four 32-bit word data elements. Thus, this example loads source data for an SIMD instruction so that the register 28 contains source data for four different operations to be executed. Thus,
Because SIMD instructions, such as Arithmetic Logic Unit (ALU) instructions, may contain multiple data elements, an advantage to always providing the same arrangement of register contents regardless of endian mode is that register-to-register instructions do not need to be sensitized to an endian mode. In other words, once data is loaded into a register according to the above description, it is normalized for endianness, so instructions that operate only on registers (i.e., that only read registers as sources and write a result to a register, such as math operations) do not need to observe an endian convention that varies based on the operating mode.
As discussed next with reference to
While this presents some advantages, it may also at times present an asymmetry between how a machine behaves in BE and LE modes. This asymmetry results from the implicit requirement that the location of the LSB (and MSB) be the same for both big and little endian operation. In other words, the instructions that perform register-only operations (e.g., ADD or MUL) view the LSB of the register as being in the same location for both BE and LE operation.
For example, consider that the array location of word data element i[0] is found at the far left of the register 40 in BE mode, while the same word data element i[0] is found at the far right of the register 40 in LE mode. In prior architectures, the array location of the word data element i[0] would be loaded at the far right, as in LE mode, for both BE and LE modes.
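The placement described above can be sketched as follows (a simplified Python model; the helper name and lane numbering are illustrative, not taken from the disclosure):

```python
def place_words(elements, big_endian: bool):
    """Return the left-to-right lane order of four word elements in a
    128-bit register, per the arrangement described above.

    BE mode: element i[0] lands in the leftmost lane.
    LE mode: element i[0] lands in the rightmost lane.
    """
    assert len(elements) == 4
    return list(elements) if big_endian else list(reversed(elements))
```

For example, `place_words(["i[0]", "i[1]", "i[2]", "i[3]"], big_endian=True)` places i[0] at the far left, while the same call with `big_endian=False` places it at the far right.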
As such, the data arranging that occurs in embodiments according to
‘Processor’ and ‘Logic’, as used herein, includes but is not limited to hardware, firmware, software and/or combinations of each to perform a function(s) or an action(s), and/or to cause a function or action from another logic, method, and/or system. For example, based on a desired application or needs, logic and/or processor may include a software-controlled microprocessor, discrete logic, an application specific integrated circuit (ASIC), a programmed logic device, a memory device containing instructions or the like. Logic and/or processor may include one or more gates, combinations of gates, or other circuit components. Logic and/or a processor may also be fully embodied as software. Where multiple logics and/or processors are described, it may be possible to incorporate the multiple logics and/or processors into one physical logic (or processor). Similarly, where a single logic and/or processor is described, it may be possible to distribute that single logic and/or processor between multiple physical logics and/or processors.
The example method 400 begins at 403 by decoding a previously fetched instruction to determine its source and destination operands and what type of instruction is to be executed. Next, the decoded instruction is routed to a reservation station at 405. As discussed later, a reservation station provides a way for an out-of-order pipeline to write speculative intermediate results of the execution of an instruction before fully retiring an instruction in programming order. When the instruction requires a load from memory, a load is performed in the load store unit (LSU) pipeline, at 410, that loads one or more registers from memory. This load is performed based, at least in part, on the endian mode (BE or LE). However, the load is invariant as to whether a register is loaded with SIMD data and as to the data element size of the SIMD data. For example, based on endian mode, the register may be loaded as illustrated in
After being associated with an entry in the reservation station, the instruction is scheduled for execution at 415. As the instruction is prepared for execution, source operands associated with the instruction are obtained from corresponding registers at 423. When required, the source operand bytes/bits are aligned at 425 based, at least in part, on the endian mode as well as the data element size (byte, half-word, word, double-word) of the SIMD operation of the instruction. The instruction is executed/performed at 428 and, when required, the resulting bytes/bits are aligned based, at least in part, on the endian mode at 430. The destination register(s) and, if needed, the reservation station are updated with the execution results at 432. When a store is needed to write results to memory, the method 400 at 434 sends data to be stored to the LSU pipeline so that the results are stored to memory based on endian mode, but invariant as to whether the register stores SIMD data, and invariant as to the size of the SIMD data being stored.
Referring back to
Unlike the load instruction example of
Similar to
The fetch logic 52 pre-fetches software instructions from memory that the processor 50 will execute. These pre-fetched instructions are placed in an instruction cache 54. These instructions are later removed from the instruction cache 54 by the decode and rename logic 56 and decoded into instructions that the processor can process. These instructions are also renamed and placed in the instruction queue 58. The decode and rename logic 56 also provides information associated with branch instructions to the branch predictor and Instruction Translation Lookaside Buffers (ITLBs) 60. The branch predictor and ITLBs 60 predict branches and provide this branch prediction information to the fetch logic 52 so that instructions of predicted branches are fetched.
A re-order buffer 62 stores results of speculatively completed instructions that may not be ready to retire in programming order. The re-order buffer 62 may also be used to unroll mispredicted branches. The reservation station(s) 68 provide a location for instructions to write their results to without requiring a register to become available. The reservation station(s) 68 also provide for register renaming and dynamic instruction rescheduling. The commit unit 60 determines when instruction data values are ready to be committed/loaded into one or more registers in the register file 72. The load and store unit 66 monitors load and store instructions to and from memory to be sure this memory data follows sequential program order, even though the processor 50 is speculatively executing instructions out of order. For example, the load and store unit will not allow a load to load data from a memory location that a pending older store instruction has not yet written.
Instructions are executed in one or more out-of-order pipeline(s) 70 that are not required to execute instructions in programming order. In general, instructions eventually write their results to the register file 72.
Processors according to the above disclosure can enjoy the following benefits (although such benefits are not necessarily present in all implementations): in some processor architectures, both load and store instructions require significantly more opcode space to express than register-to-register instructions. For example, a load with a source, destination, and a 16-bit offset would typically require five bits to identify 1 of 32 source registers, five bits to identify 1 of 32 destination registers, and 16 address offset bits, for a total of 26 bits, leaving the last six bits of a 32-bit instruction free to identify the instruction as a load. By contrast, a register-to-register instruction without an offset would need only 15 bits to specify three registers, leaving more opcode bits available to identify the type of instruction. In another configuration, an architecture may have 32 registers, with each register identified with five bits, and may have an opcode of six bits. Such an architecture would use only 21 bits for encoding such a register-to-register operation. While a RISC architecture may still store that instruction in 32 bits, using only 21 bits of the space allows many more instructions to be encoded. Alternatively, more registers can be addressed, or some combination thereof. Further, the lane arranging logic 306 in the memory interface 302 (
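The bit-budget arithmetic above can be checked with a short calculation; the register counts and field widths are those of the example, not values fixed by the disclosure:

```python
REGISTER_FIELD = 5   # bits to name 1 of 32 registers
OFFSET_FIELD = 16    # bits for the address offset
WORD = 32            # instruction width in a typical RISC encoding

# Load/store: two register fields plus a 16-bit offset.
load_operand_bits = 2 * REGISTER_FIELD + OFFSET_FIELD   # 26
load_opcode_bits = WORD - load_operand_bits             # 6

# Register-to-register: three register fields, no offset.
rr_operand_bits = 3 * REGISTER_FIELD                    # 15
rr_opcode_bits = WORD - rr_operand_bits                 # 17

print(load_opcode_bits, rr_opcode_bits)  # 6 17
```

The 11 extra opcode bits available to register-to-register encodings are what allow many more such instructions to be distinguished within a fixed 32-bit instruction word.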
The term ‘processor’ further includes any of a variety of machine structures that can process or handle data, including, for example, a Digital Signal Processor, fixed function circuitry, input/output (I/O), or even functional units within a processor. Still further, ‘processor’ includes virtualized execution resources, such that one set of physical execution resources can be abstracted as multiple virtual processors. An operative distinction is whether support is provided for prefetching data into relatively local storage from relatively remote storage; subsidiary distinctions that may call for implementing the disclosure include the capability of reordering demand requests, whether from out-of-order processing, multithreading, or both.
Modern general purpose processors regularly require in excess of two billion transistors to be implemented, while graphics processing units may have in excess of five billion transistors. Such transistor counts are likely to increase. Such processors have used these transistors to implement increasingly complex operation reordering, prediction, more parallelism, larger memories (including more and bigger caches) and so on. As such, it becomes necessary to be able to describe or discuss technical subject matter concerning such processors, whether general purpose or application specific, at a level of detail appropriate to the technology being addressed. In general, a hierarchy of concepts is applied to allow those of ordinary skill to focus on details of the matter being addressed.
For example, high level features, such as what instructions a processor supports, convey architectural-level detail. When describing high-level technology, such as a programming model, such a level of abstraction is appropriate. Microarchitecture detail describes high level detail concerning an implementation of an architecture (even as the same microarchitecture may be able to execute different ISAs). Microarchitecture detail typically describes different functional units and their interrelationship, such as how and when data moves among these different functional units. As such, referencing these units by their functionality is also an appropriate level of abstraction, rather than addressing implementations of these functional units, since each of these functional units may themselves comprise hundreds of thousands or millions of gates. When addressing some particular feature of these functional units, it may be appropriate to identify constituent functions of these units, and abstract those, while addressing in more detail the relevant part of that functional unit.
Eventually, a precise logical arrangement of the gates and interconnect (a netlist) implementing these functional units (in the context of the entire processor) can be specified. However, how such logical arrangement is physically realized in a particular chip (how that logic and interconnect is laid out in a particular design) still may differ in different process technology and for a variety of other reasons. Many of the details concerning producing netlists for functional units as well as actual layout are determined using design automation, proceeding from a high level logical description of the logic to be implemented (e.g., a “hardware description language”).
The term “circuitry” does not imply a single electrically connected set of circuits. Circuitry may be fixed function, configurable, or programmable. In general, circuitry implementing a functional unit is more likely to be configurable, or may be more configurable, than circuitry implementing a specific portion of a functional unit. For example, an Arithmetic Logic Unit (ALU) of a processor may reuse the same portion of circuitry differently when performing different arithmetic or logic operations. As such, that portion of circuitry is effectively circuitry or part of circuitry for each different operation, when configured to perform or otherwise interconnected to perform each different operation. Such configuration may come from or be based on instructions, or microcode, for example.
In all these cases, describing portions of a processor in terms of its functionality conveys structure to a person of ordinary skill in the art. In the context of this disclosure, the term “unit” refers, in some implementations, to a class or group of circuitry that implements the function or functions attributed to that unit. Such circuitry may implement additional functions, and so identification of circuitry performing one function does not mean that the same circuitry, or a portion thereof, cannot also perform other functions. In some circumstances, the functional unit may be identified, and then functional description of circuitry that performs a certain feature differently, or implements a new feature, may be described. For example, a “decode unit” refers to circuitry implementing decoding of processor instructions. The description explicates that in some aspects, such decode unit, and hence circuitry implementing such decode unit, supports decoding of specified instruction types. Decoding of instructions differs across different architectures and microarchitectures, and the term makes no exclusion thereof, except for the explicit requirements of the claims. For example, different microarchitectures may implement instruction decoding and instruction scheduling somewhat differently, in accordance with design goals of that implementation. Similarly, there are situations in which structures have taken their names from the functions that they perform. For example, a “decoder” of program instructions, that behaves in a prescribed manner, describes structure that supports that behavior. In some cases, the structure may have permanent physical differences or adaptations from decoders that do not support such behavior. However, such structure also may be produced by a temporary adaptation or configuration, such as one caused under program control, microcode, or other source of configuration.
Different approaches to design of circuitry exist. For example, circuitry may be synchronous or asynchronous with respect to a clock. Circuitry may be designed to be static or be dynamic. Different circuit design philosophies may be used to implement different functional units or parts thereof. Absent some context-specific basis, “circuitry” encompasses all such design approaches.
Although circuitry or functional units described herein may be most frequently implemented by electrical circuitry, and more particularly, by circuitry that primarily relies on a transistor implemented in a semiconductor as a primary switch element, this term is to be understood in relation to the technology being disclosed. For example, different physical processes may be used in circuitry implementing aspects of the disclosure, such as optical, nanotubes, micro-electrical mechanical elements, quantum switches or memory storage, magneto-resistive logic elements, and so on. Although a choice of technology used to construct circuitry or functional units according to the technology may change over time, this choice is an implementation decision to be made in accordance with the then-current state of technology. This is exemplified by the transitions from using vacuum tubes as switching elements to using circuits with discrete transistors, to using integrated circuits, and advances in memory technologies, in that while there were many inventions in each of these areas, these inventions did not necessarily change how computers fundamentally worked. For example, the use of stored programs having a sequence of instructions selected from an instruction set architecture was an important change from a computer that required physical rewiring to change the program, but subsequently, many advances were made to various functional units within such a stored-program computer.
Functional modules may be composed of circuitry, where such circuitry may be a fixed function, configurable under program control or under other configuration information, or some combination thereof. Functional modules themselves thus may be described by the functions that they perform, to helpfully abstract how some of the constituent portions of such functions may be implemented.
In some situations, circuitry and functional modules may be described partially in functional terms, and partially in structural terms. In some situations, the structural portion of such a description may be described in terms of a configuration applied to circuitry or to functional modules, or both.
Although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, a given structural feature may be subsumed within another structural element, or such feature may be split among or distributed to distinct components. Similarly, an example portion of a process may be achieved as a by-product or concurrently with performance of another act or process, or may be performed as multiple, separate acts in some implementations. As such, implementations according to this disclosure are not limited to those that have a 1:1 correspondence to the examples depicted and/or described.
Above, various examples of computing hardware and/or software programming were explained, as well as examples of how such hardware/software can intercommunicate. These examples of hardware or hardware configured with software and such communication interfaces provide means for accomplishing the functions attributed to each of them. For example, a means for performing implementations of software processes described herein includes machine-executable code used to configure a machine to perform such process. Some aspects of the disclosure pertain to processes carried out by limited configurability or fixed function circuits and in such situations, means for performing such processes include one or more of special purpose and limited-programmability hardware. Such hardware can be controlled or invoked by software executing on a general purpose computer.
Implementations of the disclosure may be provided for use in embedded systems, such as televisions, appliances, vehicles, or personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, game consoles, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets and the like.
In addition to hardware embodiments (e.g., within or coupled to a Central Processing Unit (“CPU”), microprocessor, microcontroller, digital signal processor, processor core, System on Chip (“SOC”), or any other programmable or electronic device), implementations may also be embodied in software (e.g., computer readable code, program code, instructions and/or data disposed in any form, such as source, object or machine language) disposed, for example, in a computer usable (e.g., readable) medium configured to store the software. Such software can enable, for example, the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++), GDSII databases, hardware description languages (HDL) including Verilog HDL, VHDL, SystemC Register Transfer Level (RTL), and so on, or other available programs, databases, and/or circuit (i.e., schematic) capture tools. Embodiments can be disposed in computer usable medium including non-transitory memories such as memories using semiconductor, magnetic disk, optical disk, ferrous, resistive memory, and so on.
As specific examples, it is understood that implementations of disclosed apparatuses and methods may be implemented in a semiconductor intellectual property core, such as a microprocessor core, or a portion thereof, embodied in a Hardware Description Language (HDL), that can be used to produce a specific integrated circuit implementation. A computer readable medium may embody or store such description language data, and thus constitute an article of manufacture. A non-transitory machine readable medium is an example of computer-readable media. Examples of other embodiments include computer readable media storing Register Transfer Language (RTL) description that may be adapted for use in a specific architecture or microarchitecture implementation. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software that configures or programs hardware.
Also, in some cases, terminology has been used herein because it is considered to more reasonably convey salient points to a person of ordinary skill, but such terminology should not be considered to imply a limit as to a range of implementations encompassed by disclosed examples and other aspects. A number of examples have been illustrated and described in the preceding disclosure. By necessity, not every example can illustrate every aspect, and the examples do not illustrate exclusive compositions of such aspects. Instead, aspects illustrated and described with respect to one figure or example can be used or combined with aspects illustrated and described with respect to other figures. As such, a person of ordinary skill would understand from these disclosures that the above disclosure is not limiting as to constituency of embodiments according to the claims, and rather the scope of the claims define the breadth and scope of inventive embodiments herein. The summary and abstract sections may set forth one or more but not all exemplary embodiments and aspects of the invention within the scope of the claims.