Instruction set encodings may affect the performance of central processing units (CPUs). They involve a trade-off between maximizing the amount of work that can be performed in parallel and minimizing program size, which reduces the resources needed to execute a program. Some existing systems use fixed-width instructions, such as 32-bit instructions. These systems support superscalar computer architectures that fetch and decode multiple instructions simultaneously. The instructions may then be executed in parallel. However, this type of system requires all instructions, including simple instructions, to have the same length. For example, if all instructions are 32 bits, even simple instructions that require only a few bits are padded to the fixed length of 32 bits, which increases program size by making simple instructions longer than necessary.
Other existing systems use variable-length instructions, which can be difficult to decode in parallel. This difficulty arises because the system needs to decode a first instruction to find the instruction length before the system can determine where a second instruction begins. Although there are techniques to reduce this limitation, these techniques may require significant additional processing or may require a larger area on the silicon chip to implement caches for processing various instructions of different lengths.
This document discloses systems and methods related to a parallel decode instruction set computer architecture with variable-length instructions. In some aspects, a hybrid encoding approach is used that avoids the wasted resources of parallel decoding with variable-length instructions and also avoids the inefficient encoding of fixed-length instructions. For example, the hybrid encoding approach may use an instruction format that includes a fixed-length prefix and a variable-length suffix for each instruction.
In various aspects, a processor receives an instruction block for execution. A decoder identifies multiple fixed-length prefixes in the instruction block and identifies multiple variable-length suffixes in the instruction block. Each of the multiple fixed-length prefixes can be associated with one of the variable-length suffixes. The instruction block is then executed based on the multiple variable-length suffixes. By so doing, the described systems and methods may be implemented in a manner that reduces program size and reduces the required area on the silicon chip.
This Summary is provided to introduce simplified concepts for implementing a parallel decode instruction set computer architecture with variable-length instructions. The simplified concepts are further described below in the Detailed Description. This Summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.
The details of one or more aspects of the described systems and methods are described below. The use of the same reference numbers in different instances in the description and the figures indicates similar elements.
The described systems and methods provide a parallel decode instruction set computer using variable-length instructions. These instructions may be referred to as hybrid instructions. In some aspects, the systems and methods may separate each hybrid instruction into two parts: a prefix and a suffix. As described herein, each prefix may have a fixed length (e.g., a fixed number of bits) at a fixed location within the prefix portion of the instruction. Each suffix may have a variable length. The prefix may contain data that indicates the length (e.g., in bits) of the associated suffix. Other sections of the prefix are optional and may depend on how the particular instruction set is defined. For example, a prefix may include an instruction identifier and data associated with the instruction.
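As a minimal sketch of such a prefix, the following assumes a hypothetical 16-bit layout with a 4-bit suffix-length field (in bytes), a 6-bit instruction identifier, and 6 bits of inline data. The field widths, bit positions, and function names are illustrative assumptions and are not defined by this description.

```python
# Hypothetical 16-bit prefix layout (assumed for illustration only):
#   bits 0-3    length of the associated suffix, in bytes
#   bits 4-9    instruction identifier
#   bits 10-15  inline data used by the instruction
LEN_BITS, ID_BITS, DATA_BITS = 4, 6, 6


def pack_prefix(suffix_len: int, instr_id: int, data: int = 0) -> int:
    """Build a prefix word from its three fields."""
    if not 0 <= suffix_len < (1 << LEN_BITS):
        raise ValueError("suffix length does not fit in the length field")
    if not 0 <= instr_id < (1 << ID_BITS):
        raise ValueError("instruction identifier does not fit in its field")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("inline data does not fit in its field")
    return suffix_len | (instr_id << LEN_BITS) | (data << (LEN_BITS + ID_BITS))


def suffix_length(prefix: int) -> int:
    """Read the suffix length; it sits at the same bits in every prefix."""
    return prefix & ((1 << LEN_BITS) - 1)


def instruction_id(prefix: int) -> int:
    """Read the instruction identifier field."""
    return (prefix >> LEN_BITS) & ((1 << ID_BITS) - 1)


def inline_data(prefix: int) -> int:
    """Read the inline data field."""
    return (prefix >> (LEN_BITS + ID_BITS)) & ((1 << DATA_BITS) - 1)
```

Because the length field occupies the same bits in every prefix under this assumed layout, reading it requires no knowledge of the instruction identifier or of any other prefix.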
In some aspects of the described systems and methods, portions of the instructions may be decoded in parallel while also providing variable-length instructions. These systems and methods may be implemented in a manner that reduces program size and reduces the required area on the silicon chip.
In some aspects, program code is received by decoder 102, which decodes instructions in the received program code. ALU 104 performs integer calculations as needed for particular instructions. Integer calculations involve mathematical calculations with integers (i.e., whole numbers). FPU 106 performs floating-point calculations as needed for specific instructions. Floating-point is a technique for representing numbers with a decimal place in a binary form. Floating-point calculations are handled differently than integer calculations.
In some aspects, ALU 104 accesses values in registers and performs a variety of operations on those values. In particular implementations, CPU 100 may include multiple ALUs 104 that can operate independently of one another. Similarly, FPU 106 may access values in registers and perform various operations on those values. In some aspects, CPU 100 may include multiple FPUs 106 that can operate independently of one another. Cache 108 is capable of storing various data being written to RAM 110 or read from RAM 110.
As discussed herein, prefixes 206-220 have fixed lengths (as shown in
In some aspects, each prefix is associated with a particular suffix. For example, as shown in
The configuration of instruction block 200, with eight fixed-length prefixes 206-220, allows the prefixes to be decoded in parallel. Since each prefix 206-220 has the same length, it is a simple process to identify the starting location for each prefix in memory. For example, the starting location in memory for prefix 206 is known based on starting point 202. The starting location in memory for the next prefix (208) is easily determined by adding the fixed prefix length (in bits) to the starting location of prefix 206. This process continues to find the starting location in memory for each prefix 206-220 as well as the starting location in memory for the first suffix 222.
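The prefix locations described above reduce to a fixed stride. The sketch below assumes a 16-bit prefix width and eight prefixes per block; the constant and function names are hypothetical.

```python
PREFIX_BITS = 16   # assumed fixed prefix width, in bits
PREFIX_COUNT = 8   # eight prefixes per block, as in instruction block 200


def prefix_start_bits(block_start_bit: int = 0) -> list[int]:
    """Starting bit of each prefix, measured from the given starting point."""
    # Each start depends only on the block start and a constant multiple,
    # so all eight values can be computed independently of one another.
    return [block_start_bit + i * PREFIX_BITS for i in range(PREFIX_COUNT)]


def first_suffix_start_bit(block_start_bit: int = 0) -> int:
    """The first suffix begins immediately after the last prefix."""
    return block_start_bit + PREFIX_COUNT * PREFIX_BITS
```

None of these positions depends on the content of any instruction, which is what lets the prefixes be located without decoding anything first.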
In some implementations, each prefix 206-220 includes data such as an instruction identifier, a length of the associated suffix, and data used by the instruction (e.g., data used by the variable-length suffix when executed). In other implementations, data used by the instruction may be stored in the suffix instead of (or in addition to) the prefix.
In some aspects, the example of
In some aspects, multiple adder circuits, identified by broken line 302, may include adders 312, 314, 316, and 318. For example, adder 312 adds the values of block 304 (2 bytes) and block 306 (0 bytes) to generate an output of 2 bytes, which is communicated to block 324. The output of adder 312 is also communicated to adders 316 and 318. Blocks 320, 322, 324, 326, and 328 represent an offset for each suffix within the suffix block of the instruction. In particular, block 320 is zero (the starting point of the suffix block). Block 322 is the same as block 304, which is the first offset. Blocks 324 and 326 represent the next two offsets in the suffix block. In some aspects, block 328 represents an offset to the next instruction block. In the example of
Adder 314 adds the value of block 308 (4 bytes) and block 310 (8 bytes) to generate an output of 12 bytes, which is communicated to adder 318. Adder 316 adds an output of block 308 (4 bytes) to the output of adder 312 (2 bytes) to generate an output of 6 bytes, which is communicated to block 326. Adder 318 adds the output of adder 312 (2 bytes) to the output of adder 314 (12 bytes) to generate an output of 14 bytes, which is communicated to block 328. The example of
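The adder arrangement described above can be modeled in software as follows. This is a sketch only: the function and variable names are hypothetical and reuse the reference numbers for readability, and ordinary integer addition stands in for the hardware adders.

```python
def suffix_offsets(len_304: int, len_306: int, len_308: int, len_310: int) -> dict[int, int]:
    """Model adders 312-318 and return the offsets held in blocks 320-328."""
    # First level: adders 312 and 314 have no dependency on each other.
    add_312 = len_304 + len_306
    add_314 = len_308 + len_310
    # Second level: adders 316 and 318 both consume the output of adder 312.
    add_316 = len_308 + add_312
    add_318 = add_312 + add_314
    return {
        320: 0,        # start of the suffix block
        322: len_304,  # first offset, same value as block 304
        324: add_312,
        326: add_316,
        328: add_318,  # offset to the next instruction block
    }
```

With the suffix lengths from the example (2, 0, 4, and 8 bytes), the model yields offsets of 0, 2, 2, 6, and 14 bytes. In this dataflow every offset is available after at most two additions in sequence, rather than three for a purely serial running sum.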
At 402, a device or system receives an instruction block for execution by a processor, such as a CPU. In some aspects, the instruction block is stored in a contiguous block of memory identified with a starting memory location and, in some situations, an ending memory location. At 404, process 400 identifies multiple fixed-length prefixes in the received instruction block. As discussed herein, an offset value of each of the multiple fixed-length prefixes can be determined based on the known offsets between prefixes due to the same fixed length of all prefixes.
At 406, process 400 identifies multiple variable-length suffixes in the instruction block. As discussed herein, each fixed-length prefix is associated with one of the variable-length suffixes. At 408, the length of each variable-length suffix is determined based on data contained in the associated fixed-length prefix. The lengths of the variable-length suffixes are used to determine offset values to the start of the next suffix. At 410, process 400 determines an offset value for each variable-length suffix using multiple adder circuits. As discussed herein, the multiple adder circuits perform parallel addition operations to process the suffix length data in the prefixes and determine the offset value to the start of the next suffix. At 412, process 400 executes the instructions based on the multiple variable-length suffixes.
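The operations of process 400 can be sketched end to end in software. The encoding here is an assumption made for illustration: eight little-endian 16-bit prefixes per block, with the low 4 bits of each prefix giving the suffix length in bytes and the remaining bits treated as an instruction identifier. A running sum stands in for the parallel adder circuits, and the function returns the decoded pairs rather than executing them.

```python
import struct
from itertools import accumulate

PREFIX_COUNT = 8   # assumed number of prefixes per block
PREFIX_BYTES = 2   # assumed 16-bit prefixes
LEN_MASK = 0xF     # assumed: low 4 bits hold the suffix length in bytes


def decode_block(memory: bytes, start: int = 0):
    """Return ([(instruction_id, suffix_bytes), ...], offset_of_next_block)."""
    # 404: the prefixes sit at fixed strides from the block start.
    prefixes = struct.unpack_from(f"<{PREFIX_COUNT}H", memory, start)
    # 408: each suffix length comes from its associated prefix.
    lengths = [p & LEN_MASK for p in prefixes]
    # 410: offsets are running sums of the lengths (software stand-in
    # for the adder circuits); the final sum is the total suffix size.
    offsets = [0, *accumulate(lengths)]
    suffix_base = start + PREFIX_COUNT * PREFIX_BYTES
    # 406: pair each prefix with its suffix bytes.
    instructions = []
    for prefix, offset, length in zip(prefixes, offsets, lengths):
        begin = suffix_base + offset
        instructions.append((prefix >> 4, memory[begin:begin + length]))
    # 412 (execution) is omitted; the pairs above would go to execution units.
    return instructions, suffix_base + offsets[-1]
```

For a block whose first four suffixes are 2, 0, 4, and 8 bytes long and whose remaining suffixes are empty, the sketch reports that the next block begins 30 bytes after the start: 16 bytes of prefixes plus 14 bytes of suffixes.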
In some aspects, the systems and methods described herein may create split register sets to ease routing. For example, multiple register sets can be created and associated with different ALUs. In a simple example, two register sets are created and labeled Register Set A and Register Set B. A first group of ALUs may access Register Set A, a second group of ALUs may access Register Set B, and a third group of ALUs may access both Register Set A and Register Set B. In particular implementations, any number of register sets may be created. In some aspects, the register sets may be implemented alongside register renaming for the physical registers associated with the microarchitecture. This may provide the benefits of reduced routing without exposing the added complexity to the ISA (Instruction Set Architecture).
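The two-set example can be expressed as a small access table. The register counts, group names, and helper functions below are hypothetical and serve only to illustrate which ALU groups could service an operation given where its source registers reside.

```python
# Hypothetical sizes: 16 physical registers per set.
REGISTER_SETS = {"A": range(0, 16), "B": range(16, 32)}

# Which register sets each group of ALUs is wired to.
ALU_GROUPS = {
    "group1": {"A"},
    "group2": {"B"},
    "group3": {"A", "B"},
}


def can_access(group: str, physical_register: int) -> bool:
    """True if the ALU group is wired to the set holding the register."""
    return any(physical_register in REGISTER_SETS[s] for s in ALU_GROUPS[group])


def eligible_groups(source_registers: list[int]) -> list[str]:
    """ALU groups that can read every source register of an operation."""
    return sorted(
        group
        for group in ALU_GROUPS
        if all(can_access(group, reg) for reg in source_registers)
    )
```

In this sketch an operation whose sources both live in Register Set A can go to the first or third group, while an operation that mixes sets can go only to the third group.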
As CPUs scale to larger sizes (e.g., a larger number of cores), they may need an increased number of physical registers to meet the demand for increased routing. The use of multiple register sets discussed above may ease potential routing issues caused by the larger CPUs.
In some aspects, the split between the two parts of the instruction can be made such that the decoder only needs to load the prefix. Thus, the data in the suffix can be loaded for the execution unit without ever being stored or parsed by the decoder. This approach supports a simplified decoder that stores a low, fixed number of bits, which reduces the size and power consumption of the decoder while improving performance of the decoder.
The example of
In a particular example, suppose each prefix is 16 bits (2 bytes) and there are eight prefixes in a block, so that the prefixes occupy 16 bytes. If the jump address is 16-byte aligned plus 8 bytes, the jump skips the first four 2-byte prefixes, which means the systems and methods described herein can decode and execute the remaining four instructions in the next block. Thus, the jump targets need to be memory aligned according to the number of instructions in that block.
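The arithmetic in this example can be checked with a short sketch. It assumes that the prefix region of the target block begins at a 16-byte-aligned address; the function name and the error handling are hypothetical.

```python
PREFIX_BYTES = 2          # 16-bit prefixes, as in the example
PREFIXES_PER_BLOCK = 8    # eight prefixes per block, as in the example
PREFIX_REGION_BYTES = PREFIX_BYTES * PREFIXES_PER_BLOCK  # 16 bytes


def instructions_from_jump(jump_address: int) -> tuple[int, int]:
    """Return (index of first prefix executed, number of instructions executed)."""
    # Assumption: the block's prefixes start at a 16-byte-aligned address,
    # so the low bits of the jump address select a prefix within the block.
    offset = jump_address % PREFIX_REGION_BYTES
    if offset % PREFIX_BYTES:
        raise ValueError("jump target must fall on a prefix boundary")
    first = offset // PREFIX_BYTES
    return first, PREFIXES_PER_BLOCK - first
```

For a jump address of `0x1008` (16-byte aligned plus 8 bytes), the sketch returns `(4, 4)`: execution starts at the fifth prefix and four instructions remain in the block.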
In the example of
Executing the instruction shown in
As shown in
Although the above-described systems and methods are described in the context of various examples of a parallel decode instruction set computer with variable-length instructions, the described systems, devices, apparatuses, and methods are non-limiting and may apply to other contexts, electronic devices, computing configurations, processor configurations, computing environments, and so forth.
Generally, the components, modules, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Some operations of the example methods may be described in the general context of executable instructions stored on computer-readable storage memory that is local and/or remote to a computer processing system, and implementations can include software applications, programs, functions, and the like. Alternatively, or in addition, any of the functionality described herein can be performed, at least in part, by one or more hardware logic components, such as, and without limitation, FPGAs, ASICs, ASSPs, SoCs, CPLDs, co-processors, context hubs, motion co-processors, sensor co-processors, or the like.
In the following, additional examples are described in accordance with one or more aspects of a parallel decode instruction set computer architecture with variable-length instructions.
A method implemented in association with a processor comprises receiving an instruction for execution by the processor, identifying, by a decoder, a plurality of fixed-length prefixes in the instruction, identifying, by the decoder, a plurality of variable-length suffixes in the instruction, wherein each of the plurality of fixed-length prefixes is associated with one of the variable-length suffixes, and executing the instruction based on the plurality of variable-length suffixes.
In addition to any of the methods described herein, each of the plurality of fixed-length prefixes may include data identifying a length of the associated variable-length suffix.
In addition to any of the methods described herein, each of the plurality of fixed-length prefixes may include data identifying an instruction identifier of the associated variable-length suffix.
In addition to any of the methods described herein, each of the plurality of fixed-length prefixes may include data used by the variable-length suffix when executed.
Any of the methods described herein may further comprise determining an offset value associated with each of the plurality of fixed-length prefixes based on a fixed length of each prefix.
Any of the methods described herein may further comprise determining an offset value associated with each of the plurality of variable-length suffixes based on results generated by a plurality of adder circuits.
In addition to any of the methods described herein, the plurality of adder circuits may process suffix length data from the plurality of fixed-length prefixes to calculate offset values associated with the plurality of variable-length suffixes.
Any of the methods described herein may further comprise creating a block that contains a portion of the plurality of fixed-length prefixes and a portion of the variable-length suffixes.
Any of the methods described herein may further comprise executing the instruction based on the block that contains a portion of the plurality of fixed-length prefixes and a portion of the variable-length suffixes.
Any of the methods described herein may further comprise identifying at least one fixed-length prefix that does not need execution, and identifying at least one variable-length suffix associated with the fixed-length prefix, wherein executing the instruction includes jumping over the at least one fixed-length prefix and the at least one variable-length suffix.
In addition to the above-described method, an apparatus comprises a processor, and a decoder configured to receive an instruction for execution by the processor, wherein the decoder performs operations comprising identifying a plurality of fixed-length prefixes in the instruction, and identifying a plurality of variable-length suffixes in the instruction, wherein each of the plurality of fixed-length prefixes is associated with one of the variable-length suffixes, and wherein the processor executes the instruction based on the plurality of variable-length suffixes.
In addition to any of the apparatuses described herein, each of the plurality of fixed-length prefixes may include data identifying a length of the associated variable-length suffix.
In addition to any of the apparatuses described herein, each of the plurality of fixed-length prefixes may include data identifying an instruction identifier of the associated variable-length suffix.
In addition to any of the apparatuses described herein, each of the plurality of fixed-length prefixes may include data used by the variable-length suffix when executed.
In addition to any of the apparatuses described herein, the decoder may be further configured to perform operations that determine an offset value associated with each of the plurality of fixed-length prefixes based on a fixed length of each prefix.
In addition to any of the apparatuses described herein, the apparatus may further comprise a plurality of adder circuits, wherein the decoder is further configured to perform operations that determine an offset value associated with each of the plurality of variable-length suffixes based on results generated by a plurality of adder circuits.
In addition to any of the apparatuses described herein, the plurality of adder circuits may be configured to process suffix length data from the plurality of fixed-length prefixes to calculate offset values associated with the plurality of variable-length suffixes.
In addition to any of the apparatuses described herein, the decoder may be configured to further perform operations for creating a block that contains a portion of the plurality of fixed-length prefixes and a portion of the variable-length suffixes.
In addition to any of the methods or apparatuses described herein, the decoder may be configured to further perform operations that execute the instruction based on the block that contains a portion of the plurality of fixed-length prefixes and a portion of the variable-length suffixes.
In addition to any of the apparatuses described herein, the decoder may be further configured to perform operations that: identify at least one fixed-length prefix that does not need execution, and identify at least one variable-length suffix associated with the fixed-length prefix, wherein executing the instruction includes jumping over the at least one fixed-length prefix and the at least one variable-length suffix.
Although aspects of the described systems and methods have been described in language specific to features and/or methods, the subject of the appended claims is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as example implementations of the described techniques, and other equivalent features and methods are intended to be within the scope of the appended claims. Further, various different aspects are described, and it is to be appreciated that each described aspect can be implemented independently or in connection with one or more other described aspects.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/013934 | 1/26/2022 | WO | |