The present disclosure generally relates to processor instruction set architecture, which is closely related to definitions of processor instruction set, processor architecture design and implementation of micro-architecture. More particularly, the present disclosure relates to a processor having polymorphic instruction set architecture that can be dynamically reconfigured after tape-out.
Recently, the Internet, cloud computing and the Internet of Things (IoT) have been undergoing rapid growth. Ubiquitous mobile devices, RFIDs and wireless sensors are producing information every second, and Internet services for billions of users are exchanging a huge amount of information. Meanwhile, users' demands on the real-time performance and effectiveness of information processing have increased. For example, in an online video on demand system, users require not only high definition pictures, but also decoding and displaying rates of at least 30 fps. Hence, it is desired to study how to process massive information quickly and efficiently, starting from algorithm characteristic analysis.
In general, the processing of massive information has the following characteristics. First, the amount of data is huge. The amount of data generated by high definition videos, broadband communications and high-accuracy sensors has been increasing by a factor of 5 to 10 every year. Second, the amount of computation is huge. The computational complexity for information processing is typically the k-th power of the amount of data n, i.e., O(n^k). For example, the bubble sorting algorithm has a computational complexity of O(n^2) and the FFT algorithm has a computational complexity of O(n log n). As the amount of data increases, the amount of computation required for information processing increases significantly. Third, the algorithms for processing massive information are relatively regular. For example, some kernel algorithms, such as one dimensional (1D)/two dimensional (2D) filtering, FFT transformation and adaptive filtering, can be represented by simple mathematical equations, without complicated logics. Fourth, the processing of massive information has highly localized data. There is no correlation between local data blocks but there is a high correlation in each local data block. For example, in a filtering algorithm, the computation result is only dependent on data within the range of a filtering template and the data within the range of the template needs to be computed several times to obtain the final result. In a video encoding/decoding algorithm, complicated operations need to be applied to one or more (neighboring) blocks of data to obtain the final result, with no data correlation between macro blocks away from each other. Fifth, the modes of the processing algorithms remain substantially the same, while the details of the algorithms keep on evolving. For example, the video coding standard evolves from H.263 to H.264, and the communication protocol evolves from 2G to 3G and then to Long Term Evolution (LTE).
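The second characteristic above, the growth of computation with the amount of data, can be illustrated with a minimal sketch. The operation-count functions `bubble_sort_ops` and `fft_ops` are hypothetical simplifications that keep only the leading term of each complexity and ignore constant factors; the data sizes are chosen purely for illustration.

```python
import math


def bubble_sort_ops(n: int) -> float:
    """Leading-term operation count for an O(n^2) algorithm."""
    return float(n * n)


def fft_ops(n: int) -> float:
    """Leading-term operation count for an O(n log n) algorithm."""
    return n * math.log2(n)


n = 1024          # illustrative starting amount of data
growth = 8        # illustrative increase in the amount of data

# Ratio of work after the data grows by the given factor.
quadratic_ratio = bubble_sort_ops(growth * n) / bubble_sort_ops(n)
fft_ratio = fft_ops(growth * n) / fft_ops(n)

print(f"data x{growth}: O(n^2) work x{quadratic_ratio:.1f}, "
      f"O(n log n) work x{fft_ratio:.1f}")
```

Under these assumptions, an 8-fold increase in data multiplies the quadratic workload by 64 and the FFT-like workload by slightly more than 10, so the required computation grows faster than the data in both cases.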
The processing of massive information has its own performance requirements and application characteristics. Since the processing of massive information involves a huge amount of data and a huge amount of computation, most of which must be carried out in real time, the computational capabilities of the conventional scalar or superscalar processor fall far short of such requirements. Further, due to limitations on power consumption and volume, it is impossible to implement a system for processing massive information simply by stacking up scalar processors. On the other hand, ASIC chips for processing massive information are costly and take a long time to design and develop, and their updates are much slower than the evolution of the processing algorithms for massive information, so they cannot keep pace with the development of the processing systems for massive information. Thus, it is currently a trend in processing chips for massive information to modify the conventional scalar or superscalar processor based on the characteristics of the processing of massive information, or even to design processors in a new field.
The term “instruction” refers to symbols defined by designers and understandable by processors. A programmer can specify actions of a processor at different time instants by sending to the processor different instruction sequences. A set of all instructions understandable by the processor can be referred to as an instruction set of the processor. The programmer can develop various algorithms by utilizing instructions in the instruction set.
A processor instruction set is typically defined and there is a one-to-one correspondence between instruction actions and processor implementations. For example, the ARMv4T instruction set includes a computation instruction “ADD R0, R1, R2”, which means adding the values in the registers R1 and R2 and then writing into R0.
Once the processor instruction set has been defined, the programmer cannot add instructions to the instruction set, or redefine actions for the instructions. Thus, the instructions in the processor instruction set are typically for general purpose to ensure the flexibility in programming. However, such a general purpose processor instruction set cannot support some special applications efficiently. For example, in video coding, it is often required to perform 8-bit data calculations and it would be very inefficient to use, e.g., the 32-bit addition instruction “ADD R0, R1, R2” in an ARM processor for such calculations. Hence, various processors generally extend their instruction sets for special applications, such as MMX instructions for video image processing in the x86 instruction set and NEON instructions in the ARM instruction set.
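The mismatch between 8-bit data and a 32-bit addition can be seen in a minimal sketch. The helpers `add32` and `packed_add8` are hypothetical names introduced here for illustration; they model only the arithmetic and do not correspond to any actual ARM, MMX or NEON instruction.

```python
MASK32 = 0xFFFFFFFF


def add32(a: int, b: int) -> int:
    """Plain 32-bit addition: carries propagate across byte boundaries."""
    return (a + b) & MASK32


def packed_add8(a: int, b: int) -> int:
    """Add four independent 8-bit lanes packed into 32-bit words.

    Each lane wraps around on overflow and never disturbs its neighbor.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift
    return result


# Four 8-bit values packed into each 32-bit word (illustrative data).
a = 0x01FF0280
b = 0x01010180

plain = add32(a, b)         # carries leak from one byte into the next
packed = packed_add8(a, b)  # each byte is summed on its own

print(f"32-bit add: {plain:#010x}")
print(f"packed add: {packed:#010x}")
```

With a plain 32-bit addition, either only one 8-bit value is processed per instruction, or several packed values corrupt one another through carries; a lane-wise operation of the kind provided by extended instruction sets handles all four values at once.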
Such extended instructions are characterized in that they are very efficient for a certain type of application, but are very inefficient for other applications. Accordingly, once the processor has been designed, its application field is decided and it is difficult for it to be applied to other application fields. Programmers cannot refine or optimize the processor based on algorithm characteristics in other application fields.
Some patents have been proposed regarding how to achieve reconfigurable computation. For example, US Patents No. US2005/0027970A1 (Reconfigurable Instruction Set Computing) and No. US2005/0169550 A1 (Video Processing System with Reconfigurable Instructions) adopt a CPU+FPGA-like structure. A user uses a uniform high-level language for development and a compiler partitions a program into a part to be executed by the CPU and a part to be executed by the FPGA. These solutions are characterized by their capabilities of increasing program efficiency by virtue of the flexibility of FPGA. However, the excessively flexible configuration of FPGA makes the chip cost-inefficient. US Patent No. US2004/0019765A1 (Pipelined Reconfigurable Dynamic Instruction Set Processor) provides a processor architecture of RISC processor+configurable array processor elements. In this structure, a number of array processor elements are logically divided into a number of pipeline stages and the actions of each pipeline stage are dynamically configured by the RISC processor. US Patent No. US2006/0211387 A1 (Multistandard SDR Architecture Using Context-Based Operation Reconfigurable Instruction Set Processor) defines a processor architecture of configuration unit+co-processors, where each co-processor includes a state control unit and a data path and is responsible for some similar processor tasks.
It is an object of the present disclosure to provide a processor having polymorphic instruction set architecture, capable of solving the problem that the processor instruction set cannot be redefined after tape-out of the processor.
In order to solve the above problem, a processor having polymorphic instruction set architecture is provided. The processor comprises a scalar processing unit, at least one polymorphic instruction processing unit, at least one multi-granularity parallel memory and a DMA controller. The polymorphic instruction processing unit comprises at least one functional unit. The polymorphic instruction processing unit is configured to interpret and execute a polymorphic instruction and the functional unit is configured to perform specific data operation tasks. The polymorphic instruction is a sequence of a plurality of microcode records to be executed successively. The microcode records indicate actions to be performed by the respective functional units within a particular clock period. The scalar processing unit is configured to invoke the polymorphic instruction and inquire an execution state of the polymorphic instruction. The DMA controller is configured to transmit configuration information for the polymorphic instruction and transmit data required by the polymorphic instruction to the multi-granularity parallel memory.
In an embodiment of the present disclosure, the polymorphic instruction processing unit is configured to receive the polymorphic instruction passively from the DMA controller to be invoked by the scalar processing unit.
In an embodiment of the present disclosure, the scalar processing unit is configured to control the polymorphic instruction processing unit via a first control path and the DMA controller via a second control path.
In an embodiment of the present disclosure, the polymorphic instruction processing unit comprises: a microcode memory configured to store the polymorphic instruction; and a microcode control unit configured to receive a control request from the scalar processing unit via the first control path and act accordingly.
In an embodiment of the present disclosure, the microcode control unit comprises a configuration register configured to store parameters required for the polymorphic instruction processing unit to operate and an operation state of the polymorphic instruction processing unit.
In an embodiment of the present disclosure, the control request from the scalar processing unit comprises activating or inquiring the polymorphic instruction processing unit and/or reading/writing the configuration register of the polymorphic instruction processing unit.
In an embodiment of the present disclosure, the polymorphic instruction processing unit further comprises a transmission control unit, wherein the functional unit has a plurality of data input/output ports and exchanges data via the transmission control unit.
In an embodiment of the present disclosure, the functional unit is configured to perform data loading/storing operations and read/write data from/to the multi-granularity parallel memory via a first internal bus, while the microcode memory is connected to the first internal bus as a slave device to receive the microcode records passively from outside.
In an embodiment of the present disclosure, the microcode control unit is configured to read and execute the microcode records of the polymorphic instruction in sequence.
In an embodiment of the present disclosure, each line in the microcode memory stores one microcode record. When the scalar processing unit invokes the polymorphic instruction, only the line number of the line in the microcode memory where the starting microcode record associated with the polymorphic instruction is located needs to be specified.
With the processor having the polymorphic instruction set architecture according to the present disclosure, programmers can redefine the processor instruction set based on algorithm characteristics of applications after tape-out of the processor. The redefined processor instruction set architecture is more suitable for the algorithm characteristics of the applications, so as to improve the processing performance of the processor for these applications. The redefining operation does not need to modify hardware of the processor or the software tool chain including the compiler and linker. However, for different instruction definitions, the instruction set architecture may have different behaviors.
In the following, the present disclosure will be further explained with reference to the figures and specific embodiments so that the objects, solutions and advantages of the present disclosure become more apparent.
According to the present disclosure, a processor having polymorphic instruction set architecture that can be dynamically reconfigured after tape-out is provided.
A polymorphic instruction is a sequence of a plurality of microcode records to be executed successively. A polymorphic instruction set is a set of polymorphic instructions. The microcode records indicate actions to be performed by the respective functional units within a particular clock period, including e.g., addition operation, data loading operation, or no operation.
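The definition above can be pictured as a small data structure. The following is a minimal sketch, not the disclosed encoding: the class `MicrocodeRecord`, the unit names `loader` and `adder`, and the action strings are all hypothetical and chosen only to mirror the examples of addition, data loading and no operation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MicrocodeRecord:
    """Actions of the functional units within one clock period."""
    actions: Dict[str, str] = field(default_factory=dict)

    def action_for(self, unit: str) -> str:
        # A unit with no entry in the record performs no operation.
        return self.actions.get(unit, "nop")


# Hypothetical functional units of one processing unit.
UNITS = ["loader", "adder"]

# A polymorphic instruction: records executed successively, one per clock.
instruction: List[MicrocodeRecord] = [
    MicrocodeRecord({"loader": "load"}),
    MicrocodeRecord({"loader": "load", "adder": "add"}),
    MicrocodeRecord({"adder": "add"}),
]


def schedule(records: List[MicrocodeRecord], units: List[str]) -> List[List[str]]:
    """Expand the records into a per-clock table of unit actions."""
    return [[record.action_for(unit) for unit in units] for record in records]


for clock, row in enumerate(schedule(instruction, UNITS)):
    print(clock, dict(zip(UNITS, row)))
```

In this sketch each record names what every functional unit does in one clock period, so several units can be active in the same period while others idle.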
Here, the polymorphic instruction processing unit 100 is configured to interpret and execute a polymorphic instruction and the functional unit is configured to perform specific data operation tasks. The scalar processing unit 101 is configured to invoke the polymorphic instruction and inquire an execution state of the polymorphic instruction. The DMA controller 103 is configured to transmit configuration information for the polymorphic instruction and transmit data required by the polymorphic instruction to the multi-granularity parallel memory 102.
The scalar processing unit 101 is configured to control the polymorphic instruction processing unit 100 via a first control path 104 and the DMA controller 103 via a second control path 105. The DMA controller 103 transmits the configuration information to the polymorphic instruction processing unit 100 via a first internal bus 106, and transmits the data to the multi-granularity parallel memory 102 via a second internal bus 107. The DMA controller 103 reads/writes data from/to outside via a bus 108. The polymorphic instruction processing unit 100 reads/writes data from/to the multi-granularity parallel memory 102 via the second internal bus 107.
The scalar processing unit 101 can be an RISC or a DSP and has a first control path 104 for: 1) activating the polymorphic instruction processing unit 100; 2) inquiring an execution state of the polymorphic instruction processing unit 100; and 3) reading/writing a configuration register of the polymorphic instruction processing unit 100 (which will be described hereinafter).
As the multi-granularity parallel memory 102, the multi-granularity parallel memory disclosed in CN Patent Application No. 201110460585.1 (“Multi-granularity Parallel Storage System and Memory”), which can support parallel reading/writing of data from matrices of different data types in rows/columns, can be used.
The second internal bus 107 has the polymorphic instruction processing unit 100 as a master device and the multi-granularity parallel memory 102 as a slave device. The DMA controller 103 and the polymorphic instruction processing unit 100 can read/write data from/to the multi-granularity parallel memory 102 via the second internal bus 107.
The first internal bus 106 has the DMA controller 103 as a master device and the polymorphic instruction processing unit 100 as a slave device. The DMA controller 103 can write the polymorphic instruction into the polymorphic instruction processing unit 100 via the first internal bus 106. The polymorphic instruction is stored in an external storage connected to the bus 108.
Polymorphic Instruction Processing Unit
The polymorphic instruction processing unit 100 is configured to receive the polymorphic instruction passively from the DMA controller 103 to be invoked by the scalar processing unit 101.
The polymorphic instruction processing unit 100 includes a microcode memory 200, a microcode control unit 201, at least one functional unit 202 and a transmission control unit 203. The microcode memory 200 is configured to store the polymorphic instruction. The microcode control unit 201 is configured to receive a control request from the scalar processing unit 101 via the first control path 104 and act accordingly. The microcode control unit 201 includes a configuration register 207 configured to store parameters required for the polymorphic instruction processing unit 100 to operate and an operation state of the polymorphic instruction processing unit 100, e.g., to specify the functional unit 202 for executing the current polymorphic instruction, specify a starting address of the required data and the total data length, and indicate whether the polymorphic instruction processing unit 100 is currently idle or not.
The request includes requests to:
1) activate the polymorphic instruction processing unit 100: the microcode control unit 201 reads the microcode records 300 from the microcode memory 200 and generates corresponding control information for transmission to the functional unit 202 and the transmission control unit 203;
2) inquire the polymorphic instruction processing unit 100: the microcode control unit 201 returns the execution state of the current polymorphic instruction: completed or idle; and
3) read/write the configuration register 207 of the polymorphic instruction processing unit 100: the microcode control unit 201 writes specified data into the specified configuration register 207, or returns data from the specified configuration register 207.
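The three kinds of requests listed above can be sketched as a small behavioral model. This is an illustration under stated assumptions rather than the disclosed hardware: the class and method names, the register name `data_length`, the state strings `"busy"` and `"idle"`, and the `last` flag marking the final record of an instruction are all hypothetical.

```python
from typing import Any, Dict, List


class MicrocodeControlUnit:
    """Behavioral model of request handling, one microcode record per clock."""

    def __init__(self, microcode_memory: List[Dict[str, Any]]) -> None:
        self.microcode_memory = microcode_memory
        self.config_registers: Dict[str, int] = {}
        self.idle = True
        self.line = 0
        self.issued: List[str] = []  # control info sent to the other units

    def activate(self, start_line: int) -> None:
        """Request 1: start executing from the given microcode line."""
        self.idle = False
        self.line = start_line

    def inquire(self) -> str:
        """Request 2: report the execution state."""
        return "idle" if self.idle else "busy"

    def write_register(self, name: str, value: int) -> None:
        """Request 3 (write): store a parameter."""
        self.config_registers[name] = value

    def read_register(self, name: str) -> int:
        """Request 3 (read): return a stored parameter."""
        return self.config_registers[name]

    def clock(self) -> None:
        """Advance one clock period: read one record and issue its control info."""
        if self.idle:
            return
        record = self.microcode_memory[self.line]
        self.issued.append(record["control"])
        if record["last"]:      # assumed end-of-instruction marker
            self.idle = True
        else:
            self.line += 1


memory = [
    {"control": "load", "last": False},
    {"control": "add", "last": False},
    {"control": "store", "last": True},
]
mcu = MicrocodeControlUnit(memory)
mcu.write_register("data_length", 64)
mcu.activate(0)
while mcu.inquire() == "busy":   # the scalar unit polls the state
    mcu.clock()
print(mcu.issued, mcu.inquire(), mcu.read_register("data_length"))
```

The polling loop stands in for the scalar processing unit 101, which activates the unit and then inquires until the instruction has finished.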
Depending on application requirements, the polymorphic instruction processing unit 100 can be designed with one or more different functional units 202. The functional unit 202 is responsible for performing specific data operation tasks, such as addition operations or data loading/storing operations. The functional unit 202 typically has a number of data input/output ports and exchanges data via the transmission control unit 203. For example, after an adder unit has completed an addition operation, it sends the addition result to the transmission control unit 203, which then sends the addition result to a multiplier unit for multiplication.
The transmission control unit 203 is connected to the data input/output ports of all functional units 202, receives source and destination information for data at every time instant from the microcode control unit 201 via the interface 206, and sends the data from the source to the destination.
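The routing role described above can be sketched as follows, using the adder-to-multiplier example. The class `TransmissionControlUnit`, its `route` method and the single input and output port per unit are simplifying assumptions made for illustration; the disclosed functional units have a number of ports.

```python
from typing import Dict, List, Optional, Tuple


class TransmissionControlUnit:
    """Moves data from output ports to input ports as directed each clock."""

    def __init__(self, units: List[str]) -> None:
        # One output and one input port per unit (a simplification).
        self.output_ports: Dict[str, Optional[int]] = {u: None for u in units}
        self.input_ports: Dict[str, Optional[int]] = {u: None for u in units}

    def route(self, transfers: List[Tuple[str, str]]) -> None:
        """Apply one clock period's (source, destination) pairs."""
        for source, destination in transfers:
            self.input_ports[destination] = self.output_ports[source]


tcu = TransmissionControlUnit(["adder", "multiplier"])

# The adder completes an addition and presents the result on its output.
tcu.output_ports["adder"] = 3 + 4

# Source and destination information supplied for this clock period.
tcu.route([("adder", "multiplier")])

# The multiplier consumes the forwarded value.
product = tcu.input_ports["multiplier"] * 5
print(product)
```

Because the source and destination pairs are supplied anew at every time instant, the same physical connections can serve different data flows from one clock period to the next.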
The bus 107 is the first internal bus 107 in
Definition and Invocation of Polymorphic Instruction
As described above, the “polymorphic instruction” as used herein refers to a sequence of microcode records 300 to be executed successively and having specific functions. As shown in
Depending on algorithm requirements, a programmer can define the behaviors of the polymorphic instruction and the starting line number of the polymorphic instruction in the microcode memory flexibly using the microcode records 300.
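Defining an instruction at a chosen starting line and later redefining it can be sketched with a toy microcode memory. Everything in this sketch is hypothetical: the record encoding `(operation, operand, is_last)`, the three operations, the single accumulator value and the `is_last` end marker are assumptions made for illustration and are far simpler than the parallel functional units of the disclosure.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One record: (operation name, operand, is_last_record). Hypothetical encoding.
Record = Tuple[str, int, bool]

OPERATIONS: Dict[str, Callable[[int, int], int]] = {
    "ADD": lambda value, operand: value + operand,
    "MUL": lambda value, operand: value * operand,
    "NOP": lambda value, operand: value,
}


class MicrocodeMemory:
    """Toy microcode memory: one record per line."""

    def __init__(self, lines: int) -> None:
        self.lines: List[Optional[Record]] = [None] * lines

    def define(self, start_line: int, records: List[Record]) -> None:
        """Write a polymorphic instruction beginning at start_line."""
        for offset, record in enumerate(records):
            self.lines[start_line + offset] = record

    def invoke(self, start_line: int, value: int) -> int:
        """Execute records in sequence from start_line until one marked last."""
        line = start_line
        while True:
            record = self.lines[line]
            if record is None:
                raise ValueError(f"no microcode record at line {line}")
            operation, operand, is_last = record
            value = OPERATIONS[operation](value, operand)
            if is_last:
                return value
            line += 1


memory = MicrocodeMemory(16)
memory.define(0, [("MUL", 3, False), ("ADD", 1, True)])   # x * 3 + 1
memory.define(8, [("ADD", 1, False), ("MUL", 3, True)])   # (x + 1) * 3

first = memory.invoke(0, 5)
second = memory.invoke(8, 5)

# Redefining line 0 changes what the same invocation does.
memory.define(0, [("ADD", 10, True)])                     # x + 10
redefined = memory.invoke(0, 5)
print(first, second, redefined)
```

The caller supplies only a starting line number, so rewriting the records at that line changes the behavior of the instruction without any change to the invoking program.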
Embodiment of Processor Having Polymorphic Instruction Set Architecture
In the following, an exemplary embodiment of the polymorphic instruction set architecture will be described. This embodiment is only an exemplary implementation of the present disclosure and the present disclosure is not limited thereto.
This embodiment relates to a processor having polymorphic instruction set architecture for data-intensive applications.
IALU, FALU, IMAC, FMAC, SHU0 and SHU1 have similar interfaces and are collectively referred to as a computing unit 500 in this embodiment.
BIU0, BIU1 and BIU2 are collectively referred to as a bus interface unit 501, whose internal structure is shown in
M is a register file having a bit width of 512 bits and having four writing ports 800, four reading ports 802 and corresponding memory bodies 801.
In the polymorphic instruction set architecture, the calculation results from the respective functional units can be transmitted directly to other functional units for cascaded operations. In this embodiment, there is no need to provide a direct data transmission path between each pair of functional units. For example, FMAC mainly performs floating point multiplying and accumulating operations and its operation results do not need to be transmitted to the fixed point calculation units IALU or IMAC. The reduced number of the data transmission paths is advantageous in that the connecting lines among the functional units can be reduced, thereby reducing the chip area and the chip cost.
The transmission control unit 203 corresponding to
The second layer is composed of ACU, M, SHU0, SHU1 and BIU0-BIU2, as shown in
In order to generate control signals for the 29 multiplexers in the transmission control unit 203, the functional units are first grouped and numbered. As shown in
The microcode control unit 201 transmits the destination information of all the functional units in the microcode record 300 to the transmission control unit 203, which then generates the control signals for the 29 multiplexers based on the destination information.
The foregoing description of the embodiments illustrates the objects, solutions and advantages of the present disclosure. It will be appreciated that the foregoing description refers to specific embodiments of the present disclosure, and should not be construed as limiting the present disclosure. Any changes, substitutions, modifications and the like within the spirit and principle of the present disclosure shall fall into the scope of the present disclosure.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2013/074426 | 4/19/2013 | WO | 00 |