BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates the prior art data flow of a CPU.
FIG. 2 shows the principle of the invention: compressing instructions and data on their way to a CPU.
FIG. 3 illustrates the basic concept of compressing a group of instructions into a fixed or variable number of bits.
FIG. 4 illustrates the compressed instruction data followed by the bit rate information and the starting address of each group of instructions.
FIG. 5 shows the procedure of decoding the starting location of each group of compressed instructions by calculating the bit rate of each group and the starting address of each group.
FIG. 6 illustrates the procedure of decoding a program and filling the file register for CPU execution.
FIG. 7 illustrates a block diagram of compressing and decompressing instructions with an address mapping unit.
FIG. 8 illustrates a block diagram of the compression engine with the starting address of a group of instructions and output control signals.
FIG. 9 illustrates how the control signals and the data/address bus interface to the storage device.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Since the invention of the transistor, the performance of semiconductor technology has roughly doubled every 18 months. This has made a wide range of applications feasible, including the internet, wireless LAN, and digital image, audio, and video, and has created huge markets including mobile phones, digital cameras, video recorders, 3G mobile phones, VCD, DVD, set-top boxes, digital TV, etc. Some electronic devices are implemented in dedicated hardware; others are realized by CPU or DSP engines executing software or firmware that is completely or partially embedded inside the CPU/DSP engine. Driven by the momentum of semiconductor technology migration, coupled with short time to market, CPU and DSP solutions have become increasingly popular in the competitive market.
Different applications require programs of variable length, which in some cases must be partitioned so that part of them is stored in an on-chip "cache memory," since transferring instructions from off-chip to the CPU causes long delay times and consumes high power. Therefore, most CPUs have a storage device called a cache memory for buffering the execution code of the program and the data. The cache used to store the program, which comprises instruction sets, is named the "Instruction Cache" or simply the "I-Cache," while the cache storing the data is called the "Data Cache" or "D-Cache." FIG. 1 shows the prior art principle of how a CPU executes a program. A program comprises a certain number of "instruction" sets 16 and data sets 17, which are the sources and codes of the CPU execution. An "instruction" tells the CPU what to work on. The instructions of a program are saved in an on-chip program memory, or so-called I-Cache memory 11, while the corresponding data that the program needs to execute are saved in an on-chip data memory, or so-called D-Cache memory 12. The cache memory may be organized as a large bank with heavy capacitive loading and relatively slow access compared to the execution speed of the CPU execution logic. Therefore, another temporary buffer, a so-named "File Register" 13, 14, most likely of smaller size, for example 32×32 (32-bit-wide instructions or data times 32 rows), is placed between the CPU execution path 15 and the cache memory. The CPU execution path has basic ALU functions such as AND, NAND, OR, NOR, XOR, Shift, Round, Mod, etc.; some also have multiplication and data packing and aligning features.
Since the program memory and data memory account for a high percentage of the die area of a CPU in most applications, this invention reduces the required density of the program and/or data memory by compressing the CPU instructions and data. The key procedure of this invention is illustrated in FIG. 2. The instructions and/or data are compressed 26, 27 before being stored into the program memory 21 and data memory 22. When the scheduled time for executing the program or accessing the data arrives, the compressed instructions and/or data are decompressed 261, 271 and fed to the file registers 23, 24, which are smaller temporary buffers next to the execution unit 25 of the CPU. The instructions or data can also be compressed by another machine before being fed into the CPU engine. If the incoming instructions or data have already been compressed, they can bypass the compression step and be fed directly to the program/data memory, that is, the I-Cache and D-Cache.
In this invention, the program of instruction sets is compressed before being saved to the cache memory. Some instructions are simple, some are complex. Simple instructions can be compressed in a pipelined manner, while some instructions depend on the results of other instructions and require more computing time to execute. Decompressing the compressed program saved in the cache memory likewise takes a variable amount of computing time for different instructions. The more instruction sets are put together as one compression unit, the higher the compression rate that can be reached. FIG. 3 depicts the concept of compressing fixed-length groups of instructions 31, 32, 33 which together form a program 34. A group of a predetermined number of instructions can be compressed into a fixed-length code 35, 36, or each group can have a variable length 37, 38, 39. A group of instruction sets in this invention comprises a number of instruction sets ranging widely from 16 instructions to a couple of thousand, depending on the targeted application.
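The grouping described above can be sketched in software. This is an illustrative model only; the group size of 16 and the helper name are assumptions for the example, not part of the invention:

```python
def partition_program(instructions, group_size=16):
    """Split a list of instruction words into fixed-size groups,
    the unit of compression described above. The last group may
    be partial if the program length is not a multiple."""
    return [instructions[i:i + group_size]
            for i in range(0, len(instructions), group_size)]

# 40 dummy instruction words -> groups of 16, 16, and 8
groups = partition_program(list(range(40)))
```

Each group is then compressed independently, so any group can later be located and decompressed without unpacking the whole program.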
FIG. 4 illustrates the method of organizing the compressed instructions. The compressed instruction data 41, with a variable bit rate for each group, for example a couple of groups of instructions 42, 43, 44, are saved into a storage device starting from a predetermined location. To accelerate access and decompression, a predetermined counter calculates the bit rate of each group 45, 46 and saves it in a temporary register for recording and tracking the starting address in the storage device of each group of instructions. When accessing compressed instructions saved in the corresponding location, the instruction at the starting address 47, 48, 49 of a group of instructions is extracted and decompressed first and used as a reference for reconstructing the rest of the instructions within the group.
To save hardware, a predetermined number of groups of instructions shares one starting address in the storage device that saves the compressed instructions. Each group of compressed instructions can have a predetermined length of code to represent its bit rate. For example, an 8-bit code 45, 46 can represent 2 times compression (=2048 bits) plus or minus one of (128, 64, 32, 16, 8, 4, 2, 1) bits with a predetermined definition. The code representing the relative length of each group thus uses fewer bits than a complete code representing the full address in the storage device, which also saves hardware in implementation. In some applications, when a full address representing the location of each group of compressed instructions is not critical, applying a code to represent the address of each group of instructions is applicable. The starting address is saved into a predetermined location within the storage device that saves the compressed instruction data as well.
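The saving from recording a short relative-length code per group, rather than a full storage address for every group, can be illustrated with a rough bit count. The 8-bit length code, 20-bit full address, and eight groups sharing one anchor address below are assumed figures for the example:

```python
def index_table_bits(num_groups, code_bits=8, addr_bits=20,
                     groups_per_anchor=8):
    """Compare index-table sizes: one full anchor address per block
    of groups plus a short length code per group, versus a full
    address for every group."""
    anchors = -(-num_groups // groups_per_anchor)   # ceiling division
    relative = anchors * addr_bits + num_groups * code_bits
    full = num_groups * addr_bits
    return relative, full

rel, full = index_table_bits(64)
# 8 anchors * 20 + 64 codes * 8 = 672 bits vs 64 * 20 = 1280 bits
```

Under these assumptions the relative scheme roughly halves the bookkeeping storage, at the cost of an add chain to recover each group's absolute address.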
A new instruction of the program is compared to the previous instructions to decide whether a match occurs. If a match occurs, the corresponding previous instruction is used to represent the current instruction. If there is no match, the current instruction can still be compressed using information within itself by compression methods including, but not limited to, run-length coding, entropy coding, etc. A dictionary-like buffer with a predetermined number of bits is designed to store the previous instructions. To achieve a higher compression rate, the previous instructions are compressed before being saved to the buffer and decompressed again before being output for comparison with the new instruction. Theoretically, the larger the buffer, the more instructions it can save and the higher the probability of finding a matching instruction in it. So, in most applications there is a tradeoff in determining the size of the buffer storing previous instructions.
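A minimal software sketch of the matching scheme above, assuming a small sliding dictionary of recent instruction words. The fallback coder (run-length or entropy coding) is reduced here to emitting the literal word, and all names are hypothetical:

```python
from collections import deque

def compress(instructions, dict_size=8):
    """Replace each instruction that matches a recent one with a
    short reference into a sliding dictionary; emit literals otherwise."""
    window = deque(maxlen=dict_size)   # dictionary-like buffer
    out = []
    for word in instructions:
        if word in window:
            out.append(('ref', list(window).index(word)))  # match found
        else:
            out.append(('lit', word))  # no match: fallback coder elided
        window.append(word)
    return out

def decompress(tokens, dict_size=8):
    """Mirror the encoder's dictionary to recover the instructions."""
    window = deque(maxlen=dict_size)
    out = []
    for kind, val in tokens:
        word = list(window)[val] if kind == 'ref' else val
        out.append(word)
        window.append(word)
    return out
```

Because real instruction streams repeat opcodes and operand patterns heavily, even a small dictionary finds frequent matches; the `dict_size` parameter is the tradeoff knob the paragraph above describes.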
FIG. 5 illustrates the procedure of decoding the starting address of each group of compressed instructions saved in the storage device. The bit rate decoders 53, 54 calculate 55 the length of each group 51, 52 of instructions and add 57 it to the starting address 56, 58, 59 of a couple of groups of compressed instructions, yielding the exact location of the starting, or first, instruction of a group of compressed instructions. In most hardware implementations, including ICs, it takes about 1 or 2 clock cycles to decode and calculate the starting location of any group of compressed instructions and access the starting instruction for reference by the other instructions.
In some applications of this invention of I-Cache and/or D-Cache memory compression, a program or data set can be compressed by the built-in on-chip compressor, or by another off-chip CPU engine. Either way, the compressed program and data set can be saved in the cache memory and decompressed by an on-chip decompression unit. Some instructions randomly access other instructions or locations, for instance "Jump" or "Go To". To achieve higher performance, a buffer of predetermined depth, or so-named FIFO (first in, first out), for example 32×16 bits, is designed to temporarily store the instructions and send them to the compressor for compression. To allow random access to the instructions and quick decoding of the compressed instructions, the compressor compresses the instructions with each group of instructions having a predetermined length, and the compressed instructions are buffered before being stored to the cache memory.
FIG. 6 shows the procedure of decompressing the instructions and filling the "File Register" for execution. The compressed instructions stored in the I-Cache memory 61 are input to the decompressing unit 601, which includes a buffer 62 of predetermined size, for instance 32×16 bits, a decompressor 63, and a buffer 65, 66 of predetermined size for the recovered instructions 64, or so-named FIFO. The recovered instructions are fed into the "File Register" 67, which is a temporary buffer before the execution path, or so-named ALU (Arithmetic and Logic Unit) 68. Some instructions wait for the result of a previous instruction and combine it with other data, which is selected by a multiplexer 69 to determine which data are fed to the execution unit again.
A complete procedure of compressing and decompressing the instruction sets within a CPU is depicted in FIG. 7. An application program with uncompressed instruction sets is compressed 71 and stored into the so-named "I-Cache" 75 as a predetermined number of groups of compressed instructions. During compression, a counter calculates the data rate of each group of compressed instructions, converts it into a starting address of the I-Cache memory, and saves it in an address mapping buffer 73. During decompression, the compressed instruction sets are accessed by calculating the starting address, which is done by the address mapping unit 73. The group of instructions at the calculated starting address is then accessed, and the instruction sets are decompressed 74 and temporarily saved in a register array 76 for feeding to the file register 701 at a scheduled time. The depth of the temporary buffer 70, 79 for saving the decompressed instructions is defined jointly with the file register to ensure that the ALU 702 continuously runs instructions without underflowing the file register.
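The underflow condition mentioned above can be expressed as a simple sizing rule. This is a hedged sketch only; the one-instruction-per-cycle issue rate and the stall figures are illustrative assumptions, not values from the invention:

```python
def underflows(buffer_depth, file_register_rows,
               alu_issue_per_cycle, worst_stall_cycles):
    """True if the ALU would drain both the temporary buffer and the
    file register during the decompressor's worst-case stall."""
    available = buffer_depth + file_register_rows
    demand = alu_issue_per_cycle * worst_stall_cycles
    return demand > available

# 16 buffered + 32 file-register rows cover a 40-cycle stall at
# one instruction per cycle; a 48-row demand would not be covered.
underflows(16, 32, 1, 40)
```

Sizing the buffer 70, 79 jointly with the file register, as the paragraph states, amounts to keeping this inequality false for the worst stall the decompressor can exhibit.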
The address can be stored in the address mapping unit or embedded into the I-Cache memory. To make it easier for the storage device, or said I-Cache, to save the compressed instruction data and the starting address identifying each group of compressed instruction sets, the compressed instructions and starting addresses can be saved in predetermined different locations. In a hardware implementation of compressing the application program as shown in FIG. 8, a compression engine 81 with counters calculating the bit rate 87, 88 of each group of instructions is combined with a register temporarily saving the starting address 82 of a group of instructions. The starting address and the compressed instruction data can share the same output bus 83 with a MUX 84 as an output selector, or be output separately to the targeted storage device. A control unit 83 generates the selection signal and sends out two enable signals 85, 86 to indicate the availability of compressed instruction data or a starting address. With valid data on the data/address bus along with the "Data-Rdy" (data ready) or "Addr-Rdy" (address ready) signal, the storage device saves the data or address in separate locations without confusion.
FIG. 9 shows the timing diagram of the handshaking of the data/address and control signals of the compression engine. The valid data 93, 94 or addresses 95, 96 are output, most likely in a burst mode, with the D-Rdy (data valid) 97, 98 and A-Rdy (address valid) 99, 910 signals being active-high enables. All signals and data are synchronized with the clock 91, 92. With this kind of handshaking mechanism, the storage device, or said I-Cache, clearly knows the type and timing of the valid data and the starting addresses of the groups of instructions. The temporary register saving the starting address can be overwritten after the stored address information is sent out to the I-Cache. By scheduling the output of the starting address and overwriting the register with the new starting address of new groups of compressed instructions, the density of the temporary register can be minimized.
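A toy cycle-by-cycle model of this handshake may clarify how the shared bus stays unambiguous; the tuple layout and signal names mirror the description above but are otherwise assumed:

```python
def latch_stream(cycles):
    """Model the storage-device side of FIG. 9: on each clock, latch
    the shared bus into the data area when D-Rdy is high, or into the
    address area when A-Rdy is high; otherwise ignore the bus."""
    data, addrs = [], []
    for bus, d_rdy, a_rdy in cycles:   # one tuple per clock cycle
        if d_rdy:
            data.append(bus)
        elif a_rdy:
            addrs.append(bus)
    return data, addrs

# Two data beats and one address beat interleaved on the same bus
latch_stream([(0xAB, 1, 0), (0x100, 0, 1), (0xCD, 1, 0)])
```

Because the two ready signals are asserted on disjoint cycles, the storage device never has to guess whether a bus value is compressed instruction data or a starting address.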
It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of this invention provided they fall within the scope of the following claims and their equivalents.