One of the fundamental operations in video encoding or multi-channel transrating is the use of variable length codes (e.g., Huffman codes) to model different values for syntax elements. For example, compression of audio, video and/or speech may rely on such variable length codes. For Huffman coding, the symbols are ordered according to probability of use, with the most frequently occurring values of syntax elements being assigned shorter codes. Improving the speed of variable length code insertion (reducing the number of processing cycles needed for the operation) improves the overall speed of operations such as video encoding and video transrating.
In accordance with at least some embodiments, a digital signal processor (DSP) includes an instruction fetch unit and an instruction decode unit in communication with the instruction fetch unit. The DSP also includes a register set and a plurality of work units in communication with the instruction decode unit. The DSP selectively uses a dedicated insert instruction to insert a variable number of bits into a register.
In at least some embodiments, a system includes a data source that provides workload data and a DSP coupled to the data source. The DSP modifies the workload data from the data source using a dedicated insert instruction that inserts a variable number of bits into the workload data. The system further comprises a data sink that receives the modified workload data from the DSP.
In at least some embodiments, a method includes receiving, by a DSP, workload data and inserting, by the DSP, a variable number of bits into the workload data using a dedicated insert instruction. The method also includes tracking, by the DSP, a bit pointer location adjacent inserted bits using a dedicated bit pointer tracking instruction associated with the dedicated insert instruction.
For a detailed description of exemplary embodiments of the invention, reference will now be made to the accompanying drawings in which:
Certain terms are used throughout the following description and claims to refer to particular system components. As one skilled in the art will appreciate, companies may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . . ” Also, the term “couple” or “couples” is intended to mean either an indirect or direct electrical connection. Thus, if a first device couples to a second device, that connection may be through a direct electrical connection, or through an indirect electrical connection via other devices and connections. The term “system” refers to a collection of two or more hardware and/or software components, and may be used to refer to an electronic device or devices or a sub-system thereof. Further, the term “software” includes any executable code capable of running on a processor, regardless of the media used to store the software. Thus, code stored in non-volatile memory, and sometimes referred to as “embedded firmware,” is included within the definition of software.
The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims. In addition, one skilled in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be exemplary of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.
One of the fundamental operations in video encoding or multi-channel transrating is to use variable length codes to model different syntax elements; compression of audio, video, and/or speech relies on such codes. For example, Huffman coding varies the length of multi-bit codes corresponding to information based on probability of use (i.e., more probable values are assigned shorter codes).
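Purely for illustration (the symbols and codes below are hypothetical and not taken from any particular standard), a variable length code table may be sketched in C as a pairing of code bits and code lengths, with the shorter codes assigned to the more probable values:

/* Hypothetical variable length (Huffman-style) code table entry: the code
 * bits are kept right-justified and "len" gives the number of valid bits. */
struct vlc_entry {
    unsigned code;   /* code bits, right-justified */
    unsigned len;    /* code length in bits        */
};

static const struct vlc_entry example_table[] = {
    { 0x0u, 1 },   /* most probable value  -> "0"   (1 bit)  */
    { 0x2u, 2 },   /* next most probable   -> "10"  (2 bits) */
    { 0x6u, 3 },   /* less probable        -> "110" (3 bits) */
    { 0x7u, 3 },   /* least probable       -> "111" (3 bits) */
};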
Embodiments of the disclosure are directed to techniques for improving the speed of variable length code insertion and thereby improve the speed of operations (e.g., video encoding and/or multi-channel transrating) that rely on variable length code insertion. In at least some embodiments, a dedicated insert instruction is employed by a digital signal processor (DSP) architecture to perform variable length code insertion. Further, a dedicated bit pointer maintenance instruction is employed by the DSP in conjunction with the dedicated insert instruction to track a bit pointer location resulting from variable length code insertion. The techniques described herein may be implemented, for example, with a very-long instruction word (VLIW) architecture such as the C64x™ or C64x+™ DSP architectures.
As shown, the mobile computing system 100 contains a megacell 102 which comprises a processor core 116 (e.g., an ARM core) and DSP 118 which aids the core 116 by performing task-specific computations, such as graphics manipulation and speech processing. The megacell 102 also comprises a direct memory access (DMA) 120 which facilitates direct access to memory in the megacell 102. The megacell 102 further comprises liquid crystal display (LCD) logic 122, camera logic 124, read-only memory (ROM) 126, random-access memory (RAM) 128, synchronous dynamic RAM (SDRAM) 130 and storage (e.g., flash memory or hard drive) 132. The megacell 102 may further comprise universal serial bus (USB) logic 134 which enables the system 100 to couple to and communicate with external devices. The megacell 102 also comprises stacked OMAP logic 136, stacked modem logic 138, and a graphics accelerator 140 all coupled to each other via an interconnect 146. The graphics accelerator 140 performs necessary computations and translations of information to allow display of information, such as on display 104. Interconnect 146 couples to interconnect 148, which couples to peripherals 142 (e.g., timers, universal asynchronous receiver transmitters (UARTs)) and to control logic 144.
As an example, the mobile computing system 100 may correspond to devices such as a cellular telephone, a personal digital assistant (PDA), a text messaging system, and/or a smart phone. Thus, some embodiments may comprise a modem chipset 114 coupled to an antenna 96 and/or global positioning system (GPS) logic 112 likewise coupled to an antenna 98.
The megacell 102 further couples to a battery 110 which provides power to the various processing elements. The battery 110 may be under the control of a power management unit 108. In some embodiments, a user may input data and/or messages into the mobile computing system 100 by way of the keypad 106. Because many mobile devices have the capability of taking digital still and video pictures, in some embodiments, the mobile computing system 100 may comprise a camera interface 124 which enables camera functionality. For example, the camera interface 124 may enable selective charging of a charge-coupled device (CCD) array (not shown) for capturing digital images.
Although the discussion of the mobile computing system 100 provides one example of a suitable platform, the disclosed techniques are not limited to mobile computing systems and may be implemented in other systems that include a DSP.
In accordance with C64x+ DSP core embodiments, the instruction fetch unit 202, 16/32-bit instruction dispatch unit 206, and the instruction decode unit 208 can deliver up to eight 32-bit instructions to the work units every CPU clock cycle. The processing of instructions occurs in each of two data paths 210A and 210B. As shown, the data path A 210A comprises work units, including an L1 unit 212A, an S1 unit 214A, an M1 unit 216A, and a D1 unit 218A, whose outputs are provided to register file A 220A. Similarly, the data path B 210B comprises work units, including an L2 unit 212B, an S2 unit 214B, an M2 unit 216B, and a D2 unit 218B, whose outputs are provided to register file B 220B.
In accordance with C64x+ DSP core embodiments, the L1 unit 212A and L2 unit 212B are configured to perform various operations including 32/40-bit arithmetic operations, compare operations, 32-bit logical operations, leftmost 1 or 0 counting for 32 bits, normalization count for 32 and 40 bits, byte shifts, data packing/unpacking, 5-bit constant generation, dual 16-bit arithmetic operations, quad 8-bit arithmetic operations, dual 16-bit minimum/maximum operations, and quad 8-bit minimum/maximum operations. The S1 unit 214A and S2 unit 214B are configured to perform various operations including 32-bit arithmetic operations, 32/40-bit shifts, 32-bit bit-field operations, 32-bit logical operations, branches, constant generation, register transfers to/from a control register file (the S2 unit 214B only), byte shifts, data packing/unpacking, dual 16-bit compare operations, quad 8-bit compare operations, dual 16-bit shift operations, dual 16-bit saturated arithmetic operations, and quad 8-bit saturated arithmetic operations. The M1 unit 216A and M2 unit 216B are configured to perform various operations including 32×32-bit multiply operations, 16×16-bit multiply operations, 16×32-bit multiply operations, quad 8×8-bit multiply operations, dual 16×16-bit multiply operations, dual 16×16-bit multiply with add/subtract operations, quad 8×8-bit multiply with add operation, bit expansion, bit interleaving/de-interleaving, variable shift operations, rotations, and Galois field multiply operations. The D1 unit 218A and D2 unit 218B are configured to perform various operations including 32-bit additions, subtractions, linear and circular address calculations, loads and stores with 5-bit constant offset, loads and stores with 15-bit constant offset (the D2 unit 218B only), load and store doublewords with 5-bit constant, load and store nonaligned words and doublewords, 5-bit constant generation, and 32-bit logical operations. Each of the work units reads directly from and writes directly to the register file within its own data path. Each of the work units is also coupled to the opposite-side register file's work units via cross paths. For more information regarding the architecture of the C64x+ DSP core and supported operations thereof, reference may be had to Literature Number: SPRU732H, “TMS320C64x/C64x+ DSP CPU and Instruction Set”, October 2008, which is hereby incorporated by reference herein.
Variable length code insertion can be performed by the C64x and C64x+ DSP architectures without the dedicated insert instruction and the dedicated bit pointer instruction described herein, but the performance is reduced. When performing variable length code insertion, a worklist may be built instead of performing variable length code insertion one code at a time. The worklist enables use of software pipelining to achieve an additional boost in performance. The serial assembly code for a legacy variable length code insertion technique is shown below.
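As a C-level sketch of the worklist idea (this is not the assembly listing itself; the work_item structure, the flush_worklist routine, and the put_bits callback are hypothetical names used only for illustration):

#include <stdint.h>

/* Hypothetical worklist entry: a codeword and its length are gathered first
 * so that the insertion work can be done in a single regular loop. */
struct work_item {
    uint32_t code;   /* codeword bits  */
    uint32_t len;    /* length in bits */
};

/* Process the whole worklist in a tight loop rather than inserting each code
 * as it is produced; a regular loop body of this form is what the software
 * pipeliner (or the C64x+ SPLOOP mechanism) can overlap across iterations. */
static void flush_worklist(const struct work_item *list, int n,
                           void (*put_bits)(uint32_t code, uint32_t len))
{
    for (int i = 0; i < n; i++)
        put_bits(list[i].code, list[i].len);
}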
When run through the software pipeliner, the serial assembly code produces the 3-cycle loop shown here:
The same serial assembly code, when compiled for the C64x+ architecture, uses the SPLOOP mechanism and achieves a 2-cycle loop, a 33% improvement relative to the C64x architecture. Code for the C64x+ architecture is shown below after a slight change to the serial assembly code: instead of using LDDW to load the code and length as a single double word, two load-word instructions are used to load the values on opposite data paths.
The scheduled C64x+ code is shown below:
In steady state, the loop buffer will appear as follows:
In the above steady-state code, all of the work units except M1 and M2 are used. In accordance with embodiments of the disclosure, use of the dedicated insert instruction and the dedicated bit pointer instruction for variable length code insertion doubles the performance of the looping case and reduces the number of cycles for the list-schedule case (the non-looping case) by 4. The improvement in performance is accomplished without modifying the load-store bandwidth.
With the dedicated insert instruction and dedicated bit pointer maintenance instruction disclosed herein, the performance of variable length code insertion is improved compared to the variable length code insertion techniques described previously (i.e., the number of cycles and/or the total number of work units needed to perform variable length code insertion is reduced). The dedicated insert (“INS”) instruction can be viewed as a generalization of the shift-right-merge-byte operation to any bit location, as seen below.
INS B_outw:B_code, B_bp, B_outw:B_code
The INS instruction results in the operations described below.
In at least some embodiments, the INS instruction operates within existing opcode limitations of the C64x, where only one double word source is specified. In order to facilitate use of the INS instruction, the output codeword (outw) is assumed to be a 32-bit field, which contains the partial word left-justified. Meanwhile, the bit pointer (bp), which contains the number of bits from the left that have been filled, has a value 0<=bp<=31. It is also assumed that the loaded codeword is maintained in memory left-justified. This method of maintaining bits is preferable as the partial word can be updated in a simple fashion as follows:
outw=(outw|(code>>bp))
The codeword may have some overflow bits that can be computed and placed in a next codeword as:
code=(code<<(32−bp))
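Collecting the two updates above into one place, a minimal C sketch of this left-justified update (the function name ins_update is hypothetical, and the guard against a shift by 32 is only a C-level detail of the sketch) is:

#include <stdint.h>

/* Minimal sketch of the left-justified update: append the code bits after the
 * "bp" already-filled bits of "outw" and leave any overflow bits
 * left-justified in "code" for the next codeword. */
static void ins_update(uint32_t *outw, uint32_t *code, unsigned bp)
{
    *outw |= *code >> bp;                       /* outw = outw | (code >> bp) */
    *code  = bp ? (*code << (32 - bp)) : 0;     /* code = code << (32 - bp)   */
}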
With “outw” and “code” maintained in this way, variable length code insertion is possible without knowledge of “len”. Advantageously this technique is possible within the existing op-codes that are allowed in the C64x architecture. If, on the other hand, both the partial output codewords “outw” and “code” are maintained right-justified, then the following update operations will be needed.
shift=(len<(32−bp))?len:(32−bp);
outw=(outw<<shift)|(code>>(len−shift)); code=(code & ((1<<(len−shift))−1));
As an example: if bp=30 and len=5 for the code, this will cause overflow since only 2 bits out of the 5 bits to be inserted can be accepted.
shift=(5<2)?5:2 results in shift=2.
outw=(outw<<2)|(code>>3), code=code & ((1<<3)−1)=code & 7;
(retain lower 3 bits)
As another example: if bp=28 and len=2 for the code, this will not cause overflow:
shift=(2<4)?2:4 results in shift=2.
outw=(outw<<2)|(code>>0); code=(code & ((1<<0)−1))=code & 0=0
With these update equations, a compare and three shifts are needed; thus, maintaining partial output codewords right-justified requires more operations than maintaining partial output codewords left-justified. Additionally, maintaining partial output codewords right-justified requires knowledge of both the bit pointer “bp” and the length of the inserted code “len”.
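For comparison, a minimal C sketch of the right-justified update (the names rj_update, outw, code, bp, and len are hypothetical; the sketch simply reproduces the two worked examples above) is:

#include <assert.h>
#include <stdint.h>

/* Right-justified update: both "outw" and "code" keep their valid bits in the
 * low-order positions, so the shift amount depends on both "bp" and "len"
 * (1 <= len <= 31 and 0 <= bp <= 31 are assumed for brevity). */
static unsigned rj_update(uint32_t *outw, uint32_t *code, unsigned bp, unsigned len)
{
    unsigned shift = (len < (32 - bp)) ? len : (32 - bp);
    *outw = (*outw << shift) | (*code >> (len - shift));
    *code = *code & ((1u << (len - shift)) - 1u);   /* retain the overflow bits */
    return shift;
}

int main(void)
{
    uint32_t outw = 0, code = 0x1F;                 /* arbitrary example values */
    assert(rj_update(&outw, &code, 30, 5) == 2);    /* bp=30, len=5 -> shift=2  */
    outw = 0; code = 0x3;
    assert(rj_update(&outw, &code, 28, 2) == 2);    /* bp=28, len=2 -> shift=2  */
    return 0;
}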
If the partial output word “outw” is maintained left-justified, then:
shift=(len<(32−bp))?len:(32−bp);
outw=outw|(code>>(len−shift)); code=code<<(32−len+shift).
These operations leave code left-justified (outw is also left-justified) in case of overflow and require fewer operations. However, knowledge of bit-pointer “bp” and length of code “len” is still required.
To track the bit pointer position, the dedicated bit pointer maintenance (“MAINT”) instruction is used. The MAINT instruction allows users to track overflow for powers of 2 (to simplify the modulo calculation), which can be programmed in a control register. In at least some embodiments, the value needed is 32 (i.e., 2^5). The MAINT instruction reduces the number of operations needed to track the bit pointer location as shown below.
These operations add the length of the codeword “len” to the existing bit pointer “bp” and, if the result equals or exceeds 32, a flag is set for B_f. Otherwise, B_f is 0. The incremented bit pointer is also kept modulo 32. This requires one addition and two parallel ANDs on the modified value. These operations can be done in a single cycle since the addition adds two 5-bit values, checks for a carry in the 6th bit, and removes the carry if it exists to keep the value mod 32.
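As a rough C-level sketch of this bit-pointer maintenance (the names maint_update and flag are hypothetical), the single addition and the two parallel ANDs might look like:

/* Sketch of the bit-pointer maintenance: add "len" to the bit pointer "bp",
 * flag the carry into the 6th bit (the sum reached 32, i.e. the codeword
 * filled), and keep the bit pointer modulo 32. */
static void maint_update(unsigned *bp, unsigned len, unsigned *flag)
{
    unsigned sum = *bp + len;   /* one addition of two small (5-bit) values  */
    *flag = sum & 32;           /* AND: carry bit set if the codeword filled */
    *bp   = sum & 31;           /* AND: remove the carry, keep bp modulo 32  */
}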
The serial assembly code that uses these two instructions is shown below. In at least some embodiments, use of the INS and MAINT instructions enables put-bit operations to be performed on two independent channels in parallel.
For the looping case, the performance improvement resulting from use of the INS and MAINT instructions for variable length code insertion is 2×. This is because the resulting piped loop kernel works on two channels in the same 2 cycles, thus effectively doubling the throughput compared to the C64x+ performance for multichannel applications. The corresponding piped loop kernel is shown below.
As shown in the piped loop kernel above, the AND and ANDN operations related to the MAINT instruction are performed by an .L work unit. Meanwhile, operations related to the INS instruction are performed by an .S work unit. Further, since “outw” and “code” need to be a register pair, and since “code” and “len” need to be loaded as a register pair, an extra set of moves is required to move the loaded code into the register pair.
For the list scheduled case, when work is performed on two channels, the original C64x/C64x+ code takes 7 cycles after the codeword and length have been loaded, as shown below:
In contrast, with the INS and MAINT instructions, the two channels are completed in 3 cycles after the code and length are loaded, thereby saving 4 full cycles for some other operations to run. In addition, even during the busy compute cycles, some additional non-M work units (e.g., L and S work units) are free in 2 out of the 3 cycles, allowing for more threads to be parallelized within the current computation. Thus, these instructions will allow performance (the reduction in cycles) to be improved beyond 7/3=2.33×. As a comparison, the 3-cycle performance of the list scheduled case (once the code and length have been loaded, as shown below) is the same as the looping performance on the C64x. Further, the INS and MAINT instructions are advantageously possible within the existing op-code space of the C64x and C64x+ architectures.
To summarize, the INS and MAINT instructions described herein reduce the total number of DSP cycles needed to perform variable length code insertion. In the example loop given above, two sets of INS and MAINT instructions are performed on two data paths over 3 cycles. The two sets of INS and MAINT instructions may correspond to encoding in accordance with two alternative Huffman tables. In other words, it may not be possible to know a priori which Huffman table will give the best compression, so the INS and MAINT instructions may be used to encode received data in parallel based on two candidate Huffman tables. Thereafter, the bit stream produced by each data path is analyzed and the bit stream with the fewest bits is selected, as it is the more efficiently compressed. This technique for selecting a Huffman table may be implemented for encoding standards (e.g., JPEG) that allow a particular Huffman table to be specified. In accordance with at least some embodiments, selection between multiple possible Huffman tables is deferred and is based on the results of real-time encoding by a DSP with parallel data paths, rather than a sub-optimal static selection.
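A high-level C sketch of this deferred table selection (the encode_fn callback, pick_table, and the table and stream arguments are all hypothetical) might look like:

#include <stddef.h>
#include <stdint.h>

struct vlc_entry;   /* code-table entry type, e.g. as sketched earlier */

/* Hypothetical encoder: writes a bitstream for "data" using "table" into
 * "out" and returns the number of bits produced. */
typedef size_t encode_fn(const uint8_t *data, size_t n,
                         const struct vlc_entry *table, uint32_t *out);

/* Encode the same data against two candidate Huffman tables (conceptually one
 * per DSP data path) and keep whichever bitstream uses fewer bits. */
static const struct vlc_entry *pick_table(encode_fn *encode,
                                          const uint8_t *data, size_t n,
                                          const struct vlc_entry *table_a,
                                          const struct vlc_entry *table_b,
                                          uint32_t *stream_a, uint32_t *stream_b)
{
    size_t bits_a = encode(data, n, table_a, stream_a);
    size_t bits_b = encode(data, n, table_b, stream_b);
    return (bits_a <= bits_b) ? table_a : table_b;    /* fewer bits wins */
}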
The reduction of cycles by use of the INS and MAINT instructions applies to both the looping (e.g., software pipeline) scenario and the list scheduled (non-looping) scenario. Further, the INS and MAINT instructions reduce the total number of work units needed during at least some of the DSP cycles dedicated to variable length code insertion (i.e., an increased amount of parallel operations not related to variable length code insertion can be performed). Although the INS and MAINT instructions have been described for (and are compatible with) the C64x and C64x+ architectures, other DSP architectures may similarly benefit from a dedicated insert instruction and a dedicated bit pointer maintenance instruction for performing variable length code insertion.
In at least some embodiments, the DSP 304 performs variable length code insertion on workload data received from the data source 302 using a dedicated insert instruction (e.g., the INS instruction) 306 and a dedicated bit pointer maintenance instruction (e.g., the MAINT instruction) 308. The dedicated insert instruction operates, for example, as a shift right merge byte operation to any bit location of a register. The dedicated bit pointer maintenance instruction operates to track a bit pointer location resulting from the dedicated insert instruction. The modified workload data (modified by variable length code insertion operations) is output from the DSP 304 to the data sink 316. In at least some embodiments, the workload data corresponds to video data and the variable length code insertion operations are for video encoding and/or video transrating.
During the variable length code insertion operations, the dedicated insert instruction may, for example, cause selective insertion of 1 to 32 bits into a 32-bit register. When the register is full, the contents of the register are output and a next codeword begins. In conjunction with the dedicated insert instruction, the dedicated bit pointer maintenance instruction causes a register bit location following (e.g., adjacent to) inserted bits associated with the dedicated insert instruction to be marked. If there are any overflow bits resulting from the dedicated insert instruction being performed, a codeword comprising the bits of the filled register is moved to a memory (e.g., data sink 316) and the overflow bits form a next codeword. In accordance with at least some embodiments, a DSP register stores a left-justified codeword and the dedicated insert instruction inserts a variable number of bits from left to right in the register. In such embodiments, overflow bits for a next codeword are also left-justified.
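As a minimal C sketch of that flow (the bitstream structure and insert_and_track are hypothetical, and the flag test mirrors the bit-pointer maintenance sketch above):

#include <stdint.h>

/* Hypothetical bitstream state: a left-justified 32-bit partial codeword, the
 * number of bits filled so far (0..31), and the output position in memory. */
struct bitstream {
    uint32_t outw;    /* left-justified partial output codeword  */
    unsigned bp;      /* bit pointer: bits filled so far, 0..31  */
    uint32_t *next;   /* where the next full codeword is stored  */
};

/* Insert a left-justified code of "len" bits, update the bit pointer, and
 * move the codeword to memory when it fills, carrying overflow bits forward. */
static void insert_and_track(struct bitstream *bs, uint32_t code, unsigned len)
{
    bs->outw |= code >> bs->bp;                                /* insert            */
    uint32_t overflow = bs->bp ? (code << (32 - bs->bp)) : 0;  /* next word's bits  */
    unsigned sum = bs->bp + len;
    if (sum & 32) {                     /* codeword filled (carry into 6th bit)  */
        *bs->next++ = bs->outw;         /* emit the completed 32-bit codeword    */
        bs->outw = overflow;            /* overflow bits start the next codeword */
    }
    bs->bp = sum & 31;                  /* bit pointer kept modulo 32            */
}

In this sketch, the filled register is detected by the same carry test used for bit-pointer maintenance, so moving the codeword to memory and starting the next codeword with the overflow bits requires no separate comparison.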
The dedicated insert instruction 306 and the dedicated bit pointer maintenance instruction 308 may be executed, for example, during a loop mode 310 or during a list schedule (non-loop) mode 312 of the DSP 304. During the loop mode 310, the dedicated insert instruction 306 may be performed multiple times for video encoding workload of the DSP. Such use of the dedicated insert instruction 306 reduces a total number of DSP cycles and/or a total number of work units dedicated to video encoding during the video encoding workload. As another example, during the loop mode 310, the dedicated insert instruction may be performed multiple times for a video transrating workload of the DSP to reduce a total number of DSP cycles and/or a total number of work units dedicated to video transrating during the video transrating workload.
In accordance with at least some embodiments, a method 500 comprises receiving, by a DSP, workload data, inserting, by the DSP, a variable number of bits into the workload data using a dedicated insert instruction, and tracking, by the DSP, a bit pointer location adjacent to the inserted bits using a dedicated bit pointer maintenance instruction associated with the dedicated insert instruction.
In at least some embodiments, the method 500 may comprise additional steps. For example, the method 500 may additionally comprise performing variable length code insertion multiple times to encode video data using a loop mode (e.g., a software pipeline) of a DSP. Additionally or alternatively, the method 500 may comprise performing variable length code insertion multiple times for transrating video data using a loop mode (e.g., a software pipeline) of a DSP. The method 500 may additionally comprise detecting a filled register and sending a codeword comprising bits of the filled register to a memory. If there are any overflow bits resulting from executing a dedicated insert instruction, the method 500 may comprise starting a next codeword using the overflow bits.
The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.