In the Figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.
FIGS. 4a-4b provide a flow diagram 400 showing a method in accordance with some embodiments of the present invention for calculating NC;
The present invention is generally related to systems and methods for encoding and decoding information. More particularly, the present invention is related to systems and methods for encoding and/or decoding video information.
In general, the context adaptive techniques offered by, for example, the H.264 specification are designed to take advantage of several characteristics of quantized blocks. In general, a ‘block’ is a 4×4 partition of pixels, which is part of a macro block (which is a 16×16 partition of pixels). Additional information about CAVLC and CAVLD is included in the H.264 Specification available from ITU-T. In particular, CAVLC uses run-level coding to compactly represent strings of zeros which frequently occur in the quantized blocks. In addition, the highest frequency non-zero coefficients in a quantized block are often sequences of +/−1. CAVLC signals the number of +/−1 coefficients in a compact way. These are often referred to as “trailing ones” or “T1s”, and are coded separately in single bits with a ‘0’ representing a +1 and a ‘1’ representing a −1. Also, there is often a substantial amount of correlation among neighboring blocks in terms of the number of non-zero coefficients. CAVLC exploits this characteristic by taking the neighboring blocks' non-zero coefficients as predictors to code the current block's total number of non-zero coefficients. This total number of non-zero coefficients is encoded using a selected look-up table, with the selection of look-up table depending upon the number of non-zero coefficients in neighboring blocks. As will be appreciated by one of ordinary skill in the art, CAVLD performs the reverse of the CAVLC processes to reconstruct the compressed data stream created using CAVLC.
Some embodiments of the present invention provide advanced approaches for performing CAVLC and/or CAVLD that may be in some cases advantageous when implemented on a VLIW processor. In some cases, the embodiments utilize one or more processes, either separately or in combination, including table look-ups, formulas, and unique bit-pattern arrangements for availability of neighbors, along with the right composition of different software pipelined loops to provide an efficient processing platform. The aforementioned processes may be utilized in relation to segregating residual block data provided during data encoding into different symbols including Coeff_Token (indicating total number of non-zero coefficients and number of trailing ones), levels and/or run before values.
Some embodiments of the present invention provide systems and methods for decoding video image data. As used herein, the phrase “video image data” is used in its broadest sense to mean any series or group of two or more related images. Thus, video image data may be, but is in no way limited to, a video that includes multiple frames of image data. Based on the disclosure provided herein, one of ordinary skill in the art will recognize a number of types of video image data that may be accessed and/or manipulated in accordance with one or more embodiments of the present invention. Such methods may include receiving an encoded video image data set. As used herein, the phrase “encoded video image data set” is used in its broadest sense to mean any portion of video image data that has been modified from one form to another form. Thus, an encoded video image data set may be, but is not limited to, H.264/MPEG-4 AVC encoded data. The methods further include determining a run before value and a non-zero coefficient value based on the video image data set. As used herein, a “run before value” is any indicator that suggests the number of zero values proceeding or preceding a non-zero value. Thus, as just one example, where a stream of information includes four zeros followed by one non-zero, the run before value may be four. In the method, the non-zero coefficient value is stored to a memory register, and a position of the non-zero coefficient value is determined based at least in part on the run before value. In addition, an inverse quantization is performed on the non-zero coefficient value prior to removing the non-zero coefficient value from the memory register. Such an inverse quantization may be any calculation or mathematical procedure as is currently performed in relation to decoding encoded video image data.
Various systems in accordance with the aforementioned embodiments may include a processor based computer associated with a computer readable medium, where the computer readable medium includes instructions executable by the processor. As used herein, the term “processor” is used in its broadest sense to mean any system or device capable of executing instructions. Thus, as just one example, the processor may be what is generally referred to as a microprocessor, a microcontroller, or a digital signal processor. In some cases, the processor is a substantially parallel device such as a very long instruction word device as are known in the art. In some cases, the instructions are software, firmware and/or machine code that are either directly executable by the processor, or that may be compiled or otherwise transformed for execution by the processor.
Other embodiments of the present invention provide systems and methods for manipulating video data. Such methods include providing a look up table memory that is organized as a plurality of words. Such a memory may be implemented using any computer readable media including, but not limited to, a hard disk drive, a random access memory, an electrically erasable read only memory, a magnetic storage media, an optical storage media, combinations thereof, and/or the like. Each of the plurality of words is accessible via a single access to the look up table memory. A particular word of the plurality of words includes at least two decoded run before values. In some cases, the methods further include receiving an encoded video image data set, and extracting an encoded run before value from the encoded video image data set. As used herein, the phrase “encoded run before value” is a run before value that has in some way been modified, and may be decoded to retrieve the original value.
Yet other embodiments of the present invention provide systems and methods for decoding an encoded video data image set. Such methods include assigning a neighbor block availability word to a block within the encoded video image data set, and loading an array of neighbor block information associated with the block within the encoded video image data set. As used herein, the phrase “neighbor block availability word” is used in its broadest sense to mean any information set that is indicative of whether or not a particular block is surrounded by other available blocks. An NC value associated with the block within the encoded video image data set is calculated using a parallel tailored equation to perform the calculation. As used herein, an “NC value” represents the index used to retrieve the Coeff_Token symbol from a look-up table. Also, as used herein, the term “Coeff_Token” denotes a data set that contains the information regarding number of non-zero coefficients and number of trailing ones of a particular block of data. Further, as used herein, the phrase “parallel tailored equation” is used in its broadest sense to mean any equation and/or calculation process that is executable with reduced data dependency.
Discussion of the inventions is presented in relation to a flow diagram of
There are four choices of look-up table to use for encoding Coeff_Token that are specified in the H.264 standard. The choice of table depends on a variable NC. NC is derived from the number of non-zero coefficients in the upper (NT) and left-hand (NL) previously coded blocks. Thus, one of the first tasks to be performed is to determine the availability of the neighboring blocks. In some cases, an available neighboring block will belong to the same macro block, while in other cases, it will belong to a different macro block.
When decoding the Coeff_Token, the value of NC is derived from the neighboring blocks' non-zero coefficients (NT and NL). NC is used to determine the table index required for decoding the Coeff_Token symbol of the current block. Where both NT and NL are available, NC is calculated as their average; otherwise, it is simply assigned whichever of NT or NL is available. If neither NT nor NL is available, NC is assigned a default value of zero. The following equations describe the aforementioned conditions:
NC=(NT+NL+1)/2, where both NT and NL are available (which may be implemented as (NT+NL+1)>>1 where an integer operation is desired);
NC=NT, where only NT is available;
NC=NL, where only NL is available; and
NC=0, where neither NT nor NL are available.
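The four conditions above can be sketched as follows. This is a minimal illustration only; the function name `nc_standard` and the use of `None` to mark an unavailable neighbor are assumptions made for the sketch and do not appear in the specification.

```python
from typing import Optional


def nc_standard(nt: Optional[int], nl: Optional[int]) -> int:
    """Return NC from the top (NT) and left (NL) non-zero coefficient counts.

    None marks a neighbor that is not available (an assumption of this sketch).
    """
    if nt is not None and nl is not None:
        # Both neighbors available: rounded average in integer form.
        return (nt + nl + 1) >> 1
    if nt is not None:
        return nt          # only NT available
    if nl is not None:
        return nl          # only NL available
    return 0               # neither available
```

For instance, `nc_standard(3, 4)` averages the two counts with rounding, while `nc_standard(None, None)` falls back to the default of zero.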
Turning to
Turning now to
It is first determined whether the Column counter is equal to zero (block 406). In such a situation, NL for the block being processed is in a left-hand macro block (i.e., Left MB). Thus, where the Column counter is not equal to zero (block 406), the neighboring NL block for the block being processed is within the current macro block (i.e., Current MB) (block 424). Alternatively, where the Column counter is equal to zero (block 406), the neighboring NL block for the block being processed is found in Left MB (block 427) where Left MB is available (block 409).
Where a value was assigned for NL (blocks 424, 427), it is determined whether the Row counter is equal to zero (block 418). Where the Row counter is not equal to zero (block 418), the neighboring NT block for the block being processed is within the Current MB (block 436). Alternatively, where the Row counter is equal to zero (block 418), the neighboring NT block for the block being processed is found in the upper macro block (i.e., Top MB) (block 439) where Top MB is available (block 421). In either of the aforementioned cases (blocks 436, 439) a value is assigned to both NL and NT, and thus the value of NC is described by the following equation: NC=(NL+NT+1)/2 (block 442). Alternatively, where Top MB is not available (block 421), no value is assigned for NT, and the value assigned to NC is described by the following equation: NC=NL (block 445).
Where the Column counter is equal to zero (block 406) and the Left MB is not available (block 409), no value is assigned to NL. It is additionally determined whether the Row counter is equal to zero (block 412). Where the Row counter is not equal to zero (block 412), the neighboring NT block for the block being processed is within the Current MB (block 430). Alternatively, where the Row counter is equal to zero (block 412), the neighboring NT block for the block being processed is found in the Top MB (block 433) where Top MB is available (block 415). In either of the aforementioned cases (blocks 430, 433) a value is assigned to NT but not NL, and thus the value of NC is described by the following equation: NC=NT (block 448). Alternatively, where Top MB is not available (block 415), no value is assigned to either NL or NT and the value assigned to NC is zero (block 451).
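The branch structure of flow diagram 400 may be sketched as below. The data layout is an assumption made only for illustration: each macro block is represented as a hypothetical 4×4 grid of non-zero coefficient counts indexed `[row][column]`, and an unavailable Left MB or Top MB is passed as `None`.

```python
from typing import List, Optional

Grid = List[List[int]]  # hypothetical 4x4 grid of non-zero coefficient counts


def nc_branching(row: int, col: int, cur_mb: Grid,
                 left_mb: Optional[Grid], top_mb: Optional[Grid]) -> int:
    """Branch-based NC determination following the Row/Column counter tests."""
    # Left neighbor: inside the Current MB unless the Column counter is zero.
    if col != 0:
        nl = cur_mb[row][col - 1]
    elif left_mb is not None:
        nl = left_mb[row][3]       # far right column of the Left MB
    else:
        nl = None
    # Top neighbor: inside the Current MB unless the Row counter is zero.
    if row != 0:
        nt = cur_mb[row - 1][col]
    elif top_mb is not None:
        nt = top_mb[3][col]        # bottom row of the Top MB
    else:
        nt = None
    # Combine according to which neighbors received a value.
    if nl is not None and nt is not None:
        return (nl + nt + 1) >> 1
    if nl is not None:
        return nl
    if nt is not None:
        return nt
    return 0
```

Every block passes through two or more data-dependent branches in this form, which is the overhead the bit pattern based method is intended to avoid.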
With the value of NC thus calculated, NC may be used to decode the Coeff_Token and finish the CAVLD process for the given block as is known in the art (block 454). In general, the remaining processing is the reverse of the processes described below in relation to blocks 220-250 of
The process shown in flow diagram 400 demands considerable processing bandwidth (approximately three hundred cycles for each macro block processed), as well as memory to store the corresponding co-ordinates associated with each block. In contrast, one or more embodiments of the present invention implement a bit pattern based method for determining NC. An example of such embodiments is more fully described in relation to
A twenty-four bit pattern (i.e., Avail_Info) is defined for each block depending upon the position of the macro block within a given slice.
Alignment 620 includes the current MB at least one row from the top of a slice 622, and at the far left column of slice 622. In this case, the far left column of predictors L1-L8 are not available, but all T1-T8 are available for the current MB. This is depicted in a region 625 where a ‘1’ is placed in each position representing an available predictor, and a ‘0’ indicates unavailable predictors for the twenty-four blocks corresponding to those described in
Alignment 630 includes the current MB at least one column from the far left of a slice 632, and at the top of slice 632. In this case, the top row of predictors T1-T8 are not available, but all L1-L8 are available for the current MB. This is depicted in a region 635 where a ‘1’ is placed in each position representing an available predictor, and a ‘0’ indicates unavailable predictors for the twenty-four blocks corresponding to those described in
Alignment 640 includes the current MB at the far left and top of a slice 642. In this case, neither of predictors T1-T8 nor L1-L8 are available for the current MB. This is depicted in a region 645 where a ‘1’ is placed in each position representing an available predictor, and a ‘0’ indicates unavailable predictors for the twenty-four blocks corresponding to those described in
Turning now to
NL_Arr and NT_Arr are updated using the respective left and top indices with the non-zero coefficient value decoded for the current block (block 720). This is done before starting the process of CAVLD including the Coeff_Token decoding for the subsequent block. Separate NL_Arr and NT_Arr are maintained for Cb and Cr. In particular, an array of left neighbors (i.e., NL_Arr[0..3]) is filled with the far right column of the available neighboring left MB and an array of top neighbors (i.e. NT_Arr[0..3]) is filled with the bottom row of the available top MB. For example, as illustrated in
The coded block pattern (i.e., CBP) is expanded to form a coded sub-block pattern (i.e., CSBP) (block 730). Generating CSBP from CBP may be used in one or more embodiments of the present invention to provide memory savings and form an optimized reconstruction loop as more fully described in relation to block 740 below. In general, the CBP is provided for each 8×8 block indicating whether the 8×8 block includes any non-zero coefficients and thus has to be decoded. A CBP is assigned to each block and results in an irregular decode loop structure that often exhibits substantial overhead due to abrupt branching. In addition, general approaches to CBP coding allocate memory based on worst case scenarios where all blocks for a given macro block are assumed to be coded with non-zero coefficients.
The CBP is a six bit pattern that is available from the bitstream. In particular, the CBP is a six bit pattern with four least significant bits (i.e., right bits) assigned to Luma and the two most significant bits (i.e., left bits) assigned to chroma. Of the two chroma bits, the farthest left is a DC value and the other is an AC value. Where the DC value is equal to a ‘0’, the AC value will also be equal to ‘0’. Thus, possible chroma bit values (uvDC, uvAC) include: 11, 10, 00. The standard six bit CBP is expanded to a twenty-four bit CSBP. The CSBP is used to indicate blocks for which an NC value is to be calculated. By providing this information, a non-branching direct index and calculation of an address for coded blocks is possible. Further, as more fully described below, the CSBP provides for efficient memory utilization by marking the zero-coefficient blocks, and only allocating memory for use in relation to the non-zero coefficient blocks. Thus, reconstruction loops make use of the CSBP and perform inverse transform and error addition only on the blocks with non-zero coefficients.
Expanding the CBP to obtain the CSBP begins by setting four consecutive bits of the CSBP equal to each bit of the CBP. This provides for the initial expansion from six bits to twenty-four bits. This process is completed as the CBP is accessed from the bit stream. As a further refinement, where any of the chroma AC coefficients are present, it is assumed that the chroma DC component is also present and an inverse chroma Hadamard transform is mandated. This same approach is used where only chroma DC coefficients are present because memory allocation is performed based on CSBP. Table 1 below shows four exemplary initial expansions from CBP to CSBP in accordance with the aforementioned rules.
It should be noted that the CBP can include most combinations of six-bits, and that combination of six bits is initially expanded in accordance with the rules set forth above. The initially expanded CSBP is read from left to right. Where a zero is encountered in reading the CSBP, the corresponding block of the macro block is skipped during the decoding process. As a zero in the CBP is expanded to form four consecutive zeros in the CSBP, each zero in the CSBP will be encountered in a group of four zeros. As one example, where CBP is equal to six or ‘000110’, the last four blocks of Luma are marked as not to be decoded. Further, these blocks as well as all other blocks that contain all zero coefficients are not stored in memory. As will be appreciated from the disclosure provided above, an NC calculation is not needed for blocks that are marked as zero.
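The initial expansion described above, in which each CBP bit is replicated into four consecutive CSBP bits, can be sketched as follows. The function name `expand_cbp` is hypothetical, and only the initial expansion is shown; the chroma refinement is not modeled.

```python
def expand_cbp(cbp: int) -> int:
    """Expand a six-bit CBP into a twenty-four bit CSBP (initial expansion only)."""
    if not 0 <= cbp < 64:
        raise ValueError("CBP must fit in six bits")
    csbp = 0
    # Walk the CBP from its most significant bit so that bit order is preserved.
    for bit_pos in range(5, -1, -1):
        bit = (cbp >> bit_pos) & 1
        # Each CBP bit becomes four identical CSBP bits.
        csbp = (csbp << 4) | (0xF if bit else 0x0)
    return csbp
```

With CBP equal to six (‘000110’), the sketch yields ‘0000 0000 0000 1111 1111 0000’, so every zero in the CSBP appears in a group of four as noted above.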
Based on the preceding information, the NC calculation is performed (block 740). The NC calculation involves initializing an index for the left neighbor (i.e., IndexNL) and an index for the top neighbor (i.e., IndexNT) (block 743). These indexes are derived from a counter (i.e., Count) that is used to control processing location within the macro block. In particular, IndexNL and IndexNT are derived as follows based on the counter that varies between 0 and 23 and includes at least four least significant bits (i.e., bit3, bit2, bit1, bit0). Luma blocks are indicated by a count between 0 and 15, and Chroma blocks are indicated by a count between 16 and 23. For Luma blocks, IndexNL equals (bit3, bit1) and IndexNT equals (bit2, bit0). For Chroma blocks, IndexNL equals bit1 and IndexNT equals bit0. Thus, for example, when Count equals 13 (binary representation of ‘1101’), IndexNL equals ‘10’ and IndexNT equals ‘11’ (each represented in binary). This extraction of bits to get IndexNL and IndexNT from the counter can be performed efficiently using instructions available in a typical digital signal processor.
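The bit extraction described above can be sketched as follows. The function name `neighbor_indices` is hypothetical; on a digital signal processor the same result would typically be obtained with bit-field extract instructions rather than the shifts and masks shown here.

```python
from typing import Tuple


def neighbor_indices(count: int) -> Tuple[int, int]:
    """Return (IndexNL, IndexNT) for a block counter in the range 0..23."""
    if not 0 <= count <= 23:
        raise ValueError("Count must be between 0 and 23")
    bit0 = count & 1
    bit1 = (count >> 1) & 1
    bit2 = (count >> 2) & 1
    bit3 = (count >> 3) & 1
    if count < 16:
        # Luma: IndexNL = (bit3, bit1), IndexNT = (bit2, bit0).
        return (bit3 << 1) | bit1, (bit2 << 1) | bit0
    # Chroma: IndexNL = bit1, IndexNT = bit0.
    return bit1, bit0
```

For Count equal to 13 this returns IndexNL equal to 2 (‘10’) and IndexNT equal to 3 (‘11’), matching the example in the text.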
Table 2 below shows the various values of IndexNL and IndexNT for the blocks shown in
A function, LBDetect(1, CSBP), is called that returns a count of how many contiguous zero coefficients are recorded in the left most portion of the CSBP data. In other words, LBDetect detects the first occurrence of a ‘1’ from the left most side of the CSBP. This number is recorded as LBDetectCnt. Avail_Info is then updated by shifting to the left by an amount equal to the number of contiguous zeros, LBDetectCnt. Thus, Avail_Info is shifted to the left such that the least significant bit (i.e., the farthest right bit) corresponds to the next block with a potentially non-zero coefficient that is marked as a ‘1’ in the CSBP. Avail_Info is then masked with a ‘1’ and that value is stored as Avail_Bit which will have a value of either one or zero depending upon the masked bit. As will be appreciated from reading the aforementioned approach, blocks that are marked as ‘0’ in the CSBP are skipped without using a branch based algorithm. This avoids calculation of NC for such blocks, and makes the algorithm more suited for a parallel implementation.
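The skipping behavior of LBDetect can be sketched as below. The sketch assumes a twenty-four bit CSBP whose left most bit corresponds to the first block; an implementation using the opposite bit order would shift in the other direction. The names `lb_detect` and `coded_block_positions` are hypothetical, and a DSP would normally provide the leading-bit count as a single instruction rather than a loop.

```python
WIDTH = 24  # assumed CSBP width in bits


def lb_detect(csbp: int, width: int = WIDTH) -> int:
    """Count contiguous zero bits starting from the left most (MSB) side."""
    count = 0
    for pos in range(width - 1, -1, -1):
        if (csbp >> pos) & 1:
            break
        count += 1
    return count


def coded_block_positions(csbp: int, width: int = WIDTH) -> list:
    """List the positions (0 = left most bit) of blocks marked '1' in the CSBP."""
    mask = (1 << width) - 1
    positions = []
    consumed = 0
    while csbp & mask:
        skip = lb_detect(csbp, width)   # zero-marked blocks to jump over
        consumed += skip
        positions.append(consumed)      # this block needs an NC calculation
        # Discard the skipped zeros plus the '1' just handled.
        csbp = (csbp << (skip + 1)) & mask
        consumed += 1
    return positions
```

Blocks marked ‘0’ are never visited individually: each call jumps directly to the next block marked ‘1’.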
Using this information, a parallel tailored NC equation can be used to calculate NC (block 749). This parallel equation eliminates the branching associated with the NC calculation described in relation to
NC=(NL_Arr[IndexNL]+NT_Arr[IndexNT]+Avail_Bit)>>Avail_Bit
A couple of concrete examples are now provided to demonstrate the previously discussed algorithm. First, the condition where both the NL and NT are available is considered. In such a case, NL_Arr[0..3] and NT_Arr[0..3] have been filled with the appropriate non-zero information from the neighboring blocks and Avail_Bit is equal to one. Further, assume that the luma block under consideration is 14 (i.e., Count=13) as shown in
NC=(NL_Arr[2]+NT_Arr[3]+1)>>1.
This equation is equivalent to the standard NC equation where both NL and NT are available as described above. As another example, assume NL is available and NT is not available. In such a case, NL_Arr[0..3] has been filled with the appropriate non-zero information from the neighboring block and NT_Arr[0..3]=‘0000’, and Avail_Bit is equal to zero. Further, assume that the luma block under consideration is 1 (i.e., Count=0) as shown in
NC=NL_Arr[0]
Again, this is equivalent to the standard NC equation where NL is available, and NT is not available as described above. Similarly, where we assume NT is available and NL is not available and all other conditions remain the same, the aforementioned parallel tailored NC equation reduces to:
NC=NT_Arr[0]
Again, this is equivalent to the standard NC equation where NT is available, and NL is not available as described above. Similarly, where we assume neither NT nor NL is available and all other conditions remain the same, the aforementioned parallel tailored NC equation reduces to:
NC=0
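The four cases worked through above can be checked with a short sketch of the parallel tailored NC equation. The function name `nc_parallel` is hypothetical; the sketch relies on the convention stated in the text that the array for an unavailable neighbor holds zeros and that Avail_Bit is one only when both neighbors are available.

```python
def nc_parallel(nl_arr, nt_arr, index_nl, index_nt, avail_bit):
    """Branch-free NC: (NL_Arr[IndexNL] + NT_Arr[IndexNT] + Avail_Bit) >> Avail_Bit."""
    return (nl_arr[index_nl] + nt_arr[index_nt] + avail_bit) >> avail_bit
```

With Avail_Bit equal to one the expression is the rounded average; with Avail_Bit equal to zero the shift does nothing and the zero-filled array contributes nothing, so the result is whichever neighbor value is present, or zero when neither is.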
The calculated NC value is then used to decode the Coeff_Token and processing is completed for the current block (block 750). In particular, after calculating NC, it can be used to select the appropriate look-up table (from one of four look-up tables as per specification in H.264 standard) as set forth in Table 3 below.
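A minimal sketch of the table selection step follows. The NC ranges used (0 to 1, 2 to 3, 4 to 7, and 8 or more) are the ranges commonly given for H.264 CAVLC and are stated here as an assumption; the table in the specification governs. The function name and the numbering of the tables from 0 to 3 are hypothetical.

```python
def select_coeff_token_table(nc: int) -> int:
    """Return an index 0..3 identifying which Coeff_Token look-up table to use."""
    if nc < 0:
        raise ValueError("NC must be non-negative in this sketch")
    if nc < 2:
        return 0   # first variable length table
    if nc < 4:
        return 1   # second variable length table
    if nc < 8:
        return 2   # third variable length table
    return 3       # fixed length table for large NC
```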
Further, in some embodiments of the present invention, the CSBP is further refined based on information achieved during the decoding process. In particular, where the decoded Coeff_Token indicates that the decoded block has at least one non-zero coefficient, the bit in the CSBP corresponding to the decoded block is left as a ‘1’. Alternatively, where the decoded Coeff_Token indicates that the decoded block does not have any non-zero coefficients, the bit in the CSBP corresponding to the decoded block is changed to a zero. Thus, a zero in the CSBP avoids wasting processing time decoding blocks that are known to be all zeros as they are marked with zeros. Further, a sub-block that is found to have all zero coefficients is marked as such precluding any further decoding on the sub-block. In some embodiments of the present invention, this refined CSBP can be used to increase memory utilization related to the storage of decoded coefficients. In particular, a loop responsible for reconstructing the original block may make use of the refined CSBP to limit performance of an inverse transform and/or error addition to only blocks with non-zero coefficients. Further, there is no need to allocate memory to a block that does not include non-zero coefficients.
In some embodiments of the present invention, the memory area saved by not allocating memory for blocks that do not have any non-zero coefficients is utilized for storing predictor blocks from reference regions. The unused memory space may be designated as a reference region that is grown from the opposite end as the coefficient region. This approach dynamically and optimally allocates memory for a variable number of macro blocks within a fixed memory space.
After processing of block 750 is complete, the NT_Arr and NL_Arr are updated with the non-zero coefficient count of the current block (block 753). The aforementioned process (blocks 740 through 753) is repeated for each block within Current MB. This includes determining whether the counter has incremented to twenty-four (block 760). Where Count is less than twenty-four (block 760), Count is incremented, and the coded sub-block pattern is shifted to the right by an amount equal to the LBDetectCnt plus one (block 770). After this, the processes of blocks 740 through 753 are repeated. Alternatively, where the count has increased to twenty-four, the process is completed (block 780).
Returning to
The level (i.e., sign and magnitude) of each of the remaining non-zero coefficients in the block is encoded in reverse order starting with the highest frequency coefficient and working backward to the DC coefficient (block 230). Another set of look-up tables is used to encode the levels depending on the magnitude of each successive coded level. There are seven level look-up tables that can be accessed: Level0 to Level6. The choice of look-up table is adapted by first initializing the table selection to Level0, unless there are more than ten non-zero coefficients and fewer than three T1s, in which case the table selection is initialized to Level1. Next, the highest frequency non-zero coefficient is encoded. Where the magnitude of the preceding non-zero coefficient is larger than a defined threshold, the level is incremented (e.g., from Level0 to Level1). The following Table 4 shows some exemplary threshold levels associated with incrementing the table selection:
Continuing with flow diagram 200, the total number of zeros before the last non-zero coefficient is encoded (block 240). The total number of zeros is the sum of all zeros preceding the highest non-zero coefficient in the reordered block. This is encoded using look-up tables. Next, runs of zeros are encoded (block 250). The number of zeros preceding each non-zero coefficient is commonly referred to as a “run before”. The run before values are coded in reverse order from the high frequency coefficients to the DC coefficient. There are two notable exceptions in run before processing. First, where the number of zeros that remain for processing is zero, run before coding is stopped. Second, it is not necessary to encode the run before occurring before the lowest frequency non-zero coefficient. The look-up table used to encode run before values is chosen based on the number of zeros that have not yet been encoded, and the run before value.
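The symbols described in blocks 220 through 250 can be derived from a reordered block as sketched below. The function name `analyze_block`, the returned dictionary keys, and the sample array in the usage note are hypothetical and chosen only for illustration; the sketch computes the symbol values and does not perform the table look-ups that produce the actual bit codes.

```python
def analyze_block(zigzag):
    """Derive CAVLC symbol values from a one dimensional, zigzag-ordered block."""
    nz = [i for i, c in enumerate(zigzag) if c != 0]   # positions of non-zeros
    total_coeffs = len(nz)
    if total_coeffs == 0:
        return {"total_coeffs": 0, "t1s": 0, "t1_signs": "", "levels": [],
                "total_zeros": 0, "run_befores": []}
    # Trailing ones: up to three +/-1 values at the high frequency end.
    t1s = 0
    for pos in reversed(nz):
        if abs(zigzag[pos]) == 1 and t1s < 3:
            t1s += 1
        else:
            break
    t1_positions = nz[total_coeffs - t1s:]
    # Sign bits from highest frequency down: '0' for +1, '1' for -1.
    t1_signs = "".join("0" if zigzag[p] > 0 else "1" for p in reversed(t1_positions))
    # Remaining levels in reverse order (highest frequency first).
    levels = [zigzag[p] for p in reversed(nz[:total_coeffs - t1s])]
    # Zeros preceding the highest frequency non-zero coefficient.
    total_zeros = nz[-1] + 1 - total_coeffs
    # Run before values in reverse order; stop when no zeros remain and
    # skip the lowest frequency non-zero coefficient.
    run_befores = []
    zeros_left = total_zeros
    for k in range(total_coeffs - 1, 0, -1):
        if zeros_left == 0:
            break
        run = nz[k] - nz[k - 1] - 1
        run_befores.append((zeros_left, run))   # (zeros left, run before)
        zeros_left -= run
    return {"total_coeffs": total_coeffs, "t1s": t1s, "t1_signs": t1_signs,
            "levels": levels, "total_zeros": total_zeros,
            "run_befores": run_befores}
```

As a usage example, the hypothetical array `[7, 0, 0, -2, 0, 0, 0, 8, 0, -1, 0, 1, 0, 0, 0, 0]` has five non-zero coefficients, two T1s, and seven total zeros, and its run before values are coded against zeros left counts of 7, 6, 5 and 2.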
The following example further illustrates the CAVLC encoding process where it is assumed that the value of Coeff_Token is 1, table Num0 is selected for encoding, and the following 4×4 partition is to be encoded:
The 4×4 partition is reordered using the aforementioned zigzag pattern from lower frequency coefficients to higher frequency coefficients to yield the following one dimensional array:
In this case, the number of T1s is two, the number of non-zero coefficients is five, and the total zeros is seven. This information is used to encode Coeff_Token from a table available in the previously mentioned H.264 specification. For purposes of this discussion, we will assume that the encoded Coeff_Token from the table is ‘[COEFF]’. Next, the T1s are encoded from the highest frequency to the lowest frequency. Thus, the code representing the two T1s is ‘[01]’. Next, level encoding is performed using the tables from the H.264 specification for the three levels that are to be represented. For the purposes of this discussion it is assumed that the following encoded level information is provided from the tables ‘[LEVEL(8)], [LEVEL(−2)], [LEVEL(7)]’. Next, the total number of zeros is encoded using a look-up table from the H.264 specification. For the purposes of this discussion, it is assumed that the total number of zeros is encoded to be ‘TOTAL ZEROS’. There are also a total of four run before values that are to be encoded. For the purposes of this description, the four run before values are encoded as follows: ‘[ZEROS LEFT 7, RUN BEFORE 1]; [ZEROS LEFT 6, RUN BEFORE 1]; [ZEROS LEFT 5, RUN BEFORE 3]; [ZEROS LEFT 2, RUN BEFORE 2]’. Thus, the following encoded bit stream is transmitted:
As will be appreciated by one of ordinary skill in the art based on the preceding disclosure, in encoding a run before value, there is a dependency on the previous run before value since table selection is a function of zeros left at a given point. Similarly, in decoding run before information, the appropriate look-up table is selected depending on the zeros left at a given point in time. Thus, decoding successive run before values involves a data dependency where the number of zeros left is updated only after completion of the preceding run before. The aforementioned data dependency inherently limits parallelism and reduces the effectiveness of a VLIW architecture. Such a conventional decoding mechanism is illustrated using the following simplified pseudo code provided in Table 5 below:
Following the pseudo-code in Table 5, at part (A) a loop statement indicates that the loop will be repeated as long as there are both some zeros and some coefficients left in the encoded bit stream. Before the loop begins, the zeros left is initialized to the total number of zeros, and the coefficient position is initialized. It should be noted that the pseudocode assumes that there are a maximum of six zeros left, and hence only three bits are read from the encoded bit stream. In the rare case where there are more than six zeros left, it may be handled in a separate decoding function. For each pass through the loop controlled by part (A), parts (B), (C), and (D) are performed. In part (B), run before data is extracted from the run before look up table using information from the incoming encoded bit stream. The run before look up table (i.e., RunBeforeTable) is composed of a number of sub-tables of size TABLESIZE that each correspond to a particular number of zeros left to be decoded. Extracting the run before data includes creating a table index which is the number of zeros left multiplied by TABLESIZE, plus an offset into the sub-table. The offset is found in the three most significant bits of a thirty-two bit word (BitStreamWord) read from the encoded bit stream. Again, this offset is used for lookup into the table. To get these bits, the BitStreamWord is shifted right by twenty-nine bits.
In part (C), the run before value is masked out of the run before data retrieved from the look up table. The run before data contains packed information containing run before value and number of bits to flush. The number of bits allocated to each of the fields will depend on the design of the look-up table. For example, we use four bits each to represent run before value and number of bits to flush. As we pack the run before value in the four least significant bits of the run before data, a four bit mask, 0xF, is used. In addition, the number of bits to flush, BitFlushCnt, out of the received encoded bit stream is accessed by shifting the run before data to the right by four bits. In part (D), the number of zeros left to be decoded and the coefficient position are updated by subtracting from each the run before value.
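The conventional loop described in parts (A) through (D) can be sketched as below. Several details are assumptions of the sketch rather than the specification's pseudo-code: the run before codes in `RUN_BEFORE_CODES` are written as understood from the H.264 run before table for one to six zeros left and should be verified against the standard, the loop tracks a count of coefficients left rather than a coefficient position, and the bit stream is modeled as a single thirty-two bit integer.

```python
TABLESIZE = 8  # 2**3 entries per sub-table, one per three-bit offset

# Assumed codes: {zeros_left: {code bits: run before}}; verify against H.264.
RUN_BEFORE_CODES = {
    1: {"1": 0, "0": 1},
    2: {"1": 0, "01": 1, "00": 2},
    3: {"11": 0, "10": 1, "01": 2, "00": 3},
    4: {"11": 0, "10": 1, "01": 2, "001": 3, "000": 4},
    5: {"11": 0, "10": 1, "011": 2, "010": 3, "001": 4, "000": 5},
    6: {"11": 0, "000": 1, "001": 2, "011": 3, "010": 4, "101": 5, "100": 6},
}


def build_run_before_table():
    """Pack (bits to flush << 4) | run before for every three-bit offset."""
    table = [0] * (7 * TABLESIZE)
    for zeros_left, codes in RUN_BEFORE_CODES.items():
        for code, run in codes.items():
            nbits = len(code)
            first = int(code, 2) << (3 - nbits)
            # A short code owns every offset that begins with its bits.
            for offset in range(first, first + (1 << (3 - nbits))):
                table[zeros_left * TABLESIZE + offset] = (nbits << 4) | run
    return table


RUN_BEFORE_TABLE = build_run_before_table()


def decode_run_befores(bit_stream_word, total_zeros, total_coeffs):
    """Decode run before values one at a time (at most six zeros left)."""
    zeros_left = total_zeros
    coeffs_left = total_coeffs - 1   # lowest frequency coefficient has no run
    runs = []
    while zeros_left > 0 and coeffs_left > 0:                     # part (A)
        offset = (bit_stream_word >> 29) & 0x7                    # top 3 bits
        data = RUN_BEFORE_TABLE[zeros_left * TABLESIZE + offset]  # part (B)
        run_before = data & 0xF                                   # part (C)
        bit_flush_cnt = data >> 4
        bit_stream_word = (bit_stream_word << bit_flush_cnt) & 0xFFFFFFFF
        zeros_left -= run_before                                  # part (D)
        coeffs_left -= 1
        runs.append(run_before)
    return runs
```

Each iteration needs the updated zeros left value from the previous iteration before it can form its table index, which is the data dependency discussed above.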
Some embodiments of the present invention provide a novel approach for decoding run before values such that data dependencies are reduced, and a corresponding increase in parallelism is achieved. Such embodiments provide for decoding two or more run before values in a single table look-up using a modified run before table. For purposes of discussion, the approach is described where two run before values are simultaneously accessed using a modified run before table structure as depicted in
Run before table structure 800 includes a fixed number of bits (i.e., ‘N’) that are read from the bit stream from which either one or two run before values are decoded. The first run before value, RB1, is always valid and is a function of ZerosLeft used in selecting an appropriate sub-table 815, 820, 825, 830, 835, 380. In contrast, the value of ZerosLeft for RB2 is immediately calculated using the equation ZerosLeft=ZerosLeft−RB1. This value can be calculated before a table look-up involving the RB1 data is completed, and thus can be concurrently used as an index into run before table structure 800 to access the run before value associated with RB2. This reduction in data dependency offers a corresponding increase in parallelism. Whether RB2 is valid is determined by the total number of bits required to decode the combination of RB1 and RB2. If the number of bits required to decode is greater than ‘N’, then only RB1 is valid and CNT should be set to one. It may be possible that a particular table has a valid value for RB2, but that it is not utilized for a look-up because there are no coefficients left after the RB1 decode.
Table 6 below provides pseudo-code representing an exemplary run before decode utilizing run before table structure 800 in accordance with some embodiments of the present invention where ‘N’ equals eight.
Following the pseudo-code in Table 4, at part (A) a loop statement indicates that the loop will be repeated as long as at least one coefficient remains to be decoded from the encoded bit stream. Similar to that of Table 3, it should be noted that the pseudo-code assumes that there are a maximum of six zeros left, and hence only three bits are read from the encoded bit stream. The rare case where there are more than six zeros left may be handled in a separate decoding function. Before the loop begins, the number of zeros left is initialized to the total number of zeros, and the coefficient position is initialized. For each pass through the loop controlled by part (A), parts (B), (C), (D), (E) and (F) are performed. In part (B), run before data is extracted from the run before look-up tables using information from the incoming encoded bit stream. The run before look-up tables (i.e., RunBeforeTable) comprise a number of sub-tables of size RBSIZE that each correspond to a particular number of zeros left to be decoded. Extracting the run before data includes creating a table index which is the number of zeros left multiplied by RBSIZE, plus an offset into the sub-table. The offset is found in the eight most significant bits of a thirty-two bit word (BitStreamWord) read from the encoded bit stream. To get these bits, the BitStreamWord is shifted right by twenty-four bits.
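The index calculation of part (B) can be sketched as follows. This is a minimal illustration in Python; the value of 256 for RBSIZE is an assumption (one entry per eight-bit pattern), and the function name run_before_table_index is hypothetical.

```python
RBSIZE = 256  # assumed sub-table size: one entry for each eight-bit offset


def run_before_table_index(zeros_left, bit_stream_word):
    """Index into RunBeforeTable for the current number of zeros left.

    The offset into the sub-table is the eight most significant bits
    of the thirty-two bit word read from the encoded bit stream.
    """
    offset = (bit_stream_word & 0xFFFFFFFF) >> 24
    return zeros_left * RBSIZE + offset
```

For example, with three zeros left and a BitStreamWord of 0xA5000000, the index is 3 × 256 + 0xA5 = 933.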
In part (C), the two run before values, the two bits to flush values, and the CNT value are masked out of the run before data retrieved from the look-up table. The masking is as shown in the pseudo-code and serves to extract the relevant data as depicted in
At part (E), a conditional statement checks that at least one coefficient remains to be decoded and that the CNT value indicates that two run before values were included in the run before data retrieved from the memory access of part (B). Where such is the case, the second run before value is accepted, and the various pointers are updated. In particular, in part (F), the bits to flush value is set equal to BF2, the number of zeros left is decremented by the second run before value, and the number of coefficients left is decremented.
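One pass through parts (C) to (F) can be sketched as follows. This is a minimal illustration in Python, not the pseudo-code itself. The bit layout of the packed entry (RB1 in bits 0-3, BF1 in bits 4-7, RB2 in bits 8-11, BF2 in bits 12-15, and a CNT flag in bit 16) is an assumption made for this example, as are the handling of the first value and the names pack_entry and decode_step.

```python
def pack_entry(rb1, bf1, rb2=0, bf2=0, cnt=1):
    """Build a packed entry using the layout assumed for this sketch."""
    return rb1 | (bf1 << 4) | (rb2 << 8) | (bf2 << 12) | ((cnt - 1) << 16)


def decode_step(entry, zeros_left, coeffs_left):
    """Decode one or two run before values from a single table entry."""
    # Part (C): mask the individual fields out of the packed entry.
    rb1 = entry & 0xF
    bf1 = (entry >> 4) & 0xF
    rb2 = (entry >> 8) & 0xF
    bf2 = (entry >> 12) & 0xF
    cnt = ((entry >> 16) & 0x1) + 1

    # Assumed handling of the first value, which is always valid.
    runs = [rb1]
    bits_to_flush = bf1
    zeros_left -= rb1
    coeffs_left -= 1

    # Part (E): accept the second value only if a coefficient remains
    # and the entry actually holds two run before values.
    if coeffs_left > 0 and cnt == 2:
        # Part (F): BF2 replaces the bits to flush value.
        runs.append(rb2)
        bits_to_flush = bf2
        zeros_left -= rb2
        coeffs_left -= 1

    return runs, bits_to_flush, zeros_left, coeffs_left
```

In this sketch BF2 is taken to be the number of bits consumed by both codes together, which is why it replaces BF1 rather than being added to it.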
Using the preceding approach, up to two run before values are decoded in a single iteration. This leads to better parallelization and software pipelining. In some cases, the parallelization leads to an approximate doubling in performance compared with the single run before decode. Again, it should be noted that the approach could be expanded to allow for decoding of three or more run before values for each memory access. This would require additional memory allocation for the run before table to hold the additional run before values, bit flush values, and count bits.
In standard processing, quantization is performed on the encoder side before entropy encoding as shown by quantization block 130 and entropy encoding block 140 of
Some embodiments of the present invention provide for integrating run before value processing with inverse quantization. Such an approach avoids the aforementioned memory loads. This is appropriate where the levels are coded separately from the run before values, and the position of the levels cannot be determined within the level decoding loop. Such integration also avoids inverse quantizing zero coefficients, and avoids the extra clock cycles that would otherwise be wasted loading coefficient values. It should be noted that the refined CSBP indicates to a fine level which blocks do not include any non-zero coefficients. Thus, the refined CSBP may be incorporated into the inverse quantization process to avoid performing inverse quantization on blocks that do not include any non-zero coefficients.
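The integration of run before processing with inverse quantization can be sketched as follows. This is a minimal illustration in Python under simplifying assumptions: a single uniform scale factor stands in for the actual inverse quantization, coefficients are stored in scan order, and the levels are supplied highest frequency first. The function name place_and_dequantize and its parameters are hypothetical.

```python
def place_and_dequantize(levels, run_befores, start_pos, scale, block_size=16):
    """Scale each non-zero level as its position becomes known.

    levels      -- decoded levels, highest-frequency coefficient first
    run_befores -- zeros preceding each level in scan order
    start_pos   -- scan position of the highest-frequency coefficient
    scale       -- uniform scale factor (a simplification for this sketch)
    """
    block = [0] * block_size          # zero coefficients are never touched
    pos = start_pos
    for level, run in zip(levels, run_befores):
        block[pos] = level * scale    # inverse quantize in place, no reload
        pos -= run + 1                # skip the zeros that precede this level
    return block
```

Because only the non-zero coefficients are written, no multiplication is spent on zero coefficients and no separate pass is needed to load the coefficient values again.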
In conclusion, the present invention provides novel systems, methods and arrangements for encoding and/or decoding video information. While detailed descriptions of one or more embodiments of the invention have been given above, various alternatives, modifications, and equivalents will be apparent to those skilled in the art without varying from the spirit of the invention. Therefore, the above description should not be taken as limiting the scope of the invention, which is defined by the appended claims.