The present disclosure is generally related to microprocessor instructions.
Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless computing devices, such as portable wireless telephones, personal digital assistants (PDAs), and paging devices that are small, lightweight, and easily carried by users. More specifically, portable wireless telephones, such as cellular telephones and internet protocol (IP) telephones, can communicate voice and data packets over wireless networks. Further, many such wireless telephones include other types of devices that are incorporated therein. For example, a wireless telephone can also include a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such wireless telephones can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. Wireless telephones can also include video download and video playback capabilities. As such, these wireless telephones can include significant computing capabilities.
To achieve efficient data transfer, a video bitstream representing a video file may be encoded during transmission to computing devices such as wireless telephones. The video bitstream may also be stored in compressed fashion at the computing devices in order to achieve more efficient utilization of storage space. When the video file is played at a computing device, the computing device may decode the encoded video bitstream. As video encoding methods become more complex, video decoding becomes an increasingly complex computational problem. Further, although parallel processing techniques have improved the speed at which computing devices can perform certain tasks, video decoding may not be significantly improved by parallel processing due to its serial nature (i.e., the ability to decode a particular bit depends on successfully decoding one or more of the preceding bits).
A dedicated arithmetic decoding instruction and logic to execute the dedicated arithmetic decoding instruction are disclosed. The dedicated arithmetic decoding instruction may reduce the amount of processor time needed to decode an arithmetically encoded video stream. A processor may execute the dedicated arithmetic decoding instruction via computational logic. The computational logic may enable the processor to execute, via a single instruction, a decoding algorithm that would otherwise require several general purpose instructions.
In a particular embodiment, an apparatus is disclosed that includes a memory and a processor coupled to the memory. The processor is configured to execute general purpose instructions. The processor is also configured to execute a dedicated arithmetic decoding instruction retrieved from the memory.
In another particular embodiment, a method is disclosed that includes executing a dedicated context adaptive binary arithmetic coding (CABAC) decoding instruction during a first execution cycle of a processor. The dedicated CABAC decoding instruction accepts as input a first range, a first offset, and a first state. The method also includes storing a second state based on one or more outputs of the dedicated CABAC decoding instruction during a second execution cycle of the processor. The method further includes realigning the first range based on the one or more outputs of the dedicated CABAC decoding instruction to produce a second range during the second execution cycle of the processor. The method includes realigning the first offset based on the one or more outputs of the dedicated CABAC decoding instruction to produce a second offset during the second execution cycle of the processor.
In yet another particular embodiment, an apparatus is disclosed that includes a memory and a processor coupled to the memory. The processor includes means for executing general purpose instructions and means for executing a dedicated arithmetic decoding instruction.
One particular advantage provided by at least one of the disclosed embodiments is the ability to program and execute a dedicated arithmetic decoding instruction at a microprocessor. Dedicated arithmetic decoding instructions may reduce the number of processor execution cycles taken to decode an entropy-encoded video bitstream (e.g., an H.264 CABAC video bitstream).
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.
Referring to
The processor 110 includes general purpose instruction execution logic 112 configured to execute general purpose instructions. General purpose instructions may include commonly executed processor instructions, such as LOADs, STOREs, and JUMPs. The general purpose instruction execution logic 112 may include general purpose load-store logic to execute the general purpose instructions. The processor 110 also includes dedicated arithmetic decoding instruction execution logic 114 configured to execute a dedicated arithmetic decoding instruction. The dedicated arithmetic decoding instruction may be executable by the processor 110 to decode a video stream encoded in an entropy coding scheme, such as the context adaptive binary arithmetic coding (CABAC) scheme. In a particular embodiment, the dedicated arithmetic decoding instruction may be used in decoding a video stream that is CABAC-encoded in accordance with the two-hundred and sixty-fourth audiovisual and multimedia systems standard promulgated by the International Telecommunication Union (H.264, entitled “Advanced video coding for generic audiovisual services”).
In a particular embodiment, the general purpose instructions and the dedicated arithmetic decoding instruction are executed by a common execution unit of the processor 110. For example, the common execution unit may include both the general purpose instruction execution logic 112 and the dedicated arithmetic decoding instruction execution logic 114. In another particular embodiment, the dedicated arithmetic decoding instruction is an atomic instruction that is executable by the processor 110 without separating the dedicated arithmetic decoding instruction into one or more general purpose instructions to be executed by the general purpose instruction execution logic 112. The dedicated arithmetic decoding instruction may be a single instruction of an instruction set of the processor 110 and may be executable in a small number of cycles (e.g., fewer than three execution cycles) of the processor 110. In a particular embodiment, the processor 110 is a pipelined multi-threaded very long instruction word (VLIW) processor.
The memory 120 may include random access memory (RAM), read only memory (ROM), register memory, or any combination thereof. Although the memory 120 is illustrated in
In operation, the processor 110 may be used in decoding an encoded video stream. While decoding a particular bit of the video stream, the processor 110 may retrieve a dedicated arithmetic decoding instruction from the memory 120 and the logic 114 may execute the retrieved instruction.
It will be appreciated that the system 100 of
CABAC is a form of binary arithmetic coding. Generally, binary arithmetic coding may be characterized by two quantities: a current interval “range” and a current “offset” in the current interval range. To decode a particular CABAC-encoded bit, the current range is first subdivided into two portions based on the probability of a least probable symbol (LPS) and a most probable symbol (MPS). For example, the LPS may be a one symbol, the MPS may be a zero symbol, and the current range may be the range between zero and one. Generally, if R is the width of the current range, rLPS is the width of the first portion, rMPS is the width of the second portion, pLPS is the probability of encountering the least probable symbol, and pMPS is the probability of encountering the most probable symbol, then rLPS=R×pLPS and rMPS=R×pMPS=R−rLPS. Thus, when the probability pLPS of the least probable symbol is higher than the probability pMPS of the most probable symbol, the portion corresponding to the least probable symbol will have a larger width rLPS than the width rMPS of the portion corresponding to the most probable symbol. That is, when pLPS>pMPS, rLPS>rMPS. Similarly, when pMPS>pLPS, rMPS>rLPS. Depending on whether the current offset occurs within rLPS or rMPS, the values of rLPS and rMPS are iteratively updated during decoding of the video stream.
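The subdivision described above can be sketched in C as follows. This is a minimal floating-point illustration, not the fixed-point arithmetic of an actual decoder. The names `Subdivision`, `subdivide`, and `offset_selects_lps` are hypothetical, and the placement of the MPS portion in the lower part of the interval is an assumption made for the sketch.

```c
#include <assert.h>

/* Widths of the two portions of the current interval (hypothetical type). */
typedef struct {
    double r_lps; /* width of the portion for the least probable symbol */
    double r_mps; /* width of the portion for the most probable symbol  */
} Subdivision;

/* Split an interval of width r using the LPS probability p_lps. */
static Subdivision subdivide(double r, double p_lps)
{
    assert(r > 0.0 && p_lps >= 0.0 && p_lps <= 1.0);
    Subdivision s;
    s.r_lps = r * p_lps;    /* rLPS = R x pLPS */
    s.r_mps = r - s.r_lps;  /* rMPS = R - rLPS */
    return s;
}

/* Returns 1 when the offset falls in the LPS portion, 0 otherwise.
 * Assumption: the MPS portion occupies [0, rMPS) and the LPS portion
 * occupies [rMPS, R). */
static int offset_selects_lps(Subdivision s, double offset)
{
    return offset >= s.r_mps;
}
```

With R equal to 1.0 and pLPS equal to 0.25, the sketch yields rLPS of 0.25 and rMPS of 0.75, so an offset of 0.5 selects the MPS portion and an offset of 0.8 selects the LPS portion.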
For example, rMPS may initially be equal to 0.50, and rLPS may initially be equal to 0.50. That is, the probability of encountering an MPS may initially be 50% and the probability of encountering an LPS may initially be 50%. If the current offset falls within rMPS (i.e., an MPS is encountered), rMPS may be increased and rLPS may be decreased. For example, rMPS may be increased to 0.75 and rLPS may be decreased to 0.25. As another example, rMPS may initially be equal to 0.875 and rLPS may initially be equal to 0.125. If the current offset falls within rLPS, rMPS may be decreased to 0.75 and rLPS may be increased to 0.25.
Decoding a video stream that is CABAC-encoded in accordance with H.264 may be a stateful operation. That is, decoding the video stream may require the maintenance of information (e.g., state, bit position, and MPS bit) other than the range and offset. For H.264, the range is a 9-bit quantity and the offset is an at least 9-bit quantity. The calculation of rLPS may be approximated by a 64×4 lookup table of 256 bytes that stores CABAC constants and that is indexed by range and state. Because the values in the lookup table are constants defined by the H.264 standard, the lookup table may be hard-coded. Alternatively, the lookup table may be programmable (e.g., rewriteable).
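As an illustration of how such a table may be indexed, the following C sketch selects one of the four constants for a given state. The function name `lookup_rlps` and the flat row-major layout are assumptions of the sketch. The column index uses the two bits immediately below the most significant bit of the normalized 9-bit range, which follows the H.264 decoding process. The table contents are supplied by the caller and are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

enum { CABAC_STATES = 64, CABAC_RANGE_CLASSES = 4 };

/* Look up the approximate rLPS for a 6-bit state and a normalized 9-bit
 * range. `table` is a caller-supplied array of 64 x 4 one-byte constants
 * stored row by row (256 bytes in total); its values are not defined here. */
static uint8_t lookup_rlps(const uint8_t *table, unsigned state, unsigned range)
{
    assert(state < CABAC_STATES);
    assert(range >= 256 && range < 512);   /* normalized: bit 8 is set */
    unsigned column = (range >> 6) & 3u;   /* bits 7:6 of the range    */
    return table[state * CABAC_RANGE_CLASSES + column];
}
```

Because only two bits of the range take part in the index, a 4-to-1 selection among the constants for the current state is sufficient.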
A dedicated CABAC decoding instruction may realign the range, realign the offset, and look up CABAC constants as described herein. Such a dedicated CABAC decoding instruction may accept as input CABAC state bits, a CABAC MPS bit, bit position (bitpos) bits, nine CABAC range bits, and at least nine CABAC offset bits. The dedicated CABAC decoding instruction may generate an output including new CABAC state bits, a new CABAC MPS bit, nine CABAC range bits, at least nine CABAC offset bits, and an output value bit representing the decoded bit of the video stream. In a particular embodiment, the decoding process is renormalized as necessary after each iteration such that the value of the MPS bit is always 1. For example, a dedicated CABAC decoding instruction may operate in accordance with the following pseudo-code:
It should be noted that although many of the equations and expressions as set forth herein use a syntax similar to the C or C++ programming language, the expressions are for illustrative purposes and may instead be expressed in other programming languages with different syntax.
The above pseudo-code may be encapsulated into a function DECBIN( ) and a decoded H.264 video bit may be produced in two processor cycles as follows:
The function DECBIN( ) may also be used without the speculative JUMPR:t R31 (i.e., jump to address in register 31) instruction as follows:
Referring to
The processor may store data generated during execution of the dedicated arithmetic decoding instruction in an output register pair 230 and an output predicate register 240. In a particular embodiment, the output register pair 230 is a pair of 32-bit registers.
For example, a first register Rtt.w0 211 of the first input register pair 210 may store an input state 201 and an input MPS bit 202. In a particular embodiment, bits zero to five of Rtt.w0 211, denoted Rtt.w0[0:5], store the input state 201 and Rtt.w0[8] stores the input MPS bit 202. A second register Rtt.w1 212 of the first input register pair 210 may store an input bitpos 203. For example, Rtt.w1[0:4] may store the input bitpos 203.
A first register Rss.w0 221 of the second input register pair 220 may store an input range 204. For example, Rss.w0[0:8] may store the nine bits of the input range 204. A second register Rss.w1 222 of the second input register pair 220 may store an input offset 205. In a particular embodiment, at least Rss.w1[0:8] stores the at least nine bits of the input offset 205.
A first register Rdd.w0 231 of the output register pair 230 may store an output state, an output MPS bit, and an output range. For example, Rdd.w0[0:5] may store the 6-bit output state, Rdd.w0[8] may store the output MPS bit, and Rdd.w0[23:31] may store the output range. A second register Rdd.w1 232 of the output register pair 230 may store an output offset 209 in a normalized fashion. An output value bit 250 of the dedicated CABAC decoding instruction may be stored in a predicate register 240. In a particular embodiment, the output value bit 250 stored in the predicate register 240 may be input into subsequent instructions (e.g., general purpose instructions or a subsequent dedicated CABAC decoding instruction) executed by the processor. For example, the output value bit 250 stored in the predicate register 240 may be used in a decision in the video decoding algorithm.
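The bit layouts described above may be illustrated with a few C helpers that pack and unpack 32-bit words. The helper names are hypothetical, and the sketch assumes that bit 0 denotes the least significant bit of each word. It models only the field positions stated above and is not a description of the register hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 6-bit state (bits 0:5) and an MPS bit (bit 8), as in Rtt.w0. */
static uint32_t pack_state_mps(uint32_t state, uint32_t mps)
{
    assert(state < 64 && mps < 2);
    return state | (mps << 8);
}

/* Pack state (bits 0:5), MPS (bit 8), and a 9-bit range (bits 23:31),
 * as in Rdd.w0. */
static uint32_t pack_output_w0(uint32_t state, uint32_t mps, uint32_t range)
{
    assert(state < 64 && mps < 2 && range < 512);
    return state | (mps << 8) | (range << 23);
}

/* Field extractors for the layouts above. */
static uint32_t unpack_state(uint32_t w0)        { return w0 & 0x3Fu; }
static uint32_t unpack_mps(uint32_t w0)          { return (w0 >> 8) & 1u; }
static uint32_t unpack_output_range(uint32_t w0) { return (w0 >> 23) & 0x1FFu; }
```

Keeping the state and the MPS bit at the same positions in the input word and the output word means that the low bits of one result may be reused as input state for a later decoding step without re-packing those two fields.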
It will be appreciated that a processor may “pack” the input data for a dedicated CABAC decoding instruction into just two input register pairs and may “pack” the output data for the dedicated CABAC decoding instruction into one output register pair and a predicate register. In a particular embodiment, the use of a dedicated CABAC decoding instruction may reduce the time taken to generate a decoded video stream bit from 7 processor execution cycles (using general purpose instructions) to 2 processor execution cycles. It should be noted that although the dedicated CABAC decoding instruction has been explained herein with reference to the H.264 video compression standard, the instruction may be used in decoding other arithmetically coded bitstreams. For example, the instruction may be used in decoding bitstreams encoded in accordance with the Joint Photographic Experts Group 2000 (JPEG2000) image compression standard. It should be noted that although
Referring to
The logic 300 may be divided into three execution stages: EX1 301, EX2 302, and EX3 303. In a particular embodiment, each execution stage corresponds to a particular execution pipeline stage of a pipelined processor. In a particular embodiment, the execution stages 301, 302, and 303 occur during a single execution cycle of the pipelined processor. During the first execution stage EX1 301, five input variables are retrieved: an old MPS value 310, an input state 320, an input offset 340, an input range 341, and an input bitpos 342. In a particular embodiment, the input variables 310, 320, 340, 341, and 342 are packed into input register pairs as described herein with reference to
The input state 320 is used as an index into a CABAC H.264 constants lookup table 322. Four CABAC constants 323 are produced as a result of the index operation and input into a 4-to-1 multiplexer 324 that outputs a selected CABAC constant 327. The index operation also produces a new LPS state constant 325 and a new MPS state constant 326, both of which are passed to EX2 302 along with the selected CABAC constant 327. The input state 320 is also applied to a zero comparator 321, and the resulting output from the zero comparator 321 passes from EX1 301 to EX2 302.
The input offset 340, the input range 341, and the input bitpos 342 are each applied to a shifter 343. The shifter 343 produces a shifted range 345 and a shifted offset 346 as output. Control bits 344 from the shifted range 345 are applied to the 4-to-1 multiplexer 324 as control bits. The shifted range 345 and the shifted offset 346 are also passed from EX1 301 to EX2 302.
During EX2 302, the old MPS value 310 is inverted by an inverter 311. The old MPS value 310 is also applied to a first 2-to-1 multiplexer 312 that is controlled by the output of the zero comparator 321. The output of the inverter 311 is also applied to the first 2-to-1 multiplexer 312. The old MPS value 310, the output of the inverter 311, and the output of the first 2-to-1 multiplexer 312 are passed from EX2 302 to EX3 303. The new LPS state constant 325, the new MPS state constant 326, and the selected CABAC constant 327 are also passed from EX2 302 to EX3 303.
The shifted range 345 is applied to a first 9-bit adder 347 that calculates rMPS 348 in accordance with the formula rMPS=Shifted Range−rLPS. rMPS 348 is then applied with the shifted offset 346 to a second 9-bit adder 349 that produces as output 350 the difference between the shifted offset 346 and rMPS 348. rMPS 348, the output 350 of the second 9-bit adder 349, and the shifted offset 346 are passed from EX2 302 to EX3 303. The second 9-bit adder 349 also generates a control bit 351 responsive to whether or not the output 350 of the second 9-bit adder 349 is less than zero. In a particular embodiment, the control bit 351 is generated by checking a sign bit of the output 350. The control bit 351 also passes from EX2 302 to EX3 303.
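The two subtractions and the sign check described above can be sketched in C as follows. The type `Ex2Result` and the function `ex2_arithmetic` are hypothetical names. The sketch uses 32-bit signed integers for clarity rather than modeling the 9-bit adders bit for bit, and it treats the selected CABAC constant as rLPS, which is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Results passed on to the next stage (hypothetical type). */
typedef struct {
    int32_t r_mps;    /* rMPS 348: shifted range minus rLPS           */
    int32_t diff;     /* output 350: shifted offset minus rMPS        */
    int     negative; /* control bit 351: 1 when diff is below zero   */
} Ex2Result;

/* Compute rMPS, the offset difference, and the sign-derived control bit. */
static Ex2Result ex2_arithmetic(int32_t shifted_range,
                                int32_t shifted_offset,
                                int32_t r_lps)
{
    assert(r_lps >= 0 && r_lps <= shifted_range);
    Ex2Result r;
    r.r_mps    = shifted_range - r_lps;     /* first adder  */
    r.diff     = shifted_offset - r.r_mps;  /* second adder */
    r.negative = r.diff < 0;                /* sign of the difference */
    return r;
}
```

A negative difference indicates that the offset lies below rMPS, which is the comparison a software decoder would otherwise perform with separate subtract and compare instructions.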
During EX3 303, the output of the first 2-to-1 multiplexer 312 and the old MPS value 310 are applied to a second 2-to-1 multiplexer 313 that outputs a new MPS value 315. The output of the inverter 311 and the old MPS value 310 are applied to a third 2-to-1 multiplexer 314 that outputs a predicate output value bit Pd 316.
The new LPS state constant 325 and the new MPS state constant 326 are input into a fourth 2-to-1 multiplexer 328 that outputs an output state 330. The selected CABAC constant 327 and rMPS 348 are input to a fifth 2-to-1 multiplexer 329 that outputs an output range 331.
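The selections made by the multiplexers for the MPS value, the predicate bit, the state, and the range may be sketched in C as follows. The names `Ex3Result` and `ex3_select` are hypothetical. The polarity of each selection is an assumption chosen to match the usual CABAC decision rule (an offset below rMPS decodes the most probable symbol), and the selected CABAC constant is again read as rLPS. The source text does not state the multiplexer polarities.

```c
#include <assert.h>

/* Values selected in the final stage (hypothetical type). */
typedef struct {
    unsigned new_mps; /* new MPS value 315               */
    unsigned pd;      /* predicate output value bit Pd 316 */
    unsigned state;   /* output state 330                */
    unsigned range;   /* output range 331                */
} Ex3Result;

/* offset_below_rmps plays the role of control bit 351 (1 = MPS path). */
static Ex3Result ex3_select(unsigned old_mps, unsigned state_is_zero,
                            unsigned new_lps_state, unsigned new_mps_state,
                            unsigned r_lps, unsigned r_mps,
                            unsigned offset_below_rmps)
{
    assert(old_mps < 2);
    unsigned inverted = old_mps ^ 1u;                       /* inverter 311 */
    unsigned lps_mps  = state_is_zero ? inverted : old_mps; /* first mux    */
    Ex3Result r;
    r.new_mps = offset_below_rmps ? old_mps : lps_mps;      /* second mux   */
    r.pd      = offset_below_rmps ? old_mps : inverted;     /* third mux    */
    r.state   = offset_below_rmps ? new_mps_state
                                  : new_lps_state;          /* fourth mux   */
    r.range   = offset_below_rmps ? r_mps : r_lps;          /* fifth mux    */
    return r;
}
```

Under these assumptions, the MPS value flips only when a least probable symbol is decoded while the state is zero, and the predicate bit is the old MPS value on the MPS path and its inverse on the LPS path.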
The output 350 of the second 9-bit adder 349 and the shifted offset 346 are applied to a sixth 2-to-1 multiplexer 352 that outputs a first partial output offset 353. The shifted offset 346 is stored as a second partial output offset 354. Each of the 2-to-1 multiplexers 313, 314, 328, 329, and 352 is controlled via the control bit 351. In an illustrative embodiment, the output variables 315, 330, 331, 353, and 354 are packed into an output register pair and the predicate output value bit Pd 316 is stored in a predicate register as described herein with reference to
It will be appreciated that because many processors include a shifter, the logic 300 of
Referring to
The method 400 includes executing a dedicated CABAC decoding instruction during a first execution cycle of a processor, at 402. The dedicated CABAC decoding instruction accepts as input a first range, a first offset, and a first state. For example, in
The method 400 also includes, based on one or more outputs of the CABAC decoding instruction, storing a second state, realigning the first range to produce a second range, and realigning the first offset to produce a second offset during a second execution cycle of the processor, at 404. For example, in
Referring to
The method 500 includes executing a dedicated CABAC decoding instruction during a first execution cycle of a processor, at 502. The processor may be a pipelined multi-threaded VLIW processor and the dedicated CABAC decoding instruction may be executed at a common execution unit of the processor without separating the dedicated CABAC decoding instruction into one or more general purpose instructions. The dedicated CABAC decoding instruction accepts as input a first range, a first offset, and a first state. The dedicated CABAC decoding instruction may be compliant with the H.264 video compression standard. For example, referring
The method 500 also includes, based on one or more outputs of executing the CABAC decoding instruction, storing a second state, realigning the first range to produce a second range, and realigning the first offset to produce a second offset during a second execution cycle of the processor, at 504. For example, referring to
The wireless device 600 includes a processor, such as a digital signal processor (DSP) 610, coupled to a memory 632. In an illustrative embodiment, the DSP 610 may include the processor 110 of
As illustrated in
It should be noted that although the particular embodiment illustrated in
Those of skill in the art would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magneto-resistive RAM (MRAM), spin torque tunnel MRAM (STT-MRAM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed embodiments is provided to enable a person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.