The present disclosure is generally related to apparatuses and methods for video decoding.
Internet streaming video is a popular application for users of both wired and wireless devices. To reduce bandwidth used by streaming video, video data is generally encoded to compress the video data. Encoding processes seek to compress the video data so as to provide satisfactory image quality without incurring undue decoding overhead at the user end. An objective of video encoding and decoding is to balance the ability to generate high quality video from low bit rate data against low computational complexity.
A popular coder/decoder (CODEC) system for Internet streaming video is the Google-On2 VP6 (VP6) video CODEC. Providing high quality video at a relatively low bit rate makes the VP6 CODEC computationally intensive. Decoding efficiency may be improved with dedicated decoding hardware, but inclusion of a dedicated video decoding processor in an end-user device increases the cost of the device. Further, it may not be practical to include dedicated decoding hardware in mobile devices, particularly because such hardware may be unable to accommodate newer codecs in the future. Without dedicated decoding hardware, mobile devices may lack sufficient processing power to decode VP6 video clips, particularly for high definition or “HD” video content.
A general-purpose, multi-threaded processor is associated with firmware including instructions to configure the multi-threaded processor as a specialized video decoding processor. Operating as configured by the firmware instructions, one thread of a processor is configured as a pre-processing thread that allocates macroblocks of video data, such as flash video data compliant with a VP6 format, among other threads configured to process the macroblocks and perform coefficient decoding. The pre-processing thread balances a workload between the processing threads, and the pre-processing thread may act as a processing thread for some macroblocks to further assist in workload balancing. One or more other threads may be configured to perform front-end processing to decode other video data included in received frames of video data or to perform post-processing to enhance the decoded video data. As a result, without allocating space or incurring cost to include a dedicated hardware processor, a digital signal processor or a general purpose processor that supports signal processing instructions can be configured to perform efficient video decoding.
Embodiments of the present disclosure provide electronic devices and methods for equipping a multi-threaded processor with firmware instructions to configure threads to perform functions to support decoding video data, such as VP6 data. One thread may be configured as a pre-processing thread to allocate macroblocks of video data among one or more processing threads configured to perform video decoding on the macroblocks. A task buffer may be used through which the pre-processing thread allocates macroblocks to particular processing threads without engaging an operating system. A particular thread may be configured as a front-end thread, for example, to decode a frame header and to perform prediction mode or motion vector parsing. Still another thread may be configured as a post-processing thread to perform deblocking, video format transformation, or other video enhancement functions.
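For illustration only, a minimal sketch in C is provided below of how firmware might associate dedicated-function roles with individual threads. The enum values, the role_of_thread table, and the assumption of a six-thread processor are hypothetical and are not taken from the disclosure.

    /* Hypothetical dedicated-function roles that firmware instructions might
     * assign to individual hardware threads of a multi-threaded processor. */
    typedef enum {
        ROLE_FRONT_END,        /* frame header, prediction mode, MV parsing     */
        ROLE_PRE_PROCESSING,   /* coefficient parsing and macroblock allocation */
        ROLE_PROCESSING,       /* macroblock (coefficient) decoding             */
        ROLE_POST_PROCESSING   /* deblocking, format transformation, etc.       */
    } thread_role_t;

    /* One possible static assignment for a six-thread processor; illustrative
     * only, since a configuration without a front-end role (freeing another
     * processing thread) is also contemplated. */
    static const thread_role_t role_of_thread[6] = {
        ROLE_FRONT_END,
        ROLE_PRE_PROCESSING,
        ROLE_PROCESSING, ROLE_PROCESSING, ROLE_PROCESSING,
        ROLE_POST_PROCESSING
    };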
In a particular embodiment, an electronic device includes a multi-threaded processor and a memory. The multi-threaded processor is configured to execute digital signal processing instructions. The memory includes firmware including instructions executable by the multi-threaded processor, without use of a dedicated hardware macroblock decoding module, to decode video data compliant with a VP6 format.
In another particular embodiment, an electronic device includes a processor including a plurality of threads and a memory that maintains firmware instructions executable by the processor to perform functions to process video data. The instructions in the firmware configure at least some of the plurality of threads to operate as a plurality of dedicated function threads. The dedicated function threads include one or more processing threads. Each of the processing threads is configured to perform video decoding on one or more macroblocks of video data. The dedicated function threads also include a pre-processing thread configured to receive a plurality of macroblocks and to allocate at least some of the plurality of macroblocks among the one or more processing threads for video decoding.
In another particular embodiment, a method includes receiving video data including a plurality of macroblocks at a processor. The processor includes a plurality of threads. At least some of the plurality of threads are configured according to instructions in firmware associated with the processor to perform dedicated functions. The method also includes configuring the plurality of threads to perform dedicated functions. Configuring the plurality of threads to perform dedicated functions includes configuring one or more of the plurality of threads as processing threads to perform video decoding on one or more macroblocks of the video data. Configuring the plurality of threads to perform dedicated functions also includes configuring one of the plurality of threads as a pre-processing thread to allocate the plurality of macroblocks for the video decoding.
Embodiments of the present disclosure enable efficient video data decoding. Threads of a multi-threaded processor or multiple-threaded digital signal processor are configured to perform dedicated functions according to instructions in firmware of the processor. In a particular embodiment, a thread is configured as a front-end thread to decode parts of the video data, such as a frame header, a prediction mode, or motion vector data. Another thread is configured as a pre-processing thread to allocate macroblock data among multiple other threads configured to perform more intensive decoding, e.g., rendering decoded video from coding coefficients. The pre-processing thread also may be configured to perform video decoding of a macroblock when each of the plurality of processing threads is already performing decoding of another macroblock, thereby helping to prevent or reduce a backlog of macroblock decoding for the plurality of processing threads. In a particular embodiment, the pre-processing thread determines to which of the plurality of processing threads to assign the macroblocks and then stores the macroblocks in slots in a lockless task buffer. Each slot in the lockless task buffer is dedicated to a particular one of the plurality of processing threads. Each of the plurality of processing threads may retrieve assigned macroblocks from an assigned dedicated slot in the lockless task buffer as soon as the processing thread completes a previous task. Each of the processing threads can access the lockless task buffer directly and asynchronously without waiting for a lock on the task buffer to be released by another processing thread or having to participate in a contention avoidance process managed by an operating system or other software. In another embodiment, no thread is configured as a front-end thread, resulting in additional decoding work for the pre-processing thread but freeing another thread to be used as a processing thread.
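As a hedged illustration of the lockless task buffer described above, the following C sketch (using C11 atomics) shows one possible layout with one dedicated slot per processing thread. The type names, the mb_task_t fields, and NUM_PROCESSING_THREADS are hypothetical and are used only for this example.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NUM_PROCESSING_THREADS 3   /* e.g., three processing threads */

    /* A macroblock task handed from the pre-processing thread to a
     * processing thread (fields hypothetical). */
    typedef struct {
        int   mb_index;        /* position of the macroblock within the frame */
        void *coeff_data;      /* parsed DC/AC coefficient data               */
    } mb_task_t;

    /* One slot per processing thread; the flag signals whether the slot
     * currently holds a macroblock awaiting decoding. */
    typedef struct {
        mb_task_t   task;
        atomic_bool occupied;  /* set by the pre-processing thread, cleared
                                  by the owning processing thread */
    } task_slot_t;

    /* The lockless task buffer is simply an array with one dedicated slot per
     * processing thread, so no slot is ever shared between consumers. */
    typedef struct {
        task_slot_t slot[NUM_PROCESSING_THREADS];
    } lockless_task_buffer_t;

Because every slot has exactly one producer (the pre-processing thread) and one consumer (its owning processing thread), a single atomic flag per slot can coordinate access without locks or operating system involvement.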
After threads of the multi-threaded processor 110 are configured to perform dedicated functions for video decoding, the threads of the multi-threaded processor 110 perform those functions to decode video data 130, such as macroblocks of VP6 format data, MPEG-4 data, H.264 data, or other video data. In a particular embodiment, the video data is flash video data that is encoded in a VP6 format and that is streamed via the Internet. Configuration of the threads of the multi-threaded processor 110 to perform dedicated functions may enable efficient decoding of the video data 130. The multi-threaded processor 110 decodes the video data 130 to generate decoded video data 140. In a particular embodiment, the video data 130 is decoded at a speed of 30 frames per second or more and at a resolution of up to 1280 by 720 pixels.
A general-purpose multi-threaded processor configured according to firmware-based instructions may afford a number of advantages for video decoding. First, a signal processor configured by firmware-based instructions to perform video decoding may provide greater image processing throughput than a general purpose processor performing software-based decoding. Second, including a signal processor that is configurable by firmware-based instructions to perform video decoding provides at least some of the advantages of dedicated decoding hardware without adding the cost or consuming the space that a dedicated video decoder may require. These advantages may be particularly beneficial in a mobile device.
Macroblock data in VP6 format may be transmitted in one or more partitions. In the example of
By contrast, in the two partition case 250, two partitions such as Partition 0 260 and Partition 1 280 may be employed to carry different portions of data for each of a plurality of macroblocks. For example, the mode data 222 and 232 for macroblocks MB0 262 and MB1 270 (in the two partition case 250) are presented in a first partition, Partition 0 260, while a second partition, Partition 1 280, includes the DC/AC coefficients 226 and 236 for the macroblocks MB0 262 and MB1 270 (in the two partition case 250). The one partition case 200 may be employed for some advanced profile video clips. In the one partition case 200, the single partition 210 is Bool-encoded. The two partition case 250 may be employed for some advanced profile video clips (e.g., clips with high bitrate, high definition content) and simple profile cases. In the two partition case 250, the first partition, e.g., Partition 0 260, is Bool-encoded while the second partition, e.g., Partition 1 280, is either Bool-encoded or Huffman-encoded.
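As a hedged illustration of the two cases, the following C structures show one way the per-macroblock data might be organized; the field names and coefficient counts are hypothetical and are not taken from the VP6 specification.

    /* One partition case: all data for a macroblock travels together in a
     * single Bool-encoded partition (field names hypothetical). */
    typedef struct {
        unsigned char mode;             /* prediction mode                   */
        short         mv_x, mv_y;       /* motion vector components          */
        short         dcac_coeff[64];   /* DC/AC coefficients (illustrative) */
    } mb_one_partition_t;

    /* Two partition case: mode and motion vector data arrive in Partition 0,
     * while the DC/AC coefficients for the same macroblocks arrive separately
     * in Partition 1 (Bool- or Huffman-encoded). */
    typedef struct {
        unsigned char mode;
        short         mv_x, mv_y;
    } mb_partition0_entry_t;

    typedef struct {
        short dcac_coeff[64];           /* illustrative coefficient count */
    } mb_partition1_entry_t;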
Regardless of whether the macroblocks are transmitted using the one partition case 200 or the two partition case 250, portions of the macroblock data are distributed to threads within the multi-threaded processor 110 in the same way. In a particular illustrative embodiment in which one of the threads is configured as the front-end thread 201, frame header data 214 (from the one partition case 200) or 254 (from the two partition case 250) is assigned to the front-end thread 201. Mode data 222 and 232 and MV data 224 and 234 are also assigned to the front-end thread 201 for decoding.
Processing of DC/AC coefficient data 226 and 236, which is a more intensive aspect of the video decoding, is assigned to the plurality of processing threads 206 by the pre-processing thread 202. More specifically, the macroblock data including the DC/AC coefficient data 226 and 236 is assigned to the pre-processing thread 202, which assigns data for each of the macroblocks to one of the plurality of processing threads 206 via the lockless task buffer 204. The macroblock data is retrieved from the lockless task buffer 204 by each of the plurality of processing threads 206 when each of the plurality of processing threads 206 is ready to accept a next macroblock, as further described with reference to
In a particular embodiment, the pre-processing thread 202 may be configured to perform functions in addition to assigning macroblocks among the plurality of processing threads 206. For example, the pre-processing thread 202 also may parse the DC/AC coefficients 226 and 236 to gauge relative processing complexity of the macroblocks. In addition, to further relieve bottlenecks and distribute the workload, when none of the plurality of processing threads 206 is available to decode a particular macroblock, the pre-processing thread 202 itself may decode the particular macroblock.
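One simple, hypothetical way to gauge relative complexity is to count the non-zero coefficients produced by the parse, as in the C sketch below; the disclosure does not mandate any particular metric, and the function name is illustrative.

    /* Hypothetical complexity estimate: macroblocks with more non-zero DC/AC
     * coefficients generally require more reconstruction work, so the count
     * of non-zero coefficients can serve as a rough workload metric. */
    static int estimate_mb_complexity(const short *coeff, int num_coeff)
    {
        int nonzero = 0;
        for (int i = 0; i < num_coeff; i++) {
            if (coeff[i] != 0) {
                nonzero++;
            }
        }
        return nonzero;   /* larger value suggests more work for a thread */
    }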
A post-processing thread 240 receives the decoded macroblocks from the plurality of processing threads 206 and may perform functions such as deblocking, video format transformation, and motion compensation on the decoded video data to generate a video output 290 to a display device (not shown in
According to the particular illustrative embodiment of
In addition to allocating the macroblocks among the plurality of processing threads 206, the pre-processing thread 202 also may perform other functions. For example, the pre-processing thread may be used to parse the DC/AC coefficients 226 and 236 (in the single partition case) or to decode macroblocks. The pre-processing thread 202, like each of the plurality of processing threads 206, may be configured to perform macroblock decoding. As further described with reference to
Employing the lockless task buffer 204 to hold the macroblocks for the plurality of processing threads 206 also helps to improve decoding efficiency. Each of the plurality of processing threads 206 can retrieve macroblock data for decoding without waiting for a lock to be lifted, without waiting for operating system intervention, and without other delays that may result when the plurality of processing threads 206 do not have free access to a task buffer storing the macroblocks. Operation of the pre-processing thread 202 is described further with reference to
The dedicated threads operate to decode macroblocks of video data, including macroblock 0 (MB0) 390 through macroblock 5 (MB5) 395.
Initially, each of the macroblocks, from macroblock MB0 390 through MB5 395, is stored in the task queue 310. Each of the macroblocks MB0 390 through MB5 395 is sequentially retrieved from the task queue 310 by the front-end thread 201, where the front-end thread 201 performs processing of frame header data, mode data, and motion vector data. The resulting macroblocks are then stored in a pre-processing thread task queue 320. The processor thread configured as the pre-processing (or “high-end”) thread 202 then retrieves each of the macroblocks from the pre-processing thread task queue 320. The pre-processing thread 202 assigns the macroblocks to one of the plurality of processing threads 206 and stores the macroblocks in a slot dedicated to the assigned processing thread in the lockless task buffer 204.
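A hedged sketch of the front-end stage follows, building on the hypothetical mb_task_t type introduced earlier; queue_t, queue_pop, queue_push, and decode_header_mode_mv are placeholders for whatever queue and parsing routines an implementation provides.

    typedef struct queue queue_t;                            /* hypothetical FIFO type */

    extern bool queue_pop(queue_t *q, mb_task_t *out);       /* blocking pop           */
    extern void queue_push(queue_t *q, const mb_task_t *in); /* blocking push          */
    extern void decode_header_mode_mv(mb_task_t *mb);        /* header/mode/MV parsing */

    /* Hypothetical front-end thread: pull raw macroblocks from the task queue,
     * decode frame header, prediction mode, and motion vector data, then hand
     * the partially decoded macroblocks to the pre-processing thread's queue. */
    static void front_end_thread(queue_t *task_queue, queue_t *preproc_queue)
    {
        mb_task_t task;
        while (queue_pop(task_queue, &task)) {     /* e.g., MB0 390 through MB5 395 */
            decode_header_mode_mv(&task);          /* header, mode, and MV data     */
            queue_push(preproc_queue, &task);      /* next stage: slot assignment   */
        }
    }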
As further described below with reference to
Because one or more dedicated slots in the lockless task buffer 204 are associated with each of the plurality of processing threads 206, each of the plurality of processing threads 206 can access the lockless task buffer 204 to retrieve macroblocks without the task buffer having to be locked and without having to go through an operating system or other contention control system. Being able to directly and asynchronously access the lockless task buffer 204 may avoid delays that may result from waiting for locks to be lifted or waiting for other contention control systems to provide access to the buffer.
In the example of
While the processing threads 350, 360, and 370 process the macroblocks 390, 391, and 392, the front-end thread 201 retrieves additional macroblocks such as macroblocks MB6 496, MB7 497, MB8 498, and MB9 499 from the task queue 310 and processes frame header data, prediction mode data, and motion vector data. The front-end thread 201 stores the macroblocks MB6 496, MB7 497, MB8 498, and MB9 499 in the pre-processing thread task queue 320. The pre-processing thread 202 retrieves macroblocks, such as the macroblock MB7 497, from the pre-processing thread task queue 320 for coefficient parsing and assignment to a processing thread. The macroblocks MB3 393, MB4 394, and MB5 395 have been retrieved from the pre-processing thread task queue 320 by the pre-processing thread 202 and slotted in the lockless task buffer 204 to assign the macroblocks MB3 393, MB4 394, and MB5 395 to the first processing thread 350, the second processing thread 360, and the third processing thread 370, respectively.
In a particular illustrative embodiment, to facilitate workload balancing and to enhance throughput, the pre-processing thread 202 may assign macroblocks to itself and may act as an additional processing thread to decode one or more macroblocks. For example, before the processing threads 350, 360, and 370 retrieve the macroblocks MB3 393, MB4 394, and MB5 395, respectively, from the lockless task buffer 204, the feedback link 332 indicates to the pre-processing thread 202 that the slots in the lockless task buffer 204 are filled. With the slots in the lockless task buffer 204 filled, if the pre-processing thread 202 assigned a next macroblock, macroblock MB6 496, to one of the already filled slots, a video decoding backlog would result. Instead, the pre-processing thread 202 assigns decoding of the macroblock MB6 496 to itself. In other words, instead of continuing to assign macroblocks to the processing threads 350, 360, and 370 that already have a next macroblock queued for processing, the pre-processing thread 202 helps to avoid a potential backlog by devoting cycles to decoding the macroblock MB6 496. When the pre-processing thread 202 completes decoding of the macroblock MB6 496, the pre-processing thread 202 stores the decoded video in the frame buffer 340 and then retrieves a next macroblock, such as macroblock MB7 497, for assignment to one of the processing threads 350, 360, and 370 (or to itself if the slots in the lockless task buffer 204 remain filled).
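A hedged C sketch of this assignment loop, continuing the hypothetical types above, is shown below. The slot-selection policy (first free slot), parse_dcac_coefficients, decode_macroblock, frame_buffer_write, and frame_buffer_t are illustrative placeholders; a real implementation might also weigh per-macroblock complexity when choosing a slot.

    typedef struct frame_buffer frame_buffer_t;      /* hypothetical output buffer */

    extern void parse_dcac_coefficients(mb_task_t *mb);
    extern void decode_macroblock(mb_task_t *mb);
    extern void frame_buffer_write(frame_buffer_t *frame, const mb_task_t *mb);

    /* Store the task in the first free dedicated slot, if any.  The occupied
     * flag is written with release ordering only after the task data is in
     * place, so the consuming thread always sees a complete task. */
    static bool try_store_in_slot(lockless_task_buffer_t *buf,
                                  const mb_task_t *task)
    {
        for (int i = 0; i < NUM_PROCESSING_THREADS; i++) {
            if (!atomic_load_explicit(&buf->slot[i].occupied,
                                      memory_order_acquire)) {
                buf->slot[i].task = *task;
                atomic_store_explicit(&buf->slot[i].occupied, true,
                                      memory_order_release);
                return true;
            }
        }
        return false;   /* every dedicated slot already holds a macroblock */
    }

    /* Hypothetical pre-processing thread: parse coefficients, try to place
     * each macroblock in a free dedicated slot; if every slot is occupied,
     * decode the macroblock locally rather than create a backlog. */
    static void pre_processing_thread(queue_t *preproc_queue,
                                      lockless_task_buffer_t *buf,
                                      frame_buffer_t *frame)
    {
        mb_task_t task;
        while (queue_pop(preproc_queue, &task)) {
            parse_dcac_coefficients(&task);          /* DC/AC coefficient parse */
            if (!try_store_in_slot(buf, &task)) {    /* all slots filled?       */
                decode_macroblock(&task);            /* decode it locally       */
                frame_buffer_write(frame, &task);    /* store decoded video     */
            }
        }
    }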
When one of the processing threads 350, 360, and 370 is available to receive and process a macroblock, the available processing thread retrieves a macroblock from the respective slot 630, 631, or 632 associated with that processing thread. Allocation of the dedicated slots 630, 631, and 632 to the respective processing threads 350, 360, and 370 enables each of the processing threads to retrieve an assigned macroblock from the lockless task buffer 204 whenever the processing thread completes decoding of a previously assigned macroblock and is ready to decode another macroblock. Because the slots 630, 631, and 632 are dedicated to the individual processing threads 350, 360, and 370, respectively, the processing threads only retrieve macroblocks from their own dedicated slots and do not contend for macroblocks assigned to other slots. Thus, the lockless task buffer 204 may be accessed independently and asynchronously by the processing threads 350, 360, and 370 without locking or other contention control mechanisms. The lockless task buffer 204 may thus avoid delays in supplying macroblocks to the processing threads.
Each of the slots 630, 631, and 632 is associated with a flag 640, 641, and 642, respectively, to signal when each of the slots 630, 631, and 632 stores a macroblock for a respective processing thread 350, 360, or 370. In a particular illustrative embodiment, the flags 640, 641, and 642 are set when a macroblock is stored in the respective slot 630, 631, and 632. The flags 640, 641, and 642 are cleared when no macroblock is stored in the respective slot 630, 631, and 632, signaling to the respective processing thread 350, 360, and 370 that there is no macroblock waiting to be decoded. When no macroblock is stored in the dedicated slot 630, 631, or 632 for one of the respective processing threads 350, 360, or 370, the respective processing thread 350, 360, or 370 may assume a standby or sleep state.
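The consumer side of the flag protocol might then look like the hedged C sketch below, again using the hypothetical types above; the standby state is modeled with a simple yield for illustration, whereas a firmware implementation could instead park the hardware thread.

    #include <sched.h>    /* sched_yield(), a stand-in for a standby/sleep state */

    /* Hypothetical processing thread: wait on the dedicated slot's flag, copy
     * the macroblock out, clear the flag so the pre-processing thread can
     * refill the slot, then decode and write the result to the frame buffer. */
    static void processing_thread(task_slot_t *my_slot, frame_buffer_t *frame)
    {
        for (;;) {
            if (!atomic_load_explicit(&my_slot->occupied, memory_order_acquire)) {
                sched_yield();                            /* no macroblock waiting */
                continue;
            }
            mb_task_t task = my_slot->task;               /* retrieve assigned MB  */
            atomic_store_explicit(&my_slot->occupied, false,
                                  memory_order_release);  /* clear the flag        */
            decode_macroblock(&task);                     /* coefficient decoding  */
            frame_buffer_write(frame, &task);             /* decoded macroblock    */
        }
    }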
In the example of
In the example of
The pre-processing thread 202 assigned the macroblock MB14 794 to the second slot 631 because the second flag 641 (
The wireless device 900 may be implemented in a portable electronic device and includes the multi-threaded processor 110, which may include a digital signal processor (DSP). The multi-threaded processor 110 is associated with a memory, such as a firmware 120, that includes instructions enabling the multi-threaded processor 110 to configure threads to perform different dedicated functions as previously described with reference to
A camera interface 968 is coupled to the multi-threaded processor 110 and also coupled to a camera, such as a video camera 970. A display controller 926 is coupled to the multi-threaded processor 110 and to a display device 928. A general coder/decoder (general CODEC) 934 can also be coupled to the processor 110. A speaker 936 and a microphone 938 can be coupled to the general CODEC 934 to encode or decode audio data or to encode and decode other types of video data. A wireless interface 940 can be coupled to the processor 110 and to a wireless antenna 942. Via the wireless interface 940, the wireless device 900 may receive streamed or downloadable VP6 format data to be decoded by the multi-threaded processor 110 configured according to the instructions stored in the firmware 120 for configuring threads of the multi-threaded processor 110 to perform VP6 decoding.
In a particular embodiment, the multi-threaded processor 110, the display controller 926, the memory 932, the CODEC 934, the wireless interface 940, and the camera interface 968 are included in a system-in-package or system-on-chip device 922. In a particular embodiment, an input device 930 and a power supply 944 are coupled to the system-on-chip device 922. Moreover, in a particular embodiment, as illustrated in
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software executed by a processing unit, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or executable processing instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), a magnetoresistive random access memory (MRAM), a spin-torque-transfer MRAM (STT-MRAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transitory storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed embodiments is provided to enable a person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.