Portable computing devices (“PCDs”) are becoming necessities for people on personal and professional levels. These devices may include cellular telephones, portable digital assistants (“PDAs”), portable game consoles, palmtop computers, and other portable electronic devices. PCDs commonly contain integrated circuits, or systems on a chip (“SoC”), that include numerous components designed to work together to deliver functionality to a user. For example, a SoC may contain any number of processing engines such as modems, central processing units (“CPUs”) made up of cores, graphical processing units (“GPUs”), etc. that read and write data and instructions to and from memory components on the SoC.
The efficient use of bus bandwidth and memory capacity in a PCD is important for optimizing the functional capabilities of processing components on the SoC. Multi-media applications on a PCD can use significant amounts of bandwidth and storage resources. For instance, the transmission and/or display of digital video or image frames requires memory, buffers, channels, and buses that can support a large volume of bits. Conventionally, image data is presented in frames comprising pixels, with higher resolution content comprising many frames and a large number of pixels per frame.
Commonly, data compression is used to increase bandwidth availability (such as bus bandwidth) for data being sent to a memory component through a memory controller or via direct memory access (DMA). However, typical compression systems and methods can actually reduce the efficiency of transmitting the image data and/or accessing the memory component (measured in bytes per clock cycle). Such inefficiencies may, for example, be caused by the need to buffer portions of the frames comprising the image data while awaiting compression in order to keep the data of the frames in a required data stream order for a recipient device or component such as a decoder. Therefore, there is a need in the art for a system and method that addresses the inefficiencies associated with compressing multi-media data, and for more rapid multi-media data transactions.
Various embodiments of methods and systems for out-of-stream-order compression of multi-media data tiles in a system on a chip (“SoC”) of a portable computing device (“PCD”) are disclosed. An exemplary method begins by receiving an input data transaction comprising an uncompressed data tile. A header pixel of at least one sub-tile of the received uncompressed data tile is extracted, where the sub-tile comprises a plurality of data blocks received in an input order. The plurality of data blocks is encoded in the input order, and an Idx code for each of the plurality of encoded data blocks is stored in a stream buffer. The header pixel, a BFLC code for each of the plurality of encoded data blocks, and the Idx code for each of the plurality of encoded data blocks from the stream buffer are packed into an output format.
In the drawings, like reference numerals refer to like parts throughout the various views unless otherwise indicated. For reference numerals with letter character designations such as “102A” or “102B”, the letter character designations may differentiate two like parts or elements present in the same figure. Letter character designations for reference numerals may be omitted when it is intended that a reference numeral encompass all parts having the same reference numeral in all figures.
The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect described herein as “exemplary” is not necessarily to be construed as exclusive, preferred or advantageous over other aspects.
In this description, the term “application” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, an “application” referred to herein, may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
In this description, reference to “DRAM” or “DDR” memory components will be understood to envision any of a broader class of volatile random access memory (“RAM”) and will not limit the scope of the solutions disclosed herein to a specific type or generation of RAM. That is, it will be understood that references to “DRAM” or “DDR” for various embodiments may be applicable to DDR, DDR-2, DDR-3, low power DDR (“LPDDR”) or any subsequent generation of DRAM.
As used in this description, the terms “component,” “database,” “module,” “system,” and the like are intended to refer generally to a computer-related entity, either hardware, firmware, a combination of hardware and software, software, or software in execution, unless specifically limited to a certain computer-related entity. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
By way of illustration, both an application running on a computing device and the computing device may be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal).
In this description, the terms “central processing unit (“CPU”),” “digital signal processor (“DSP”),” “graphical processing unit (“GPU”),” and “chip” are used interchangeably. Moreover, a CPU, DSP, GPU or chip may be comprised of one or more distinct processing components generally referred to herein as “core(s).”
In this description, the terms “engine,” “processing engine,” “processing component” and the like are used to refer to any component within a system on a chip (“SoC”) that transfers data over a bus to or from a memory component. As such, a processing component may refer to, but is not limited to refer to, a CPU, DSP, GPU, modem, controller, etc.
In this description, the term “bus” refers to a collection of wires through which data is transmitted from a processing engine to a memory component or other device located on or off the SoC. It will be understood that a bus consists of two parts—an address bus and a data bus, where the data bus transfers actual data and the address bus transfers information specifying the location of the data in a memory component (i.e., metadata). The terms “width,” “bus width,” and “bandwidth” refer to an amount of data, i.e., a “chunk size,” that may be transmitted per cycle through a given bus. For example, a 16-byte bus may transmit 16 bytes of data at a time, whereas a 32-byte bus may transmit 32 bytes of data per cycle. Moreover, “bus speed” refers to the number of times a chunk of data may be transmitted through a given bus each second. Similarly, a “bus cycle” or “cycle” refers to transmission of one chunk of data through a given bus.
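The relationship among bus width, bus speed, and effective bandwidth described above reduces to simple arithmetic. The following sketch is illustrative only; the specific widths and clock rates are hypothetical and not drawn from any particular SoC:

```python
def bus_throughput_bytes_per_sec(bus_width_bytes, bus_speed_hz):
    """One chunk (the bus width) is transferred per bus cycle, so
    effective bandwidth is simply width multiplied by speed."""
    return bus_width_bytes * bus_speed_hz

# A 32-byte bus moves twice as much data per cycle as a 16-byte bus
# running at the same speed.
assert bus_throughput_bytes_per_sec(32, 100_000_000) == \
       2 * bus_throughput_bytes_per_sec(16, 100_000_000)
```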
In this description, the term “portable computing device” (“PCD”) is used to describe any device operating on a limited capacity power supply, such as a battery. Although battery operated PCDs have been in use for decades, technological advances in rechargeable batteries coupled with the advent of third generation (“3G”) and fourth generation (“4G”) wireless technology have enabled numerous PCDs with multiple capabilities. Therefore, a PCD may be a cellular telephone, a satellite telephone, a pager, a PDA, a smartphone, a navigation device, a smartbook or reader, a media player, a wearable device, a combination of the aforementioned devices, a laptop computer with a wireless connection, among others.
To make efficient use of bus bandwidth and/or DRAM capacity, data is often compressed according to lossless or lossy compression algorithms, as would be understood by one of ordinary skill in the art. Because the data is compressed, it takes less space to store and uses less bandwidth to transmit. However, because DRAM typically requires a minimum amount of data to be transacted at a time (a minimum access length, i.e. “MAL”), a transaction of compressed data may require filler data to meet the minimum access length requirement. Filler data or “padding” is used to “fill” the unused capacity in a transaction that must be accounted for in order to meet a given MAL.
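The MAL padding described above can be sketched as follows; the 32-byte MAL and the payload size are hypothetical values chosen only to show how filler data arises:

```python
def pad_to_mal(payload_bytes, mal_bytes):
    """Round a compressed payload up to the next multiple of the DRAM's
    minimum access length (MAL), returning (transaction_bytes, filler_bytes)."""
    filler = (-payload_bytes) % mal_bytes
    return payload_bytes + filler, filler

# A 50-byte compressed payload against a 32-byte MAL must be carried in
# a 64-byte transaction, 14 bytes of which are filler ("padding").
assert pad_to_mal(50, 32) == (64, 14)
```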
Multi-media applications on a PCD can use significant amounts of bandwidth and storage resources. For instance, the transmission and/or display of digital video or image frames requires buses that can support a large volume of bits. Conventionally, such video and image data is presented in frames comprising pixels, with higher resolution content comprising many frames and a large number of pixels per frame. Frames may themselves be broken down into 256-byte data tiles comprised of pixels. Depending on the standard, a frame may be broken down into separate 256-byte data tiles for the luma/brightness (typically represented by “Y”) and chroma/color (typically represented by “UV”) components, and may be configured in different manners.
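As a rough sense of scale, the number of 256-byte tiles covering one plane of a frame follows directly from the frame dimensions. The 8-bit 1920×1080 luma plane below is a hypothetical example, assuming the plane divides evenly into whole tiles:

```python
def tiles_per_plane(width_px, height_px, bytes_per_px, tile_bytes=256):
    """Count the fixed-size data tiles covering one plane (e.g. Y or UV)
    of a frame, assuming the plane divides evenly into whole tiles."""
    return (width_px * height_px * bytes_per_px) // tile_bytes

# A 1920x1080 8-bit luma ("Y") plane alone spans 8,100 256-byte tiles.
assert tiles_per_plane(1920, 1080, 1) == 8100
```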
For example,
Compressing the image data contained in image tile 300A typically requires buffering the 4-pixel×4-pixel data blocks 303, 305, 307, 309 in order to compress the pixels into a data stream where the pixels are arranged in the order required by a receiving device such as a decoder (referred to herein as “in order” compression). For example, typical compression of the image tile 300A requires compressing the “0” 4-pixel×1-pixel portion of the 1st sub-tile 302, then the “0” 4-pixel×1-pixel portion of the 2nd sub-tile 304, then the “0” 4-pixel×1-pixel portion of the 3rd sub-tile 306, followed by the “0” 4-pixel×1-pixel portion of the 4th sub-tile 308.
The process would repeat for the “1” 4-pixel×1-pixel portions of the sub-tiles 302, 304, 306, 308, the “2” 4-pixel×1-pixel portions of the sub-tiles 302, 304, 306, 308, etc., to place the compressed pixel data of the image tile 300A into a data stream in the order needed by a recipient component such as a decoder. This compression scheme requires multiple buffers to hold the various uncompressed sub-tile 302, 304, 306, 308 pixel data while waiting for compression. Such buffers result in inefficient compression, slowing throughput, and can also take up valuable area on already over-crowded SoCs.
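The ordering problem above can be modeled in a few lines. The sketch assumes, for illustration, that sub-tiles 302, 304, 306, 308 arrive one after another, each as four 4-pixel×1-pixel portions numbered 0 through 3:

```python
SUBTILES = (302, 304, 306, 308)  # sub-tile reference numerals

# Input order: each sub-tile arrives whole, portions 0-3 in sequence.
input_order = [(s, p) for s in SUBTILES for p in range(4)]

# Output stream order: portion 0 of every sub-tile, then portion 1, etc.
stream_order = [(s, p) for p in range(4) for s in SUBTILES]

# Both orders visit the same 16 portions but disagree almost everywhere,
# which is why "in order" compression must hold portions of nearly three
# sub-tiles in buffers before the first row of the stream is complete.
assert sorted(input_order) == sorted(stream_order)
assert input_order != stream_order
```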
Other formats of multi-media tiles face the same problem.
Additionally, each 4-pixel×4-pixel data block 323, 325, 327, 329 may contain 4-pixel×1-pixel portions, illustrated in
The present disclosure provides cost-effective and efficient systems and methods for out-of-stream-order compression of multi-media data tiles, such as the image tiles 300A and 300B of
In general, multi-media (“MM”) CODEC module 113 may be formed from hardware and/or firmware and may be responsible for performing out-of-stream-order compression of multi-media data tiles. It is envisioned that multi-media data tiles, such as image tiles 300A or 300B, for instance, may be compressed out-of-stream-order according to a lossless or lossy compression algorithm executed by the MM CODEC module 113 and combined into a data stream/transaction that may be processed by a receiving component such as a decompression module (not shown in
As illustrated in
As further illustrated in
The CPU 110 may also be coupled to one or more internal, on-chip thermal sensors 157A as well as one or more external, off-chip thermal sensors 157B. The on-chip thermal sensors 157A may comprise one or more proportional to absolute temperature (“PTAT”) temperature sensors that are based on vertical PNP structure and are usually dedicated to complementary metal oxide semiconductor (“CMOS”) very large-scale integration (“VLSI”) circuits. The off-chip thermal sensors 157B may comprise one or more thermistors. The thermal sensors 157 may produce a voltage drop that is converted to digital signals with an analog-to-digital converter (“ADC”) controller (not shown). However, other types of thermal sensors 157 may be employed.
The touch screen display 132, the video port 138, the USB port 142, the camera 148, the first stereo speaker 154, the second stereo speaker 156, the microphone 160, the FM antenna 164, the stereo headphones 166, the RF switch 170, the RF antenna 172, the keypad 174, the mono headset 176, the vibrator 178, thermal sensors 157B, the PMIC 180 and the power supply 188 are external to the on-chip system 102. It will be understood, however, that one or more of these devices depicted as external to the on-chip system 102 in the exemplary embodiment of a PCD 100 in
In a particular aspect, one or more of the method steps described herein may be implemented by executable instructions and parameters stored in the memory 112 or the multi-media CODEC module 113. Further, the multi-media CODEC module 113, the memory 112, the instructions stored therein, or a combination thereof may serve as a means for performing one or more of the method steps described herein.
Turning to
Bus 211 may include multiple communication paths via one or more wired or wireless connections, as is known in the art and described above in the definitions. The bus 211 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the bus 211 may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.
The processing engine(s) 201 may be part of CPU 110 comprising a multiple-core processor having N core processors. As is known to one of ordinary skill in the art, each of the N cores is available for supporting a dedicated application or program. Alternatively, one or more applications or programs may be distributed for processing across two or more of the available cores. The N cores may be integrated on a single integrated circuit die, or they may be integrated or coupled on separate dies in a multiple-circuit package. Designers may couple the N cores via one or more shared caches and they may implement message or instruction passing via network topologies such as bus, ring, mesh and crossbar topologies.
As is understood by one of ordinary skill in the art, the processing engine(s) 201, in executing a workload could be fetching and/or updating instructions and/or data that are stored at the address(es) of the memory 112. Additionally, as illustrated in
As the processing engines 201 generate data transfers for transmission via bus 211 to memory 112 and/or display 232, multi-media CODEC module 113 may compress tile-sized units of an image frame to make more efficient use of DRAM 115 capacity and/or bus 211 bandwidth. As discussed below, the multi-media CODEC module 113 may be configured to perform out-of-stream-order compression of the data tiles for the image frame. The compressed data tiles may be stored in memory 112 and/or provided to decoder 215 in a data stream that the decoder 215 may act on to decompress the data tiles for viewing on the display 232. In this description, the various embodiments are described within the context of an image frame made up of 256-byte tiles.
Notably, however, it will be understood that the 256-byte tile sizes, as well as the various compressed data transaction sizes, are exemplary in nature and do not suggest that embodiments of the solution are limited in application to 256-byte tile sizes. As such, one of ordinary skill in the art will recognize that the particular data transfer sizes, chunk sizes, bus widths, etc. that are referred to in this description are offered for exemplary purposes only and do not limit the scope of the envisioned solutions to applications having the same data transfer sizes, chunk sizes, bus widths, etc. As will become more apparent from further description and figures, out-of-stream-order compression may improve the effectiveness and transaction throughput of the multi-media CODEC module 113, while at the same time reducing the footprint on the SoC required for the module 113, resulting in cost and manufacturing savings.
Turning to
The input transaction of multi-media tiles received by the Unpacker 410 comprises uncompressed pixel data (“source pixels”). The received input transaction may be arranged as 4-pixel×4-pixel data blocks 303, 305, 307, 309 (see
After receiving the input transaction, the Unpacker 410 extracts header pixels for the sub-tiles of a received tile in the input transaction, such as header pixels for sub-tiles 302, 304, 306, 308 of
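As a toy illustration of the header-pixel extraction step, the sketch below treats the first byte of each 64-byte sub-tile in a flat 256-byte tile as its header pixel; the sub-tile size and the position of the header pixel are assumptions made for illustration, not requirements of the disclosure:

```python
def extract_headers(tile_pixels, subtile_size=64):
    """Pull one header pixel per sub-tile from a flat tile; here the
    header is assumed to be the sub-tile's first pixel."""
    return [tile_pixels[i] for i in range(0, len(tile_pixels), subtile_size)]

# A 256-byte tile with four 64-byte sub-tiles yields four header pixels.
assert extract_headers(list(range(256))) == [0, 64, 128, 192]
```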
Unpacker 410 forwards the source pixels of each received block unit to the Block Encoder 420 for compression in the order the block units are received by the Unpacker 410 in the input data stream. In other words, the encoder 400 of
Finally, Unpacker 410 provides a neighbor pixel update to Neighbor Manager 440. The neighbor pixel update comprises information about one or more pixels adjacent to or adjoining the pixel being compressed by the Block Encoder 420. In an embodiment, the Neighbor Manager 440 receives from Unpacker 410 and stores information about the neighbor pixels of the pixels being sent to the Block Encoder 420 for compression. Such information may include values for the neighbor pixel(s) as well as header pixels for sub-tile neighbors, etc.
Neighbor Manager 440 then provides this neighbor information for each pixel as the pixel is being compressed by the Block Encoder 420, enabling better compression performance and/or predictability. Neighbor Manager 440 continually receives neighbor pixel information updates from the Unpacker 410 corresponding to source pixels the Unpacker 410 is forwarding to the Block Encoder 420. Neighbor Manager 440 stores such neighbor pixel information until needed by the Block Encoder 420 and then forwards the neighbor pixel information to the Block Encoder 420.
In an embodiment, Neighbor Manager 440 provides values or information about the left, top-left, and top neighbors of the pixel currently being encoded by Block Encoder 420. Neighbor Manager 440 may in some embodiments simultaneously provide neighbor pixel information for multiple pixels being compressed by the Block Encoder 420, such as, for example, neighbor pixels for a 4-pixel×4-pixel data block 303, 305, 307, 309 (see
Block Encoder 420 receives the source pixels from the Unpacker 410 and the neighbor pixels from Neighbor Manager 440 and encodes/compresses the pixels of the received block unit using any desired algorithm. As discussed above, Block Encoder 420 encodes the block unit pixels in the order that the block units are received by the Unpacker 410, rather than in an order required by a downstream component such as a decoder.
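To see why left, top-left, and top neighbors aid compression, consider a simple gradient predictor. The disclosure does not name a specific prediction algorithm; the median-of-three predictor below (familiar from JPEG-LS-style coders) is shown only as one plausible way such neighbor information could be used:

```python
def predict(left, top, top_left):
    """Median-of-three gradient predictor: the encoder then codes only
    the (usually small) residual between prediction and actual pixel."""
    if top_left >= max(left, top):
        return min(left, top)
    if top_left <= min(left, top):
        return max(left, top)
    return left + top - top_left

# Smooth image regions predict well, so residuals stay near zero.
assert predict(100, 102, 100) == 102  # gradient continues from the top
assert predict(100, 102, 101) == 101  # smooth gradient interpolates
```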
For example,
The present system and method do not buffer the data blocks 503, 505, 507, 509 in order to compress the portions of each sub-tile 502, 504, 506, 508 in output data stream order as discussed above for
By way of another example,
However, as illustrated in
Returning to
For example, in an embodiment, each encoding engine/process may be able to process a 4×4-pixel block per clock cycle such as data blocks 303, 305, 307, 309 of
In another embodiment, each encoding engine/process may be able to process a 4×4-pixel block per clock cycle such as data blocks 323, 325, 327, 329 of
Block Encoder 420 may also determine whether a 256-byte tile will be output from the encoder 400 as compressed blocks or whether the uncompressed source pixels of the 256-byte tile will be output from the encoder 400. Such uncompressed source pixels output from the encoder 400 are referred to herein as a “PCM tile.” This determination may be made by the Block Encoder 420 based on the size of the data tile after compression.
In an embodiment, the data tile may be encoded/compressed by the Block Encoder 420 into a compressed tile having a size that is a multiple of 32 bytes (i.e., 32 bytes, 64 bytes, 96 bytes, etc.) when the compressed blocks are sent to an external memory. In such embodiments, if the compressed tile is 224 bytes or greater, the compressed tile is discarded, and the uncompressed data tile will be output from the encoder 400 as a PCM tile.
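The compressed-versus-PCM decision described above can be sketched as follows, using the 32-byte granularity and the 224-byte fallback threshold from this embodiment:

```python
def choose_output(compressed_bytes, tile_bytes=256, granule=32, threshold=224):
    """Round the compressed tile up to a 32-byte multiple; if the result
    reaches the threshold, discard it and emit the uncompressed PCM tile."""
    rounded = -(-compressed_bytes // granule) * granule  # ceiling division
    if rounded >= threshold:
        return tile_bytes, "PCM"
    return rounded, "compressed"

assert choose_output(90) == (96, "compressed")
assert choose_output(220) == (256, "PCM")  # 220 rounds to 224, so fall back
```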
After encoding the received block units, Block Encoder 420 outputs the Idx codes to a Stream Buffer 430 and the BFLC codes to the Output Packer 450. Note that in cases where the Block Encoder 420 determines that certain 4×1 source pixels should be output uncompressed, the corresponding BFLC code may indicate that the particular 4×1 pixel block is a PCM block.
Stream Buffer 430 stores the 4-pixel compressed (or PCM) blocks from the Block Encoder 420, adding compressed (or PCM) blocks for a multi-media tile as they are received from the Block Encoder 420, until the Output Packer 450 is ready to send an output transaction as described below. Stream Buffer 430 stores the compressed blocks, and provides the compressed blocks to Output Packer 450, in output stream order—i.e. an order needed by a downstream component such as a decoder to decompress the multi-media tile. Stream Buffer 430 may be implemented with a flop array or RAM memory as desired in a variety of configurations.
For example, Stream Buffer 430 may comprise a 128-bit×16-bit flop array structure to store an entire multi-media tile, addressed in block linear order for each sub-tile. In another embodiment, where four sub-encoding engines of Block Encoder 420 are implemented, the Stream Buffer 430 may comprise four 40 (width)×16 (height) dual-port RAM memories that are word writable. Block addresses in such an implementation may be mapped in a way to support a 4-block write/read per clock cycle. As would be understood, this implementation allows for more throughput, but requires a larger total chip area for the RAM memory. In yet another embodiment, where only two sub-encoding engines of Block Encoder 420 are implemented, Stream Buffer 430 may comprise two 40 (width)×32 (height) dual-port RAM memories that are word writable. This implementation provides less throughput, but also requires a smaller total chip area for the RAM memory.
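The reordering role of the Stream Buffer can be modeled in miniature: codes are written in the input (encoding) order, but at addresses chosen so that a linear read-out is already in output-stream order. The four-sub-tile, four-portion geometry below is an assumption for illustration, not a required configuration:

```python
class ToyStreamBuffer:
    """Idx codes arrive in input order; block-linear addressing makes a
    sequential read come out in the decoder's stream order."""
    def __init__(self, n_subtiles=4, n_parts=4):
        self.n_subtiles = n_subtiles
        self.slots = [None] * (n_subtiles * n_parts)

    def write(self, subtile, part, idx_code):
        # Stream order is part-major: every sub-tile's part 0, then part 1...
        self.slots[part * self.n_subtiles + subtile] = idx_code

    def read_stream_order(self):
        return list(self.slots)

buf = ToyStreamBuffer()
for sub in range(4):           # input order: one whole sub-tile at a time
    for part in range(4):
        buf.write(sub, part, f"S{sub}P{part}")
assert buf.read_stream_order()[:4] == ["S0P0", "S1P0", "S2P0", "S3P0"]
```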
The encoder 400 of
Note that where the BFLC codes and/or Idx values received at the Output Packer 450 indicate that the output will be a PCM (uncompressed) tile, the Output Packer 450 will convert the PCM tile to an appropriate format for transmission in an output transaction. In an embodiment, when the Output Packer 450 receives such an indication that the output will be a PCM tile, the Output Packer 450 may perform such conversion on the source pixels received from the Unpacker 410 as mentioned above. In some implementations, the Output Packer 450 may send a signal to the Unpacker 410 to re-send the source pixels prior to performing such conversion.
Encoder 400 also includes an Encoder Controller 460 that controls the flow of information between the other portions of encoder 400 as described above. As will be understood, Unpacker 410, Block Encoder 420, Neighbor Manager 440, Output Packer 450, and Encoder Controller 460 may be implemented in hardware, software, or both in various embodiments. Additionally, encoder 400 may include more or fewer components or modules than those illustrated in
Turning to
At block 804, header pixels are extracted from the sub-tiles of a received tile in the input transaction, such as header pixels for sub-tiles 302, 304, 306, 308 of
Method 800 continues to block 808 where each block of source pixels is encoded/compressed in the same order as the data input stream/transaction of block 802. The encoding/compression may be performed by one or more Block Encoder(s) 420, and in some embodiments each Block Encoder 420 may comprise multiple sub-encoding/compressing engines operating in parallel. In an embodiment, the Unpacker 410 forwards the source pixels of each received block unit to the Block Encoder 420 for compression in the order the block units are received by the Unpacker 410 in the input data stream/transaction. In other words, the Unpacker 410 does not use input buffers or otherwise re-arrange the received block units into a data stream order required by a downstream component (such as a decoder) before sending the source pixels to the Block Encoder 420 for encoding/compression.
In some embodiments, the encoding in block 808 may also be performed using neighbor pixel information related to a pixel being compressed in order to better and/or more efficiently compress or encode the pixel. As discussed above, Block Encoder 420 may receive such neighbor pixel information from a Neighbor Manager 440 of encoder 400, where the Neighbor Manager 440 receives neighbor pixel updates from Unpacker 410 as illustrated in
In block 810 BFLC and Idx codes or values are generated by the Block Encoder 420 for each data block as part of the encoding/compressing of the data block. Each block's Idx codes or values are buffered in block 810, such as in Stream Buffer 430. Stream Buffer 430 may store 4-pixel compressed blocks from the Block Encoder 420 in an embodiment, adding compressed blocks for a multi-media tile as they are received from the Block Encoder 420, until the Output Packer 450 is ready to send an output transaction as described above. Stream Buffer 430 may store the compressed blocks, and provide the compressed blocks to Output Packer 450, in output stream order—i.e. an order needed by a downstream component or module such as a decoder to decompress the multi-media tile.
In block 812, the Block Encoder 420 may determine whether a multi-media tile will be output from the encoder 400 as compressed tile or whether the uncompressed source pixels (arranged in PCM tile format) will be output from the encoder 400. In an embodiment, this determination may be made by the Block Encoder 420 based on the size of the data tile after compression. Depending on the determination at block 812, method 800 continues to either block 814 (output compressed tile) or block 816 (output PCM tile).
In the event that the determination at block 812 is to output compressed blocks, method 800 continues to block 814 where the compressed blocks are packed into the output format. In an embodiment, Output Packer 450 receives for each block, the header pixels from the Unpacker 410, the BFLC codes from the Block Encoder 420, and the Idx values in stream order from the Stream Buffer 430. For compressed blocks, Output Packer 450 inserts the header pixel field, BFLC field, and any padding needed in the padding field for each compressed block, resulting in an output transaction. Method 800 then continues to block 818 discussed below.
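Block 814 can be sketched as concatenating the fields and padding the result. The field widths and byte layout below are hypothetical, since the disclosure names the kinds of fields (header pixels, BFLC codes, stream-ordered Idx values, padding) but not an exact bit layout:

```python
def pack_output(header_pixels, bflc_codes, idx_stream, mal=32):
    """Concatenate header-pixel, BFLC, and stream-ordered Idx fields,
    then pad the transaction up to a multiple of an assumed 32-byte MAL."""
    payload = bytes(header_pixels) + bytes(bflc_codes) + bytes(idx_stream)
    padding = (-len(payload)) % mal
    return payload + b"\x00" * padding

# Four header pixels + four BFLC codes + forty Idx bytes = 48 bytes,
# padded to a 64-byte output transaction.
out = pack_output([10, 20, 30, 40], [1, 2, 3, 4], range(40))
assert len(out) == 64 and out[:4] == bytes([10, 20, 30, 40])
```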
In the event that the determination at block 812 is to output uncompressed blocks, method 800 continues to block 816 where the PCM tiles are processed. The Output Packer 450 will convert the input tile to the PCM tile format for transmission in an output transaction. In an embodiment, when the Output Packer 450 receives an indication that the output will be a PCM tile, the Output Packer 450 may perform such conversion on the source pixels received from the Unpacker 410. The Encoder Controller 460 may send a signal to the upstream module to re-send the source pixels prior to performing such conversion.
Method 800 continues from either block 814 or 816 to block 818 where the final output transaction is generated. Note that in some embodiments of method 800 block 818 may not be a separate step, but may be part of step 814 for compressed tiles and/or step 816 for PCM tiles. Generating the final output transaction, may comprise Output Packer 450 packing the compressed (or PCM) data into an output interface format, which may be 128-bit per output transaction. Additionally, the Output Packer 450 may add metadata to each tile, where the metadata is configured to inform downstream components or modules how big the compressed media tile is. Method 800 then returns.
As noted above for
The various elements may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. In the context of this document, a “computer-readable medium” can be any means that can store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The computer-readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic) having one or more wires, a portable computer diskette (magnetic), a random-access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory) (electronic), an optical fiber (optical), and a portable compact disc read-only memory (CDROM) (optical). Note that the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
In an alternative embodiment, where one or more of Unpacker 410, Block Encoder 420, Neighbor Manager 440, Output Packer 450, and/or Encoder Controller 460 are implemented in hardware, the various hardware logic may be implemented with any or a combination of the following technologies, which are each well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), etc.
Certain steps in the processes or process flows described in this specification naturally precede others for the invention to function as described. However, the invention is not limited to the order of the steps described if such order or sequence does not alter the functionality of the invention. That is, it is recognized that some steps may be performed before, after, or in parallel (substantially simultaneously) with other steps without departing from the scope and spirit of the disclosure. In some instances, certain steps may be omitted or not performed without departing from the invention. Further, words such as “thereafter,” “then,” “next,” etc. are not intended to limit the order of the steps. These words are simply used to guide the reader through the description of the exemplary method.
Although selected aspects of certain embodiments have been illustrated and described in detail, it will be understood that various substitutions and alterations may be made therein without departing from the spirit and scope of the present disclosure, as defined by the following claims.