The present invention relates to the optimization of access to computer memory, including related methods, as well as to a cache memory device and methods for its use. The cache memory device is suited for memory-intensive applications, such as, but not exclusively, those involved in image and motion picture processing. Image processing applications include processing motion picture streams (including high definition video) to facilitate real-time processing tasks such as, but not exclusively, motion estimation, compression, and decompression. Similar procedures are used in other applications that involve searching for data correlation between images or 2-dimensional arrays, for example: image recognition, robotic vision, still image data compression and so on. All of these applications are characterized by their need to repeatedly access large amounts of data stored in a main memory subsystem.
As is well-known, memory-intensive computing operations can be accelerated through the judicious use of faster, generally more costly, cache memory devices. Such cache memory devices store temporary copies of portions of a main memory subsystem and can transmit those copies to a requesting client processor. This provides faster access to the required data stored in that memory if it is present in the cache memory due to a prior reference or a pre-fetch issued in anticipation of the request.
Image processing in general, and digital motion picture processing specifically, represents a class of computing applications that require a huge amount of memory bandwidth. In various video applications, such as video cameras, television, set top boxes, and DVDs, video frame rate and resolution are steadily increasing to improve video quality and the viewer experience. This results in an increasing burden on video processing circuits involved in the acquisition, transmission and playback of video.
To cope with increasing video frame rate and resolution, video compression schemes, such as those defined by the MPEG and MPEG-4 Part 10 (also called H.264 or AVC) industry standards, are typically employed to reduce the video bit rate and corresponding storage and transmission bandwidth requirements. Naturally, this requires compression prior to transmission or storage and decompression prior to use or display.
In certain parts of the video processing chain, however, the processing circuits must work with the uncompressed video samples. This is the case, for example, in the compression circuit itself, which must receive video in uncompressed form and convert it into a compressed form. Similarly, the video decompression circuit performs the opposite task, namely decoding the compressed video to uncompressed form, for example for display or transcoding. Thus, these image processing tasks are typical of those requiring very high data throughput and memory bandwidth.
One common method used in the art in motion picture compression schemes is to encode temporal redundancy (i.e., information shared among two or more frames) through motion estimation (and its counterpart in decompression, motion compensation). In motion estimation, one must search for similarities between rectangular portions of different frames within a motion picture sequence. Starting with a frame to be encoded (the “current frame” or CF) and a specific rectangular region (the “current block” or CB), one must identify a sufficiently similar region (the “reference block” or RB) in another frame (the “reference frame” or RF). The encoding (i.e., compression) entails recording the horizontal and vertical displacement of the CB from the RB as a motion vector and a representation of the differences between the CB and RB. Motion compensation reverses this process to reconstruct the original picture.
Typically in the motion estimation process, the CF is divided into a grid of rectangular areas called macroblocks (for example, the MPEG standard specifies a square area, 16 pixels by 16 pixels). Then, for each macroblock of the CF, a search is made on past and future frames, to find similar areas that can be used as a reference to efficiently encode the current macroblock.
Each candidate RB is brought into the processing circuit, and compared to the current macroblock using an error measurement cost function (for example, the sum of absolute differences (SAD) is commonly used). Potential candidates are those that fall below an arbitrarily determined error or cost threshold. The motion estimation process then chooses the best candidate among a potentially large group of contenders.
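By way of illustration only, the candidate comparison described above may be sketched as follows. The function names, the exhaustive search order, and the convention of reporting the displacement of the candidate relative to the co-located position are assumptions made for this sketch and are not taken from any particular standard or embodiment.

```python
def sad(current_block, candidate_block):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(c - r)
        for cur_row, ref_row in zip(current_block, candidate_block)
        for c, r in zip(cur_row, ref_row)
    )


def best_candidate(current_block, reference_frame, top, left, search_range, threshold):
    """Exhaustively search a window of the reference frame around (top, left).

    Returns (cost, (dy, dx)) for the lowest-cost candidate whose cost falls
    below `threshold`, or None if no candidate qualifies. The (dy, dx) sign
    convention is an assumption of this sketch.
    """
    height = len(current_block)
    width = len(current_block[0])
    frame_height = len(reference_frame)
    frame_width = len(reference_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + height > frame_height or x + width > frame_width:
                continue
            # Each candidate is a fresh read of the reference frame; this is
            # the repeated access that a cache is intended to absorb.
            candidate = [row[x:x + width] for row in reference_frame[y:y + height]]
            cost = sad(current_block, candidate)
            if cost < threshold and (best is None or cost < best[0]):
                best = (cost, (dy, dx))
    return best
```

Even in this small sketch, every candidate position causes another block-sized read of the reference frame, and neighboring candidates overlap heavily, which is why the search benefits from caching.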
This type of search used in motion estimation is one example in which large amounts of video storage and memory bandwidth are required. During the search processing of a single macroblock, many candidate RBs must be transferred into the motion estimation processing circuit from external video frame storage memories.
Other examples of image processing that require repeated access to image data include intra-frame compression, image enhancement, image recognition, robotic vision, and indeed any processes that require access to many pixels to generate one or more output pixels. Similarly, many other applications outside the realm of image processing require access to large amounts of memory to perform their tasks.
The art has been employing simple cache devices to reduce the video memory access bandwidth requirements of the motion estimation circuit. Usually a single rectangular search window (known as a “reference window” or RW) around the co-located position of the current macroblock in a particular reference frame is cached in on-chip memory. U.S. Pat. No. 5,696,698, which is incorporated herein by reference in its entirety, describes such a device for addressing a single rectangular search area cache memory of a motion picture compression circuit. U.S. Pat. No. 7,006,100, which is also incorporated herein by reference in its entirety, notes that it is difficult to use such a device to support two such search areas simultaneously without resorting to the trivial, yet expensive solution of duplicating the cache memory area in its entirety. U.S. Pat. No. 7,006,100 goes on to describe an alternative cache device that can be dynamically configured per frame to one of two modes, i.e., to store either two independent rectangular small search areas or a single logical wide rectangular search area.
State of the art video compression standards, such as the H.264 standard, have introduced new requirements. These improve on previous standards such as MPEG-2, by allowing the current macroblock to be further sub-divided into smaller sub-blocks, where each sub-block may independently specify a motion vector. Further, advances in multi-frame prediction allow many reference frames to participate in the generation of the reference area of the current macroblock, by allowing each sub-macroblock motion vector to specify a different reference frame. In addition, fine grained sub-pixel interpolation requires more surrounding pixels around each reference macroblock area as compared to older compression standards. These advances exacerbate the memory bandwidth problem, since modern video compression motion estimation must now consider many more candidates to make a good selection, thus increasing memory bandwidth demands.
Duplicating conventional cache devices for a single or two rectangular areas to support a much larger number of areas is prohibitively expensive as well as inefficient. The current art lacks a memory caching device that allows motion estimation or other video processing circuits to load and search many reference candidate areas of arbitrary size and shape, including non-rectangular ones, across many reference frames. Furthermore, current video cache designs are limited to predetermined patterns of size and shape and cannot adapt to the dynamic content of the video stream.
Another limitation of the current art is the need for a cache client (i.e., a processor requesting memory) to wait until a memory request has been fully satisfied before a subsequent one may be handled. This prevents or delays the handling of requests for other regions that may, in fact, already be cached and could be delivered while the missed data is being fetched from main storage.
Current image cache designs are further limited in the way they provide data to the client, in that they are typically optimized, if at all, for either progressive (full-frame) or interlaced image handling, but not both. These issues are particularly pertinent with respect to performing a 3:2 pulldown process. 3:2 (or 2:3) pulldown processing is commonly used in video encoding devices for material that originated in cinema film at 24 frames per second and must be converted into NTSC format at 29.97 frames per second or vice versa. This type of processing is also known as telecine, inverse or reverse telecine, cadence correction, and inverse or reverse pulldown.
Another capability lacking in current cache designs is a mechanism to programmatically adjust the mapping between the main memory's address space and the physical storage used for the cache memory to provide optimal performance depending upon the application.
Another limitation of the current art of memory caches used in image and motion picture processing appears in the problem of access to the cache memory components. Typically such cache devices are single-threaded and do not need to simultaneously access the cache memory for writing data retrieved from the main memory and reading from the cache memory to transmit data to the requesting client. In such a case, less expensive single-port memory components are appropriate. If a cache is to be multithreaded, the current art would specify more expensive dual-port memory components or would have to single-thread internal operations against a bank of single-port memories, compromising the cache device's performance.
Another factor militating against the creation of a multithreaded cache device is the need for a highly efficient control structure to facilitate communication between the various sub-processors working together within the cache device.
While the art cited above provides simple architectural features to optimize caching for macroblocks used in motion estimation, the art lacks a generalized architectural approach that would optimize caches for sub-macroblocks and the more general case of caching two-dimensional data.
There is thus a widely recognized need for, and it would be highly advantageous to have, a cache memory device devoid of the above limitations.
According to one aspect of the present invention there is provided a cache memory device for use in an image or motion picture processing system, said cache memory device being located between a main memory and a requesting processor, the main memory storing images, said images having an image width and an image height, said images being divisible into blocks, each block having a block width and a block height being less than or equal to the image width and image height respectively, the cache memory device being configured so as to temporarily locate arbitrary ones of said blocks in said cache memory device thereby to improve retrieval performance.
According to another aspect of the present invention there is provided a cache memory device for use in image or motion picture processing systems, said cache memory device being located between a main memory and a requesting processor, the main memory storing images, each image having an image width and an image height, each said image being divisible into blocks, each block having a block width and a block height being less than or equal to the image width and image height respectively, the requesting processor being configured to issue requests to the cache memory device for arbitrary portions of an image stored in the main memory, said requests having a request width and request height less than the image width and image height respectively, the cache memory device being configured so as to temporarily locate arbitrary ones of said blocks in said cache memory device to improve retrieval performance, and the cache memory device comprising a cache logic circuit engine able to service multiple requests from the requesting processor simultaneously.
According to another aspect of the present invention there is provided a cache memory device for location between a main memory and a requesting processor, the main memory storing memory blocks, some of which are temporarily located in said cache memory device to improve retrieval performance, said cache memory device configured to receive requests for respective ones of said memory blocks, said cache memory device comprising:
an input pooling unit for pooling incoming requests for blocks of memory; and
a request selection and servicing mechanism configured for selecting amongst and servicing requests in said pool for memory block retrieval, said selecting and servicing being according to a first optimization criterion, thereby to optimize operation of said cache.
According to another aspect of the present invention there is provided a method for storing and delivering memory blocks from a memory storage device to a client processor requesting said memory blocks, said memory storage comprising a plurality of independently accessible memory banks, said memory blocks being of a given width and height such that the height comprises one or more successive groups of four rows, each said group having a first, second, third, and fourth row, successively; the method comprising storing the rows within each said group such that the first and fourth rows are stored in one of said plurality of memory banks, and the second and third rows are stored in another of said plurality of memory banks, thereby permitting concurrent transmission of data from any one of the following combinations of rows:
First row and second row, or
Third row and fourth row, or
First row and third row, or
Second row and fourth row.
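The row-to-bank assignment described above may be illustrated by the following minimal sketch. It assumes exactly two banks, labeled 0 and 1 for illustration, and uses 0-based row indices, so that the "first row" of a group is the row whose index is a multiple of four.

```python
def bank_for_row(row):
    """Return the bank (0 or 1) holding the given 0-based row index."""
    position = row % 4  # position of the row within its group of four
    # First and fourth rows of a group share one bank; second and third
    # rows share the other.
    return 0 if position in (0, 3) else 1


def can_read_concurrently(row_a, row_b):
    """Two rows can be transmitted together only if they lie in different banks."""
    return bank_for_row(row_a) != bank_for_row(row_b)
```

Under this assignment, each of the four listed combinations draws one row from each bank, so a pair of consecutive rows (first and second, or third and fourth) and a pair of alternating rows (first and third, or second and fourth) can each be read without a bank conflict.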
According to another aspect of the present invention there is provided a cache memory device for location between a main memory and a requesting processor, the main memory storing memory blocks, some of which are temporarily located in said cache memory device to improve retrieval performance, and comprising a plurality of single-port cache memory components for storing respective memory blocks, said cache memory device configured with a controller to select memory blocks for transmission from said cache memory device to the requesting processor according to a first criterion, the first criterion being that writing of data is permitted to a first of said memory components and reading of data simultaneously with said writing is permitted from at least one other of said memory components.
According to another aspect of the present invention there is provided a cache memory device for location between a main memory and a requesting processor, the main memory storing memory blocks, some of which are temporarily located in said cache memory device to improve retrieval performance, said cache memory device configured to receive requests for respective ones of said memory blocks, said cache memory device comprising a content-addressable memory structure for maintaining the state of the cache memory and the relationship between the main memory's address space and the cache memory's address space.
According to another aspect of the present invention there is provided a cache memory device for location between a main memory configured to store an image of a given width Wimage and height and a requesting processor, the image comprising memory blocks, some of which are temporarily located in said cache memory device to improve retrieval performance, said cache memory device configured to receive requests for respective ones of said memory blocks, said cache memory device comprising a plurality of J sub-caches, each sub-cache comprising cache blocks of a given width Wcache-block and height, said Wcache-block being less than Wimage, and the image being logically divided into groups of J vertical stripes, each said vertical stripe being of width Wcache-block, and each sub-cache being associated with exactly one vertical stripe of each group of J vertical stripes.
According to another aspect of the present invention there is provided an apparatus for accepting a plurality of memory requests from a requesting processor, also known as a client, which apparatus may use various criteria to optimize the performance of the cache.
According to another aspect there is provided an output pooling unit to buffer the results of a plurality of memory requests. Interim results of fetching memory from the main storage are stored in the output pooling unit. When a memory request is fully satisfied and stored in the output pooling mechanism, the results are transmitted to the requesting processor.
According to another aspect of the present invention there is provided a method for storing data within the cache memory, said method permitting simultaneous transmission to the requesting processor of either consecutive rows of memory or alternating rows of memory, according to the request of the requesting processor. Note that the terms “row” or “rows” and “line” or “lines” are used interchangeably when referring to the horizontal portions of a rectangular portion of memory.
According to another aspect of the present invention there is provided an apparatus for using relatively inexpensive, single port memory components as the cache memory storage while supporting simultaneous read and write operations to the overall cache memory by forbidding simultaneous read and write access to an individual component.
According to another aspect of the present invention there is provided a Meta-cache Control Block (MCB), a content-addressable memory structure to maintain state information about the cache memory. The MCB provides a plurality of planes of elements, where each plane corresponds to a single memory request and the elements therein correspond both to portions of the requested memory block and to the cache memory blocks that will hold those portions. The MCB maintains the state of each of the cache memory blocks and the mapping between the main memory address space and the cache memory address space. A further refinement of the address-mapping function incorporates a programmable address-shuffling mechanism to allow fine-tuning and optimization of the mapping from the main memory address space into the cache memory address space.
According to yet a further aspect of the present invention there is provided the use of a plurality of J sub-caches. The main memory stores pictures (e.g., images, frames, fields) and each picture has a width and height. The sub-caches comprise cache blocks of a given width and height less than the width and height respectively of the pictures stored in the main memory. A picture stored in main memory is divided into successive vertical stripes, each of the same width as that of the cache blocks. The vertical stripes are treated as groups of J successive vertical stripes. In each such group, one sub-cache corresponds to one vertical stripe. As portions of the main memory picture are read into the cache, adjacent portions of the main memory will be read into differing sub-caches. Similarly, when transmitting memory from the cache to the requesting processor, horizontally adjacent portions will be read from differing sub-caches.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The materials, methods, and examples provided herein are illustrative only and not intended to be limiting.
Implementation of the method and system of the present invention involves performing or completing certain selected tasks or steps manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of preferred embodiments of the method and system of the present invention, several selected steps could be implemented by hardware or by software on any operating system of any firmware or a combination thereof. For example, as hardware, selected steps of the invention could be implemented as a chip or a circuit. As software, selected steps of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In any case, selected steps of the method and system of the invention could be described as being performed by a data processor, such as a computing platform for executing a plurality of instructions.
It should be appreciated that certain aspects of the invention are not limited to the task of motion estimation in video encoding and can also be applied to the task of block matching or image area correlation for image recognition, robotic vision, still-image data compression, and other applications that may require searching for the best matching block of pixels or data values on a 2-dimensional image or data field. Still other aspects of the invention apply to cache memory devices generally, not limited to a specific application, while other aspects apply to memory access generally, not specifically within the context of systems that use cache memory.
The invention is herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in order to provide what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.
In the drawings:
The present embodiments comprise an apparatus and a method for a high-performance, multi-processing cache memory device. The cache can accept a plurality of requests from a requesting processor client for memory blocks from a main memory. The cache temporarily stores portions of the main memory in the cache memory and transmits the requested memory to the client. The handling of the memory requests is optimized through various means, including pooling of requests and selecting according to various criteria, parallel processing of requests throughout the cache, use of a content-addressable memory structure to maintain state and provide communication between the various cache components, and extensive pipelining of operations within components.
The principles and operation of an apparatus and method according to the present invention may be better understood with reference to the drawings and accompanying description.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.
As noted above, the invention addresses the needs of any memory-intensive application. It also includes special constructs in support of memory used to represent two-dimensional objects, such as, but not limited to, pictures and video frames. In the illustrative preferred embodiments described herein, the invention's use in the exemplary application of a video processing system is described, including its use for motion estimation and compensation. It is to be understood that the embodiments for these applications are described for the purpose of better understanding the invention and should not be construed as limiting the invention to these embodiments or applications.
Reference is now made to
The input pooling unit and controller allow the cache to multithread memory requests. For example, in a preferred embodiment when selecting a pending request from the input pool, the controller may be constructed in such a way that a request for memory that is already in the cache would be serviced rapidly, directly from the cache memory, without waiting behind a request that needs to be serviced from the main memory.
The controller may also be configured to weight priorities according to the time a request was received. As the age increases, the priority for handling the request is increased. This can help ensure that a request is not neglected.
The controller may also be constructed so as to adjust priority for processing when a request in the input pool overlaps a currently active request or another pending request.
Another criterion the controller may use is the location in the main memory of the requested memory block. In some embodiments, performance may be optimized by clustering requests. Other embodiments may assign higher priority to requests that reference differing memory banks.
Similarly, when the requested memory is present in the cache, the location of that memory within the cache memory may be a factor in prioritizing requests for selection.
Among the criteria that may be considered in certain embodiments is the concurrent occurrence of a write operation to a portion of cache memory. For example, selecting a request that does not collide with the write operation can improve the cache performance.
Many other criteria may also be considered when selecting requests from the input pooling unit depending on the embodiment.
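By way of example only, several of the selection criteria discussed above (cache hit, request age, and collision with a concurrent write) may be combined into a single priority score, as in the following sketch. The field names, the weights, and the additive scoring scheme are hypothetical and serve only to illustrate how such criteria might interact.

```python
from dataclasses import dataclass


@dataclass
class PendingRequest:
    request_id: int
    arrival_time: int   # cycle at which the request entered the input pool
    is_cache_hit: bool  # True if all requested data is already in the cache
    cache_bank: int     # cache memory component that would be read


def priority(request, now, bank_being_written,
             hit_bonus=100, age_weight=1, collision_penalty=50):
    """Hypothetical score: older requests and cache hits rank higher;
    requests that would read a component currently being written rank lower."""
    score = age_weight * (now - request.arrival_time)
    if request.is_cache_hit:
        score += hit_bonus
    if bank_being_written is not None and request.cache_bank == bank_being_written:
        score -= collision_penalty
    return score


def select_request(pool, now, bank_being_written=None):
    """Pick the highest-priority pending request, or None if the pool is empty."""
    if not pool:
        return None
    return max(pool, key=lambda r: priority(r, now, bank_being_written))
```

With these illustrative weights, a recent cache hit is selected ahead of a slightly older miss, while a miss that has waited long enough eventually outranks newer hits, so that no request is neglected.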
In one preferred embodiment, the cache transfers partial results to the client as they become available and the client manages issues related to receiving that partial data. In another preferred embodiment, the cache incorporates an output pooling unit, which serves to buffer the interim partial results of multiple active requests and transfers the results to the client when certain conditions are met and according to certain criteria.
One condition that may be considered is to transfer the results only when the full result of the request is available. Alternatively, results may be transferred only in a given order. For example, the output controller can guarantee that rows of a rectangular memory request will be delivered to the client in order from top to bottom.
When the output pooling unit is divided into sub-units, the output controller can take into account the respective utilization of space in each sub-unit. For example, priority could be given to delivering data from a sub-unit that is the most full, thus freeing up space in that sub-unit for subsequent requests.
In the case where there are sub-units in the output pooling unit, the input selection controller can also consider the utilization of space within the sub-units, giving a higher priority to input requests that will ultimately be handled by output sub-units that are less full.
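The two uses of sub-unit utilization described above may be sketched as follows. The data structure, the measure of fullness as a simple ratio, and the requirement that a sub-unit hold a complete result before it is drained are assumptions made for the purpose of illustration.

```python
from dataclasses import dataclass


@dataclass
class OutputSubUnit:
    name: str
    capacity: int              # total storage, in arbitrary units (assumed > 0)
    used: int                  # storage currently occupied
    has_complete_result: bool  # True if a fully satisfied request is buffered

    @property
    def fullness(self):
        return self.used / self.capacity


def next_subunit_to_drain(subunits):
    """Output side: among sub-units holding a complete result, prefer the fullest."""
    ready = [s for s in subunits if s.has_complete_result]
    return max(ready, key=lambda s: s.fullness, default=None)


def preferred_input_target(subunits):
    """Input side: prefer requests destined for the least full sub-unit."""
    return min(subunits, key=lambda s: s.fullness, default=None)
```

Draining the fullest sub-unit first frees space where it is scarcest, while steering new requests toward the emptiest sub-unit reduces the chance that a request stalls for lack of output space.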
Additional examples of the operation of the cache with regard to these features and criteria are presented in the ensuing discussions of additional embodiments.
Another aspect of the invention is the storage method shown in
The storage method shown in
This method thus provides improved performance when accessing two dimensional data that may be accessed either in terms of consecutive rows or alternate rows. In both cases the memory can transmit the data in a single cycle. The above examples from video processing are not intended to limit the scope of the invention; this method is applicable to any storage of two dimensional data. Furthermore, it is not limited to embodiments pertaining to storage within a cache device nor to storage in systems incorporating a cache device.
Reference is now made to
Reference is now made to
The state information maintained in the MCB may include details of one or more active requests, such as address and extent of the requested memory and a related set of MCBEs. The MCB also maintains state information about the cache memory, such as whether requested memory, or a portion thereof, is in the cache (i.e., a cache hit) or has been requested from the main memory or is in transit from the main memory to the cache memory. It also maintains address mapping information, to map from an address space referencing the main memory to the corresponding locations within a cache memory address space. By maintaining this state information in a content-addressable memory structure as described, a cache memory device may operate with greater efficiency and efficacy.
For example, in a preferred embodiment, a cache sub-system processor responsible for looking up the presence of a given memory block may consult the MCB along with the conventional cache tag memories. Depending on the results, the look-up processor may update the MCB to indicate a cache miss. An independent subsystem processor responsible for fetching memory from the main memory into the cache memory can independently consult the MCB to rapidly find a cache miss to service. Once the data is transferred from the main memory to the cache memory, the MCB is updated to indicate it is now present. Yet another independent subsystem processor managing the transfer of memory from the cache can independently and rapidly consult the MCB to detect when an appropriate set of memory blocks is available for transfer. As shown by the cited example, the MCB enables efficient and independent operation of a variety of cache subsystem processors that are able to cooperate and communicate through the medium of the MCB.
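The cooperation described above may be modeled in software as a shared table of per-block states that each subsystem processor consults and updates independently. The following is a simplified sketch only: a dictionary stands in for the content-addressable hardware structure, and the class name, method names, and three-state model are hypothetical.

```python
from enum import Enum


class BlockState(Enum):
    MISS = "miss"            # not in the cache and not yet requested
    REQUESTED = "requested"  # requested from, or in transit from, main memory
    PRESENT = "present"      # available in the cache memory


class MetaCacheControlBlock:
    def __init__(self):
        # Maps (request_id, main_memory_address) to the state of that block.
        self._elements = {}

    def record_lookup(self, request_id, main_address, hit):
        """Look-up processor: record a hit or a miss for one block of a request."""
        state = BlockState.PRESENT if hit else BlockState.MISS
        self._elements[(request_id, main_address)] = state

    def next_miss(self):
        """Fetch processor: claim one outstanding miss, or return None."""
        for key, state in list(self._elements.items()):
            if state is BlockState.MISS:
                self._elements[key] = BlockState.REQUESTED
                return key
        return None

    def mark_arrived(self, main_address):
        """Fetch processor: data for main_address is now in the cache memory."""
        for key, state in list(self._elements.items()):
            if key[1] == main_address and state is BlockState.REQUESTED:
                self._elements[key] = BlockState.PRESENT

    def request_ready(self, request_id):
        """Output processor: True when every block of the request is present."""
        states = [s for (rid, _), s in self._elements.items() if rid == request_id]
        return bool(states) and all(s is BlockState.PRESENT for s in states)
```

In this model no processor calls another directly; each one only reads and writes the shared state, which mirrors the way the subsystem processors communicate through the medium of the MCB.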
In a preferred embodiment of the cache for use in systems dealing with two-dimensional data, such as still images, or frames from a motion picture, the MCB 40 may be arranged in an embodiment as shown in
Reference is now made to
Attention is now drawn to the portion 55 of the image 51 stored in main memory. This portion represents a single row of information of height Hcache-block from the image and is contained within the vertical stripes numbered (per the “Vertical stripe index”) 7, 8, 9, 10, 11, and 12 and corresponding to sub-caches 7, 0, 1, 2, 3, and 4, respectively. Thus, this portion of the image may be stored in six cache blocks. Each of the six cache blocks will be stored in its corresponding sub-cache. Thus, when all six cache blocks are present in the cache, it is possible to transmit this portion of the picture to a requesting client processor in one parallel operation, each sub-cache contributing its cache block. Furthermore, many embodiments are possible where various parameters may be tuned to achieve various cost-performance trade-offs. These parameters include, but are not limited to, the number of sub-caches, the size of cache blocks, the maximum size of a client request, number of pending requests, number of concurrent active requests, number of cache-blocks stored in each of the sub-caches, size of the sub-caches, associativity of each sub-cache, number of lookup pipes, number of tag memories (which can potentially be smaller than the number of sub-caches in certain embodiments), optional output pooling units, and number of pending backend requests.
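The correspondence between vertical stripes and sub-caches in the above example is consistent with a simple modulo mapping over J = 8 sub-caches, as sketched below. Both the modulo rule and the value J = 8 are inferred from the stripe and sub-cache numbers given above and are assumptions of this sketch; the cache block width of 16 pixels used in the check is likewise hypothetical.

```python
def sub_cache_for_stripe(stripe_index, num_sub_caches):
    """Assumed mapping: stripes are assigned to sub-caches in rotation."""
    return stripe_index % num_sub_caches


def sub_caches_for_span(x_start, width, cache_block_width, num_sub_caches):
    """Sub-caches touched by a horizontal span of pixels starting at x_start."""
    first_stripe = x_start // cache_block_width
    last_stripe = (x_start + width - 1) // cache_block_width
    return [
        sub_cache_for_stripe(stripe, num_sub_caches)
        for stripe in range(first_stripe, last_stripe + 1)
    ]


# Stripes 7 through 12 with eight sub-caches, as in the example above.
assert [sub_cache_for_stripe(s, 8) for s in range(7, 13)] == [7, 0, 1, 2, 3, 4]
```

Because any span covering no more than J stripes maps each stripe to a different sub-caches, all of its cache blocks can be read in one parallel operation.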
Additional preferred embodiments of the present invention are now described in greater detail. These illustrative exemplary embodiments are intended to demonstrate how the various parts of the invention interact and to show how an operating cache device comprising these components may be built and operate. As stated above, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.
According to an embodiment of the invention, the cache logically divides reference pictures into a grid of small blocks (called picture blocks), each block being the size of a cache block. Thus the cache block dimensions match the picture block dimensions, each of which is a bx (horizontal)×by (vertical) array of pixels, each pixel consisting of bd (depth) bits. In a specific embodiment, the values of bx and by may be optimized, for example by being made powers of two, smaller than a macroblock size. As a further optimization, the number of bits bx*by*bd in a cache block may closely match the number of bits in a single burst or a small number of bursts of the external memory storage interface, such as (but not limited to) a DDR memory.
According to this embodiment, the cache device stores Ncb bx×by cache blocks simultaneously in Nbanks internal cache memory banks. Each cache block can come independently from any one of many positions in any particular reference frame among a set of several simultaneous reference frames, the number of simultaneous reference frames being limited only by the amount of main memory dedicated to storage of said frames in a particular embodiment.
According to a further embodiment of the invention, various larger object shapes, not necessarily rectangular, can be stored in the cache by virtue of the fact that the images are divided into cache blocks of a finer granularity than cache devices of the prior art support. The effect of using many cache blocks of finer granularity is that common reference shapes, not necessarily rectangular in the aggregate tiling of cache blocks (e.g., circular or tree-like), are stored in the cache without special configuration. Furthermore, the use of the cache blocks adapts dynamically to the changing content of the motion picture stream being processed.
In a preferred embodiment used for video applications, the cache resides between a requesting processor, such as a video processing circuit (for example a motion estimation circuit), also called the “front-end client” or “client”, and a memory controller circuit connected to an external frame storage memory (such as DDR), together called the “main memory”, “main storage”, “external storage”, “memory backend”, or “backend”. The client makes requests of the cache for the purpose of reading reference pictures. One of the objects of this invention's cache design is to provide faster overall servicing and parallel handling, and to maximize the amount of video reference picture data provided to the front-end client while minimizing the amount of data requested from the backend, hence reducing the bandwidth required of the external memory.
In a preferred embodiment, the cache is configured to simultaneously hold portions of Nref reference pictures of various types (for example, frames, fields, luminance, and chrominance pictures), where each picture size is configured to a maximum size of Sx×Sy pixels.
In a preferred embodiment, it is useful to relate to several different addressing schemes, or modes, used to reference data in various units throughout the embodiment. For example, one or more of the following address spaces may apply when used in an embodiment for processing streams of motion pictures:
In an illustrative preferred embodiment, the communication flow proceeds as follows:
In a preferred embodiment, the cache supports Nactive simultaneous active client requests for reference areas. Partial overlap of the requests is automatically identified, so that if a particular cache block is referenced in two or more simultaneous requests and is missing in the cache, it will be fetched from the backend only once, for the benefit of all the requests.
Furthermore, if the processing of a particular request is stalled, waiting for the backend to complete a cache block transfer, the mechanisms of the present embodiments continue to service other pending requests that hit (i.e., are already present in) the cache, allowing the client to receive those other requests in the meanwhile.
An additional capability provided by the present embodiments is automatic support for partial or full out-of-picture motion vectors, which are used in newer compression standards such as MPEG-4 and H.264. The present embodiments perform field and frame-based pixel padding automatically for out-of-picture regions, removing this burden from the client. Thus, if a client requests an area that is partially out of the frame, the cache will satisfy the request in a manner transparent to the client.
The cache also supports special storage and accessing modes for interlaced video contents, while providing the client with a simple interface to receive a field or frame region, regardless of the storage format of the reference picture or pictures involved in that request. For example, a top field reference area in an interlaced coded content can be extracted out of a frame picture, or vice versa—a frame area can be constructed by combining different fields of different frame pictures, which can be used, for example, to aid in performing a real-time 3:2 pulldown process. Those modes will be explained in more detail below.
For purposes of illustration only, the exemplary preferred embodiment of the present invention described in the drawings uses the following parameters:
a cache block of size bx=8, by=4, with bd=8-bit pixels;
Nbanks=8 sub-caches, each of them Na way associative;
Na=4 (i.e., the cache is 4-way associative);
Ntags=8 tag memories;
maximum picture size Sx=1920, Sy=1088;
maximum client request size 28 by 28 pixels;
Nactive=4 simultaneous active client requests;
Npending=4 pending client requests in the input pool;
Nlp=5 lookup pipes; and
client transfer rate of two lines of 28 pixels each simultaneously in one clock cycle.
It is understood that many configurations of these and other parameters are possible within the scope of the present invention and different embodiments may reflect this variety.
Once the motion estimator 125 decides on the best candidates, the reference area is usually interpolated by the sub-pixel interpolator 140, passed to a frequency domain transformer and quantizer 145, and then entropy coded using variable length, arithmetic, or other means by the video entropy encoder 150. The encoder 150 provides the final compressed video bitstream output of the overall video encoder circuit.
In parallel, the frequency domain transformer output is passed through an inverse quantizer 155, an inverse frequency domain transformer 160, and stored back in the storage 115 using the backend controller module 110.
As part of the video encoding process, each macroblock of the current picture is divided into smaller sub-blocks. Then, the motion estimator 125 needs to find the best reference sub-block for each current sub-block among several frames.
With reference to
Each picture is divided into small bx×by blocks, 8×4 in the present example. This area is smaller than the typical 16×16 macroblock size, as can be seen in
For example, the vertical stripes 355 and 360 map into sub-cache 305, while stripes 365 and 370 map into sub-cache 310. In an embodiment where the maximum horizontal client request size is smaller than Nbanks*bx pixels, this scheme ensures that the client-requested horizontal pixels are always present in the cache simultaneously and enables the cache to transfer one or more complete rows 380 from the reference area 375 back to the client in a single clock cycle.
The above group of picture block elements that map to a particular single sub-cache (such as 355 and 360) contend for placement inside the same sub-cache (in this example, 305). (Note that adjacent vertical stripes, such as 355 and 365, map to different sub-caches and do not contend.) To alleviate the contention, a programmable two-dimensional to single-dimensional mapping is performed, along with further division of each of the sub-caches into several associative sets (four in the preferred embodiment).
With reference to
blk_x: 5 bits (after dividing by 8 to discard 3 low-order bits)
blk_y: 9 bits
color: 2 bits
pic_nr: 4 bits
It is understood that this particular mapping is dependent on the various parameters chosen and would be adapted accordingly to fit other embodiments.
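The four fields listed above total 20 bits. As an illustration only, the sketch below packs them into a single integer. The order of concatenation (blk_x in the lowest bits, pic_nr in the highest) is an assumption made for this sketch, as is the reading that the three discarded low-order bits of the horizontal block index select the sub-cache. The function name pack_virtual_address is hypothetical.

```python
# Field widths from the list above; the low-to-high order is an ASSUMPTION.
FIELDS = [("blk_x", 5), ("blk_y", 9), ("color", 2), ("pic_nr", 4)]


def pack_virtual_address(x_pixel, y_pixel, color, pic_nr,
                         bx=8, by=4, n_banks=8):
    """Return (sub_cache, virtual_address) for the block containing a pixel."""
    block_col = x_pixel // bx
    sub_cache = block_col % n_banks  # assumed use of the 3 discarded bits
    values = {
        "blk_x": block_col // n_banks,
        "blk_y": y_pixel // by,
        "color": color,
        "pic_nr": pic_nr,
    }
    address, shift = 0, 0
    for name, width in FIELDS:
        if not 0 <= values[name] < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        address |= values[name] << shift
        shift += width
    return sub_cache, address


# Bottom-right pixel of a 1920x1088 picture still fits the field widths.
corner = pack_virtual_address(1919, 1087, 0, 0)
```

With the maximum picture size of 1920×1088 and 8×4 blocks, the horizontal block index needs 8 bits (5 after discarding 3) and the vertical block index needs 9 bits, which matches the widths given.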
To further alleviate block contention, the single-dimension virtual address is passed through a programmable multiplexer shuffling network 435, which performs a permutation of the original concatenated bits. This allows fine-tuning of how blocks in the original vertical stripes group are mapped to the physical address space, and which blocks map to the same physical address. For example, in one embodiment, the mapping used can be [5:0], [6:1], [7:2], [8:3], [0:4], [1:5], [2:6], [3:7], [4:8], [9:9], [10:10], [11:11], [12:12], [13:13], [14:14], [15:15], [16:16], [17:17], [18:18], and [19:19], which maps bit 5 into bit 0, bit 6 into bit 1, etc. This particular mapping, for example for a preferred embodiment, improves the cache hit ratio by mapping vertically adjacent two-dimensional blocks, as well as nearby horizontal blocks (i.e., successive horizontal blocks spaced eight blocks apart in this embodiment), into different physical locations in each sub-cache, thereby reducing the incidence of collisions.
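The example permutation can be modelled in software as a list of (source bit, destination bit) pairs. This is only a behavioural sketch of the mapping given above, not a description of the multiplexer hardware; the names SHUFFLE and shuffle_bits are hypothetical.

```python
# (source_bit, destination_bit) pairs taken from the example mapping:
# bits 5..8 move to 0..3, bits 0..4 move to 4..8, bits 9..19 stay in place.
SHUFFLE = [(5, 0), (6, 1), (7, 2), (8, 3),
           (0, 4), (1, 5), (2, 6), (3, 7), (4, 8)]
SHUFFLE += [(bit, bit) for bit in range(9, 20)]


def shuffle_bits(address: int, mapping=SHUFFLE) -> int:
    """Permute the bits of a 20-bit virtual address."""
    result = 0
    for src, dst in mapping:
        result |= ((address >> src) & 1) << dst
    return result


moved_low = shuffle_bits(1 << 5)   # bit 5 lands in bit 0
moved_high = shuffle_bits(1)       # bit 0 lands in bit 4
```

Since every source bit and every destination bit appears exactly once, the mapping is a true permutation and no two distinct virtual addresses collapse into one.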
Each sub-cache is further divided into Na associative sets (4 in the preferred embodiment shown) such that up to Na virtual blocks that happen to map into the same physical address may simultaneously co-exist in the cache, which further alleviates block contention.
After creating a one-dimensional virtual address and mapping it as described above, the least significant bits 445 (six in the preferred embodiment shown) are used as the block physical address, and describe the location of that block inside the physical cache data memories. The remaining 14 bits 450 are used to differentiate between the various virtual blocks that can be mapped to the same physical location in the manner commonly employed in the art for n-way associative cache memories with the use of tags.
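The split of the mapped 20-bit address into a 6-bit physical block address and a 14-bit tag can be sketched as follows. The helper name split_address is hypothetical.

```python
PHYS_BITS = 6   # least significant bits: block physical address
TAG_BITS = 14   # remaining bits: tag
TOTAL_BITS = PHYS_BITS + TAG_BITS


def split_address(mapped: int):
    """Split a mapped 20-bit address into (physical_address, tag)."""
    if not 0 <= mapped < (1 << TOTAL_BITS):
        raise ValueError("address must fit in 20 bits")
    physical = mapped & ((1 << PHYS_BITS) - 1)
    tag = mapped >> PHYS_BITS
    return physical, tag


physical, tag = split_address(0xABCDE)
```

Two virtual blocks with the same physical address but different tags compete for the same location in a sub-cache, which is the case the associative sets are meant to absorb.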
The cache architecture in a preferred embodiment is described below from the point of view of the actions performed in response to a single client request from the moment it is received by the cache until the moment the cache satisfies the request. It should be borne in mind, however, that multiple client requests can be active simultaneously, and each one can be in a different processing stage. In that regard, the invention's cache microarchitecture is designed as a pipeline; although there is an initial latency, afterwards data is streamed to the client at a very high rate, much faster than 1/latency (up to 140 Gbits/sec in one particular experimental embodiment).
For each sub-block, the client 505, assumed to be a motion estimator in this example, passes to the cache an area request 507 for an area of memory. A request can be submitted every clock cycle.
Each request contains the following arguments:
The request, along with its arguments, thus describes a rectangular reference area in one or two of the available reference pictures, as well as the way in which this reference area is to be delivered to the client.
Each of the arguments of the area request is now described in greater detail.
Argument 1, the picture number, contains an index that refers to one of a group of 16 (in this embodiment) reference pictures. Each reference picture has an associated frame descriptor that resides in the configuration block 590. The descriptor describes the picture size, storage format, and external storage address at which it can be found. In this embodiment, two groups of reference pictures are maintained, one group for the luminance component of the pictures, and one group for the chrominance Cb/Cr color components, selected by argument 2—color.
Arguments 3 and 4 locate the top-left corner of the requested rectangular reference area inside the picture specified by arguments 1 and 2. Arguments 5 and 6 describe the width and height (horizontal and vertical extent) of the requested rectangular area.
Argument 7, the request type, specifies the action for the cache to perform on the reference area. In this embodiment, it can be either
1. read, or
2. bring_cache.
The read request type is a common request type used by the client to ask for a particular reference picture area from the cache. The bring_cache request type, on the other hand, asks the cache to load the specified rectangular area into the cache, without providing it to the client (i.e., a pre-fetch or pre-load action). This mechanism can be used by the client to reduce the cache miss penalty, by interleaving bring_cache requests for future reference areas in parallel to read requests for current reference areas. When those areas are needed, they will already be available inside the cache.
Moving picture video can be either interlaced or progressive (also called full frame). In progressive video, an entire frame is generated at once (for example, film material is usually shot at a rate of 24 frames per second, thus an entire frame is shown every 1/24 second). On the other hand, an interlaced frame comprises two fields, one containing the even scan lines of the picture while the other contains the odd scan lines of the picture. Each interlaced field in a frame has a time offset from the other field. For example, in NTSC video, the fields occur at intervals of approximately 1/60 second. Thus, within a frame the time difference between the even and odd fields is 1/60 second, versus 0 in the progressive case, i.e., in progressive video, both fields are present together.
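The relationship between a frame and its two fields can be illustrated with a small sketch that separates even and odd scan lines and weaves them back together. It only illustrates the definition above; it does not model the cache's request subtypes, and the function names are hypothetical.

```python
def split_fields(frame_rows):
    """Return (even_field, odd_field) from a list of scan lines."""
    return frame_rows[0::2], frame_rows[1::2]


def weave_fields(even_field, odd_field):
    """Interleave two fields of equal length back into a frame."""
    frame = []
    for even_line, odd_line in zip(even_field, odd_field):
        frame.extend([even_line, odd_line])
    return frame


frame = [f"line{i}" for i in range(8)]
even, odd = split_fields(frame)
rebuilt = weave_fields(even, odd)
```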
One aspect of the present invention's cache is that it can be optimized for both interlaced and progressive applications. This is achieved by making the basic bx×by cache block (8×4 in the present example) contain by lines from either a single field, or both fields (i.e., from a single progressive frame). The backend storage can, in a similar manner, be optimized either for field or frame storage by deciding whether its basic block (typically the amount of memory delivered in a single burst) will contain data from a single field or from both fields. Additionally, a video can switch between interlaced and progressive content, and occasionally a field can be dropped altogether, as done in cadence correction (3:2 pulldown processing of film material which was pulled up from cinema at 24 frames per second to NTSC television at approximately 30 frames per second). Video compression art therefore supports both interlaced and progressive coding tools for each macroblock, and the cache client may want to consider both frame and field reference areas.
It adds considerable complexity to the client to support the various ways in which a field can be extracted from a frame, or a frame can be combined from several fields. The present embodiment of the cache therefore performs extraction or combining of fields for the client, and uses argument 8, the request subtype, to communicate the desired action. The client may use the request subtype to specify the method in which interlaced and progressive video storage formats are to be handled before being returned to the client. The method may be one of:
1. frame_from_frame,
2. field_from_frame,
3. field_from_field,
4. frame_from_fields,
5. frame_from_field_frame,
6. frame_from_frame_field,
7. frame_from_frames.
The handling of these different request subtypes is illustrated with reference to
Arguments 9-11 are sub-arguments for the subtype argument, as described above. Finally, argument 12, request id, is a client-supplied identifier associated with the request.
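The twelve arguments described above can be gathered into a single record. The sketch below is one possible software representation; all field names are hypothetical, the limits (16 pictures, 28×28 pixels) are those of the present example, and the meaning of the three sub-arguments is left open because it depends on the subtype.

```python
from dataclasses import dataclass
from enum import Enum


class RequestType(Enum):
    READ = "read"
    BRING_CACHE = "bring_cache"


class RequestSubtype(Enum):
    FRAME_FROM_FRAME = 1
    FIELD_FROM_FRAME = 2
    FIELD_FROM_FIELD = 3
    FRAME_FROM_FIELDS = 4
    FRAME_FROM_FIELD_FRAME = 5
    FRAME_FROM_FRAME_FIELD = 6
    FRAME_FROM_FRAMES = 7


@dataclass(frozen=True)
class AreaRequest:
    pic_nr: int                 # argument 1
    color: int                  # argument 2
    x: int                      # argument 3 (may lie outside the picture)
    y: int                      # argument 4 (may lie outside the picture)
    width: int                  # argument 5
    height: int                 # argument 6
    req_type: RequestType       # argument 7
    subtype: RequestSubtype     # argument 8
    sub_args: tuple             # arguments 9-11, subtype dependent
    request_id: int             # argument 12

    def __post_init__(self):
        if not 0 <= self.pic_nr < 16:
            raise ValueError("picture number out of range")
        if not (1 <= self.width <= 28 and 1 <= self.height <= 28):
            raise ValueError("request larger than 28x28 pixels")
        if len(self.sub_args) != 3:
            raise ValueError("expected three sub-arguments")


request = AreaRequest(3, 0, -4, 100, 28, 16, RequestType.READ,
                      RequestSubtype.FRAME_FROM_FRAME, (0, 0, 0), 42)
```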
In this preferred embodiment, data will be returned to the client in a sequential vertical manner, one or two lines in the same clock cycle, each line consisting of REQ_SIZEX pixels, at most 28 in the present example. When returning data back to the client, identifier argument 12 will be sent along with the data, and may be used by the client to identify to which of its requests the currently supplied line belongs.
Returning to
The cache keeps track of and simultaneously handles Nactive requests (in the current example, Nactive=4). So from the client's point of view, 8 request slots are available (Npending+Nactive). The plurality of input request slots, mentioned above as an input request pooling unit, alleviates the single request miss latency in the case where a request must be serviced from the external storage, since other requests can be transferred to the client during this time. Thus, utilization of the cache-to-client transfer interface is improved, avoiding the bandwidth penalty associated with the backend storage.
Upon arrival at the pending pool, the rectangular block request is aligned with the grid of bx×by blocks, and field/frame co-ordinates are adjusted according to the request type and storage format of the reference pictures used in that request by the acceptor block 510.
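The grid alignment step can be sketched as rounding the rectangle outward to block boundaries. This sketch covers only the alignment; the field/frame co-ordinate adjustment is not modelled, and the function name align_to_grid is hypothetical.

```python
def align_to_grid(x, y, width, height, bx=8, by=4):
    """Expand a pixel rectangle outward to the bx-by-by block grid.

    Returns (x0, y0, aligned_width, aligned_height) in pixels.
    Floor division also handles negative (out-of-picture) origins.
    """
    x0 = (x // bx) * bx
    y0 = (y // by) * by
    x1 = -(-(x + width) // bx) * bx    # ceiling to next block boundary
    y1 = -(-(y + height) // by) * by
    return x0, y0, x1 - x0, y1 - y0


aligned = align_to_grid(6, 3, 28, 28)
```

A 28×28 request starting at pixel (6, 3) grows to a 40×32 aligned rectangle, that is, 5×8 blocks of 8×4 pixels.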
Each active request is assigned a processing state, which in this preferred embodiment is one of LOOKUP, ASK_FINISHED, LOADING, and TRANSFER. Each active request transitions between those states in the order listed, subject to the condition that at a particular time, only a single request can be in the LOOKUP and ASK_FINISHED states, but several active requests can simultaneously be in the LOADING and TRANSFER states.
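The ordering of the four processing states can be sketched as a simple state machine. The exclusivity rule is read here as "at most one request in LOOKUP and at most one in ASK_FINISHED at any time"; that reading is an assumption of this sketch, as are the function names.

```python
from collections import Counter
from enum import Enum


class RequestState(Enum):
    LOOKUP = 0
    ASK_FINISHED = 1
    LOADING = 2
    TRANSFER = 3


ORDER = list(RequestState)
EXCLUSIVE = {RequestState.LOOKUP, RequestState.ASK_FINISHED}


def next_state(state):
    """Return the following state, or None once TRANSFER has completed."""
    index = ORDER.index(state)
    return ORDER[index + 1] if index + 1 < len(ORDER) else None


def states_are_legal(states):
    """Check the assumed exclusivity rule over all active requests."""
    counts = Counter(states)
    return all(counts[state] <= 1 for state in EXCLUSIVE)


active = [RequestState.LOADING, RequestState.LOADING,
          RequestState.TRANSFER, RequestState.LOOKUP]
```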
In a preferred embodiment, whenever a new active request slot is available and the lookup pipes are idle, and provided that at least one request is pending, the oldest pending request is activated by the activation block 520. In other preferred embodiments, other criteria may also be applied, for example, priority may be given to requests to read over requests to pre-fetch (bring_cache), or priority may be given to a request for memory representing a luminance image over a request for a chrominance image. In so activating, the request is transferred to an active slot (i.e., an available plane) in the MCB 525 with the state being assigned to a LOOKUP state. In some preferred embodiments, criteria for selecting a particular MCB plane may be employed, for example, if optional output pooling units are employed, the state of the output pooling units may be taken into consideration.
Reference is now made to
The MCB contains storage elements 715 (MCB Elements or MCBE) that keep track of the on-going activity in the cache. The MCB is a three-dimensional cube 710 where each plane 720 of the MCB relates to a rectangle 725 of a reference picture 730. The rectangle so referenced comprises a matrix of Rx×Ry blocks, where each block is bx×by pixels in size. As noted earlier, the block size imposes a grid structure on the picture (i.e., the total number of horizontal blocks is roundup(Sx/bx) and the total number of vertical blocks is roundup(Sy/by)). Each rectangle 725 is aligned to this grid by adjusting the address outward in all four directions as needed. Each MCBE relates to the corresponding cache block containing the corresponding portion of the picture.
A unique feature of the MCB is that, in contrast to a standard memory, many storage elements can be accessed and modified simultaneously in one clock cycle, with access patterns unique to the needs of this embodiment's cache system.
In the presently described preferred embodiment, the MCB cube comprises four planes. Each of the four planes corresponds to one active request. Each plane comprises per-plane header information related to the corresponding request, such as its location and extent, picture number, color, request id, and timestamp. The exact header information required varies depending on the embodiment and parameters. Each plane in the presently described embodiment further comprises 40 storage elements arranged as a matrix of 5 horizontal×8 vertical MCBEs representing storage of 40 pixels horizontal×32 pixels vertical. This size is dictated by the choice of cache block size and maximum request size. As stated above, in this embodiment the request block maximum size is 28×28 pixels. Horizontally, the request block may start anywhere within a grid block, e.g., at the seventh pixel. In this example, with a 28-pixel wide request, the width would extend horizontally to include the first two pixels of the fifth consecutive block. Similarly, in the vertical direction, allowing for any alignment of the requested block, the maximum vertical size is 32 pixels, yielding the requirement for an array of 5×8 MCBEs to minimally cover the area which may contain the maximum-sized request.
With four planes there is a total of 160 MCBEs in the MCB cube. Each MCBE describes one cache block, which as noted is 8×4 pixels. Thus, in this embodiment, there is a total of 1280 (40×32) pixels described by a single plane and 5,120 (40×32×4) pixels described by the entire cube.
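The plane dimensions and totals quoted above follow from the block size and the maximum request size. The short sketch below reproduces that arithmetic by trying every possible starting offset within a block; the function name max_blocks_spanned is hypothetical.

```python
def max_blocks_spanned(request_size: int, block_size: int) -> int:
    """Worst-case number of grid blocks touched along one axis."""
    return max((offset + request_size - 1) // block_size + 1
               for offset in range(block_size))


BX, BY = 8, 4            # cache block size in pixels
MAX_REQ = 28             # maximum request size in each direction
N_ACTIVE = 4             # number of MCB planes

cols = max_blocks_spanned(MAX_REQ, BX)
rows = max_blocks_spanned(MAX_REQ, BY)
mcbes_per_plane = cols * rows
pixels_per_plane = mcbes_per_plane * BX * BY
total_mcbes = mcbes_per_plane * N_ACTIVE
total_pixels = pixels_per_plane * N_ACTIVE
```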
Each MCB storage element (MCBE) is a descriptor for the MCBE's associated cache block. Each MCBE contains the physical block address and associativity index. The physical block address and associativity index define the associated cache block's position inside its local sub-cache. The MCBE further comprises state information that describes the current action being performed on the referenced cache block (one of: PENDING, LOOKUP, HIT, MISS, or LOADING in the presently described embodiment).
In the presently described preferred embodiment, the MCB supports the following set of six write operations:
Multiple operations can be done simultaneously in a single clock cycle, and every operation can affect or process multiple MCBEs simultaneously. Some operations are MCBE-oriented, some are plane-oriented, and some are cube-oriented, affecting all storage elements.
Each of the operations is now described in greater detail with reference to
Operation W1, init_pending 740, simultaneously initializes all the MCBEs in a plane to the PENDING state.
Operation W2, init_hit 745, simultaneously initializes all the MCBEs in a specified MCB plane to the right of the specified x coordinate (inclusive) to the HIT state, as well as all the MCBEs below the specified y coordinate (inclusive) to the HIT state, leaving only the upper left corner of the plane (extending down and to the right to the coordinates (x−1, y−1)) as PENDING. W1 and W2 are used by the activation block 520 when activating a new pending request. W2 is preferably used when the rectangle of interest is smaller than the whole plane. Applying the HIT state for the “don't care” region means that the whole plane will register as HIT when the actual area of interest is truly HIT. This is effectively an optimization of the described embodiment to simplify the circuitry for later recognizing that the row or the plane is completely hit (i.e., R2, R3). In this embodiment, the W1 and W2 operations execute simultaneously in a single clock cycle.
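The combined effect of W1 and W2 on one plane can be modelled as follows. The hardware performs these updates on all elements at once; the loops here are only a behavioural sketch, and the plane is represented as a hypothetical list of rows of state strings.

```python
PENDING, HIT = "PENDING", "HIT"


def init_pending(cols=5, rows=8):
    """W1: every MCBE of the plane starts as PENDING."""
    return [[PENDING] * cols for _ in range(rows)]


def init_hit(plane, x, y):
    """W2: mark columns >= x and rows >= y as HIT (the don't-care region)."""
    for r, row in enumerate(plane):
        for c in range(len(row)):
            if c >= x or r >= y:
                row[c] = HIT
    return plane


# A request covering 4 block columns and 7 block rows of the 5x8 plane.
plane = init_hit(init_pending(), 4, 7)
pending_count = sum(row.count(PENDING) for row in plane)
```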
Operation W3, update_state_row_lookup 750, initializes an entire horizontal line of a specified MCB plane to the LOOKUP state.
Operation W4, update_row 755, updates an entire row of a specified plane with new block address and associativity index, as well as setting the states of that row to one of the LOADING, HIT or MISS states.
Operation W5, update_hit 760, is a cube operation working on the entire MCB. It is used to update all the MCBEs that point to the same virtual picture block (i.e., which share the picture number, color, and picture block co-ordinates) to the HIT state. The effect of this operation is to allow the cache of the present embodiments to support overlaps of simultaneously active client requests. In
As noted, each active client request is assigned a plane in the MCB, and all elements inside a single MCB plane are ensured not to overlap (since they reference discrete tiled blocks in the picture). However, it is possible that the blocks referenced by elements from one plane in the cube overlap those referenced in another plane as illustrated in
Operation W6, update_loading 765, modifies the state of a single MCBE to the LOADING state.
Operation R1, first_miss 770, is an MCB cube operation to find the first storage element that is set to a state of MISS. It selects and returns the first MCBE whose state is set to MISS. “First” is defined in this embodiment as the first MCBE (as would be found when stepping through a plane in raster scan fashion) of the plane corresponding to the oldest request.
Operation R2, line_hit 775, returns TRUE when all the MCBEs in the specified row of the specified plane are in the HIT state.
Operation R3, all_hit 780, returns TRUE when all the MCBEs in the specified plane are in the HIT state.
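Read operations R1 to R3 can be sketched over a simplified cube in which each plane carries a timestamp and a matrix of states. The data layout is hypothetical. The sketch also assumes that first_miss moves on to the next-oldest plane when the oldest plane contains no MISS, which the description does not state explicitly.

```python
def line_hit(plane, row):
    """R2: True when every MCBE in the given row is HIT."""
    return all(state == "HIT" for state in plane["states"][row])


def all_hit(plane):
    """R3: True when every MCBE in the plane is HIT."""
    return all(line_hit(plane, r) for r in range(len(plane["states"])))


def first_miss(cube):
    """R1: first MISS in raster order, oldest request first (assumed)."""
    for plane in sorted(cube, key=lambda p: p["timestamp"]):
        for r, row in enumerate(plane["states"]):
            for c, state in enumerate(row):
                if state == "MISS":
                    return plane["request_id"], r, c
    return None


older = {"request_id": 1, "timestamp": 10,
         "states": [["HIT", "HIT"], ["HIT", "MISS"]]}
newer = {"request_id": 2, "timestamp": 20,
         "states": [["MISS", "HIT"], ["HIT", "HIT"]]}
cube = [newer, older]
selected = first_miss(cube)
```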
Operation R4, transfer 785, for a specified row in a specified plane, returns the MCBEs in that row.
Operation R5, tag_usage 790, is a cube operation that, for a specified physical block address, returns the “in-use” state of all four associativities (per the presently described preferred embodiment). By “in-use” is meant, “is this associativity of this physical address referenced by any of the MCBEs in the cube?” This operation is used when all associativities for a given physical block address are occupied and one must be selected for overwriting. This operation is used to ensure that a currently in-use block will not be selected.
Operation R6, lookup_hit_loading 795, is a cube operation that returns TRUE when for a specified virtual picture block, there is any referencing MCBE currently with MISS or LOADING state.
Returning to
Reference is now made to
The lookup pipes use MCB operation W3, update_state_row_lookup for initialization of the lookup procedure, and then construct the virtual address of each of the blocks in the horizontal line, map them using the shuffling network 410, and divide them into physical address and tag as described earlier in
When two active requests are for areas that overlap in the virtual address space, MCBEs in two planes are associated with the overlapping cache blocks. The first time a missing cache block is encountered, the MCBE associated with that encountered cache block is assigned a MISS state. The miss state is determined by checking the tag memories. If the tag memory indicates that the block is missing from the cache, the MCBE is assigned to MISS and the tag memory is updated to TAG_HIT (as distinguished from the MCBE HIT state).
When a subsequent MCBE is encountered and references (i.e., is associated with) the same cache block (whose MCBE was previously marked MISS), the lookup engine determines from the tag memory that the desired cache block has already been encountered. The MCB cube operation R6, lookup_hit_loading is used to determine if the block is actually in the cache (i.e., the MCBE state is HIT) or if the block is still in transit (i.e., the MCBE state is MISS or LOADING for at least one other co-existing MCBE). If the block is not yet actually present in the cache, that subsequent MCBE will be marked LOADING, which indicates that the block is, or soon will be, in transit from the storage backend into the cache, due to an earlier encountered MCBE associated with the same cache block already being marked MISS.
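The three possible outcomes of a lookup (MISS, LOADING, HIT) can be sketched as below. For brevity the tag memory is modelled as a set of virtual blocks and the cube as a dictionary of per-block state lists; both are simplifications assumed for this sketch, and the names are hypothetical.

```python
def lookup(block, tag_memory, cube_states):
    """Classify one block reference and record the resulting MCBE state."""
    if block not in tag_memory:
        tag_memory.add(block)          # tag becomes TAG_HIT
        state = "MISS"                 # this MCBE triggers the backend fetch
    elif any(s in ("MISS", "LOADING") for s in cube_states.get(block, [])):
        state = "LOADING"              # R6 true: block is still in transit
    else:
        state = "HIT"                  # block is present in the cache
    cube_states.setdefault(block, []).append(state)
    return state


tag_memory = set()
cube_states = {}
blk = (0, 0, 12, 30)   # (pic_nr, color, block column, block row)

first = lookup(blk, tag_memory, cube_states)
second = lookup(blk, tag_memory, cube_states)
# The block arrives from the backend and all referencing MCBEs become HIT.
cube_states[blk] = ["HIT"] * len(cube_states[blk])
third = lookup(blk, tag_memory, cube_states)
```

Only the first reference produces a MISS, so the block is requested from the backend once even though three MCBEs end up referring to it.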
It should be noted that in this embodiment a rare case could potentially cause a deadlock when R6 and W5 are executed in the same cycle. The rare case occurs when an incoming block from another request has just arrived in this cycle and the lookup engine 530 is trying to read the state of MCBEs associated with a current cache block just as the backend data acceptor 550 is updating all MCBEs associated with that cache block to the HIT state. To prevent this deadlock, the storage acceptor block has bypass logic connected directly to the lookup pipe, which detects this case by checking whether the current cache block has just, in this cycle, been brought into the cache, and then sets the current cache block's MCBE to HIT regardless of the result of R6.
Cache block replacement is implemented in this preferred embodiment using a least-recently-fetched (LRF) policy. To select a suitable candidate to be replaced out of the available associativities, each candidate is checked via its tag whether it is present, and MCB cube operation R5, tag_usage, is invoked to find out whether any of the currently active requests being handled by the cache is using that particular candidate. If so, that candidate is rejected from consideration. Then, the best candidate is chosen, taking into account the indicator of which block was last brought into the cache, so that older blocks will be replaced first.
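A least-recently-fetched selection with the in-use veto of R5 might look like the sketch below. Each way is represented by a fetch sequence number (or None when empty). Preferring an empty way over evicting a valid block is an assumption of this sketch, and the function name choose_victim is hypothetical.

```python
def choose_victim(ways, in_use):
    """Pick the associativity to overwrite for one physical block address.

    ways:   per-way fetch sequence number, or None if the way is empty
    in_use: per-way flags as returned by a tag_usage-style query
    Returns the way index, or None if every way is currently in use.
    """
    candidates = [i for i in range(len(ways)) if not in_use[i]]
    if not candidates:
        return None
    # Empty ways sort first (assumed); otherwise the oldest fetch wins.
    return min(candidates,
               key=lambda i: (ways[i] is not None,
                              ways[i] if ways[i] is not None else 0))


# Way 1 holds the oldest block but is in use, so way 3 is chosen instead.
victim = choose_victim([10, 3, 7, 5], [False, True, False, False])
```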
The above operations are performed in a pipelined fashion, with a new line of blocks handled at each clock cycle. Lookup starts on a particular line even before the previous line has completed its lookup.
It should be noted that in embodiments where Nactive<=Na, there is always an associativity available, provided the shuffling network has not been programmed to map multiple blocks in one request to the same physical address. In embodiments where these conditions are not met, cache block replacement may have to wait until an associativity is freed.
To reiterate, the present embodiments permit multiple lines within the same active request, and blocks within the same line are guaranteed not to contend in the cache, enabling the lookup pipe to proceed with no risk of stalling due to collisions or resource contention. This is ensured, for example by sub-dividing the overall cache structure into eight banks (sub-caches) that divide the horizontal address space into eight independent vertical stripes (repeated periodically), and by designing the hash mapping function used by the shuffling network such that adjacent vertical blocks in the virtual address space map into different physical addresses.
Moreover, in some embodiments, lookup on different active requests can proceed in parallel. As long as the number of active requests Nactive (four in the current example) is less than or equal to the associativity (4-way in the current example), it is ensured that each particular physical address is not used more than Nactive (4) times, again allowing the lookup pipes to operate without the possibility of having to wait for cache memory to become available for replacement.
Other embodiments of the invention are possible where the above-stated condition does not hold (i.e., Nactive>Na). For example the embodiment may provide a larger number of active requests (for example, 8), while avoiding higher associativity settings (e.g., using 4-way associativity). In that case, a resulting collision can be handled by adding a WAIT state to each cache block's MCBE. The WAIT state may be used in the case where a miss is detected, but no associativity is available for replacement. The presently described embodiment then waits until such a cache block exits the cache on its way to the client (i.e., satisfying all active requests for it), freeing the corresponding associativity. The freed associativity is then used for a cache block waiting for it as indicated by the WAIT state.
Returning to
When several miss candidates are simultaneously contending for service, an arbiter selects the best candidate by giving preference to the oldest active request and, within a particular request, scanning the blocks in raster scan fashion. Because the blocks are fetched in raster scan fashion, the lines tend to become available to the client in top-to-bottom order.
The miss logic is internally pipelined as well, being able to schedule a new memory block request every cycle, at a rate faster than the rate at which the backend memory is able to service the requests. Due to this, blocks that miss and are close in the picture are requested in close proximity in time, and since they are typically stored in the memory storage close to each other, memory bank switch activation penalty (e.g., as in DDR) will be minimized. Each such request to the memory is also classified with a priority indicator, which can be lowered if necessary, to prevent overwhelming the memory storage with too many requests in case this is an issue.
In parallel to the miss logic subsystem 540 issuing new requests to the backend module 110 (545), the backend module services the requests, and returns block data through the storage data acceptor subsystem 550. For each returned block, the acceptor module changes the returned block's state from MISS to HIT in the MCB. Due to the parallel processing of several active requests, when there is region overlap it may happen that several active requests wait for a particular block to arrive. As described in the lookup subsystem, only the first such block has MISS state and causes a request to the storage backend module, while subsequent requested physical blocks mapped to the same virtual block are assigned the LOADING state and partake of that request. However, all waiting requests need to be notified of block completion, which is done using MCB cube operation W5, update_hit.
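The sharing of one backend fetch among overlapping requests can be sketched as follows. This is a simplified Python model under stated assumptions: the dictionary `states`, keyed by `(request_id, virtual_block)`, and the functions `lookup` and `mark_block_arrived` are hypothetical stand-ins and do not reproduce the MCB cube operations themselves.

```python
def lookup(states, request_id, vblock):
    """Assign a state to one requested block and return it.

    Only a "MISS" result triggers a fetch from the storage backend.
    """
    others = [s for (_, vb), s in states.items() if vb == vblock]
    if "HIT" in others:
        state = "HIT"        # data already in the cache
    elif others:
        state = "LOADING"    # an earlier request already issued the fetch
    else:
        state = "MISS"       # first requester: issue the fetch
    states[(request_id, vblock)] = state
    return state


def mark_block_arrived(states, vblock):
    """Mark the block HIT for every waiting request; return their ids."""
    notified = []
    for key in states:                 # values change, keys do not
        if key[1] == vblock:
            states[key] = "HIT"
            notified.append(key[0])
    return notified


states = {}
print(lookup(states, 0, "B7"))            # MISS: fetch issued once
print(lookup(states, 1, "B7"))            # LOADING: shares that fetch
print(mark_block_arrived(states, "B7"))   # both requests are notified
```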
The backend data acceptor subsystem 550 writes the returned data into the main data cache memories. In a preferred embodiment of the invention, the cache is divided into eight sub-caches, or banks, corresponding to the repeating eight vertical stripes division of the input frames as described above with reference to
In a preferred embodiment, a block corresponds to 256 bits (8 columns×4 rows×8 bits per pixel) and is returned from the external storage in 2 clock cycles, 128 bits in each cycle: the first cycle returns rows 0 and 1, and the second cycle returns rows 2 and 3. Additionally, each memory bank is further divided into two sub-banks, each 64 bits wide. Data is written into the memory in direct order, (0,1), in the first cycle, but in reversed order, (3,2), in the second cycle, so that lines 0 and 3 share one sub-bank and lines 1 and 2 share the other. This allows the transfer logic to read two lines simultaneously when transferring data to the client both in the frame format, in which case lines (0+1) and (2+3) need to be simultaneously transferred, and in field formats, where lines (0+2) or (1+3) are simultaneously transferred.
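A short Python check makes the benefit of the reversed second write concrete. The constant and function names are illustrative assumptions; the second write order, `DIRECT_TWICE`, is a hypothetical naive alternative included only for contrast.

```python
# Each tuple lists (row written to sub-bank 0, row written to sub-bank 1)
# for one of the two write cycles of a 4-row block.
DIRECT_THEN_REVERSED = [(0, 1), (3, 2)]   # order described in the text
DIRECT_TWICE = [(0, 1), (2, 3)]           # hypothetical naive order

FRAME_PAIRS = [(0, 1), (2, 3)]            # lines read together in frame format
FIELD_PAIRS = [(0, 2), (1, 3)]            # lines read together in field format


def sub_bank_of(row, write_order):
    """Return the sub-bank (0 or 1) that holds a given block row."""
    for cycle_rows in write_order:
        if row in cycle_rows:
            return cycle_rows.index(row)
    raise ValueError(f"row {row} is not part of a 4-row block")


def readable_together(row_a, row_b, write_order):
    """Two rows can be read in one cycle only from different sub-banks."""
    return sub_bank_of(row_a, write_order) != sub_bank_of(row_b, write_order)


for name, order in [("reversed", DIRECT_THEN_REVERSED), ("naive", DIRECT_TWICE)]:
    ok = all(readable_together(a, b, order) for a, b in FRAME_PAIRS + FIELD_PAIRS)
    print(name, "conflict-free for frame and field reads:", ok)
```

With the naive order, rows 0 and 2 would land in the same sub-bank, so a field-format read of lines (0+2) would need two cycles.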
In parallel to the data acceptor module updating the data memories and the MCB, the client transfer module transfers available data from the cache to the client. The first stage of the transfer module is an arbiter, which selects the active request to be handled. For each handled request, data is transferred to the client two lines at a time. The internal division of the cache into 8 sub-caches spanning 64 horizontal pixels, together with the fact that the horizontal request size (including alignment to the cache block grid) is guaranteed to be smaller than the horizontal span, per the preferred embodiment, guarantees that data for the entire line will be present simultaneously and can be read from the single-port data memories in a single clock cycle. The additional sub-division of each bank into two sub-banks, and the reversed order of the lines in each elemental block, guarantee that two lines can be simultaneously transferred in a single clock cycle.
The best active request is chosen in three steps. First, MCB operation R2, line_hit, is applied to all active requests to retain only the requests that already have their current line data available in the cache. Then the remaining list is filtered against ongoing write commands to the same memory banks that have to be read from, as explained earlier for the data acceptor module. Finally, similar to the miss logic, the best candidate out of the remaining list is chosen by taking into account the arrival time of the active requests from the client, i.e., giving priority to the oldest active request.
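The selection of the best transfer candidate can be sketched in Python as a chain of filters. The request fields (`line_hit`, `banks`, `arrival`), the `busy_banks` set, and the function name `pick_transfer` are assumptions for this illustration; in particular, the boolean `line_hit` field merely stands in for the result of the MCB operation of the same name.

```python
def pick_transfer(requests, busy_banks):
    """Pick the active request to transfer next, or None.

    requests   : list of dicts with 'line_hit' (bool), 'banks' (set of
                 bank numbers the current line reads), 'arrival' (int).
    busy_banks : set of banks currently being written by the acceptor.
    """
    # Step 1: keep requests whose current line is fully in the cache.
    ready = [r for r in requests if r["line_hit"]]
    # Step 2: drop requests that would read a bank being written.
    ready = [r for r in ready if not (r["banks"] & busy_banks)]
    # Step 3: the oldest remaining request wins.
    return min(ready, key=lambda r: r["arrival"], default=None)


requests = [
    {"id": "A", "line_hit": True, "banks": {0, 1}, "arrival": 1},
    {"id": "B", "line_hit": True, "banks": {4, 5}, "arrival": 3},
    {"id": "C", "line_hit": False, "banks": {6}, "arrival": 0},
]
best = pick_transfer(requests, busy_banks={1})
print(best["id"])   # "B": A conflicts with a write, C has no data yet
```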
Once the best candidate is chosen, one or two lines of data are read from the data memories. In some embodiments, the data is transferred directly back to the client, along with the indication of which request number it corresponds to (request id), and which line it is from the request.
In another embodiment, the data is first passed through luma/chroma output pooling units that are placed between the client transfer module and the client. The luma/chroma pooling units serve two purposes. One purpose is to reorder the active request lines so that each request can be transferred to the client in its entirety at a fixed rate once all lines have been gathered, in case the client cannot handle the inter-mixing of lines from various requests. The second purpose is to allow the client to postpone the data transfer to it in case it is not able to handle the bandwidth due to its own limitations. For that purpose, a STALL signal is issued by the client to the cache per pooling unit, causing the data transfer from the pooling unit to the client to stop immediately. Meanwhile, the requests that are still being worked on in the client transfer logic pipeline can be buffered by draining into the pooling units. The client transfer logic would then assess the state (i.e., capacity utilization) of the pooling units before it selects new active requests to work on. Similarly, the activation block 520 may consider the utilization of the output pooling units in order to give priority to an output pooling unit that is less full than the other output pooling units. Also similarly, the output logic 570 and 580 of the output pooling units may give priority to units that are more full so as to free space.
In a preferred embodiment shown in
Meanwhile, the queue's output logic (570 or 580) transfers a completed entry from the head of the queue to the client. After transfer, the space allocated for the entry is freed for re-use in the circular queue.
As an additional feature of a preferred embodiment, each output pooling unit may be divided into two sub-units, one sub-unit for even request lines and one sub-unit for odd request lines. By dividing into two sub-units, two lines may be written simultaneously, one to each sub-unit.
To illustrate how the output pooling units work, consider the following simplified exemplary embodiment. Each output pooling unit has a write address (WA) and a read address (RA). Data is written into the unit by the transfer logic 560 a line (or two) at a time (possibly not in order, as described above) such that line i is written to address WA+i (with appropriate logic to map lines to the even or odd sub-unit, as described above). The line offset i may change at each cycle. When all the lines of a particular active request have been written to the entry, the write address WA is incremented by the request size, effectively starting a new entry.
From the output logic 570 and 580 side, read access is performed in a strictly sequential manner, reading the data 1 or 2 lines at a time and incrementing the read address RA after each read.
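The write-address and read-address behavior of the simplified exemplary embodiment can be modeled in Python as shown below. This is a sketch under stated assumptions: the class `PoolingUnit` and its method names are hypothetical, the even/odd sub-unit split and the STALL signal are omitted, and WA and RA are kept as ever-increasing counters that are reduced modulo the capacity only when indexing the circular storage.

```python
class PoolingUnit:
    """Toy output pooling unit: out-of-order writes, sequential reads."""

    def __init__(self, capacity):
        self.mem = [None] * capacity
        self.wa = 0        # base address of the entry being written
        self.ra = 0        # next address to read
        self.written = 0   # lines of the current entry written so far

    def write_line(self, i, data, request_size):
        """Store line i of the current request at address WA + i."""
        assert self.wa + request_size - self.ra <= len(self.mem), "unit full"
        self.mem[(self.wa + i) % len(self.mem)] = data
        self.written += 1
        if self.written == request_size:   # entry complete
            self.wa += request_size        # start a new entry
            self.written = 0

    def read_line(self):
        """Return the next line of a completed entry, or None."""
        if self.ra == self.wa:             # nothing complete to read
            return None
        data = self.mem[self.ra % len(self.mem)]
        self.ra += 1                       # frees the slot for re-use
        return data


unit = PoolingUnit(capacity=8)
unit.write_line(2, "line2", request_size=3)   # lines may arrive out of order
unit.write_line(0, "line0", request_size=3)
print(unit.read_line())                       # None: entry not complete yet
unit.write_line(1, "line1", request_size=3)
print([unit.read_line() for _ in range(3)])   # lines come out in order
```

Because reads start only once WA has advanced past a completed entry, the client always receives the lines of a request contiguously and in order, regardless of the order in which the transfer logic wrote them.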
For the benefit of clarity, we have described various embodiments of the present invention in the context of a typical use for motion estimation. After considering the description, those skilled in the art will realize that in addition to the detailed examples described, the present invention can be used in different configurations of the various parameters (such as bx, by, Nbanks, etc), as well as in many other video and image processing applications or other computing applications unrelated to image processing that could benefit from reduction in the access times to memory. Examples of some other applications that require high memory bandwidth during processing of motion pictures or images are: video image enhancement, video pre-processing, robotic vision, pattern matching, image recognition, and display processing, or any other process that requires repeated access to many pixels.
It is expected that during the life of this patent many relevant devices and systems will be developed, and the scope of the terms herein is intended to include all such new technologies a priori.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. All publications, patents, and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.