High Definition (HD) displays are becoming increasingly popular, and many users are now accustomed to viewing high definition media. However, much media, such as older movies and shows, was captured in Standard Definition (SD). Because the scene was captured by a video camera that recorded only in standard definition, there are not enough pixels to take advantage of a high definition display, even when one is available.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with some aspects of the present invention as set forth in the remainder of the present application with reference to the drawings.
The present invention is directed to systems, methods, and apparatus for a caching structure for use in block-based video processing, substantially as shown in and/or described in connection with at least one of the figures, as set forth more completely in the claims.
These and other advantages and novel features of the present invention, as well as illustrated embodiments thereof, will be more fully understood from the following description and drawings.
Referring now to
The fast memory 25 can comprise, for example, an on-chip memory or a cache. In general, the on-chip memory or cache is faster, but more expensive. It is generally not feasible to store an entire other object 15 in the fast memory 25. The bulk memory 30 can comprise, for example, off-chip memory. In general, the off-chip memory is slower to access, but less expensive and capable of storing entire other objects 15.
Portions of the object, e.g., portion 10(n), are limited from being directly affected by portions of objects 15 that are beyond a certain range R from the portion 10(n). However, any portion of another object 15, e.g., 15(z), can potentially affect any portion of object 10(n) via a cascade of objects 15(n . . . z), each of which overlaps another until portion 10(n) is affected. Accordingly, a diffusion range D surrounding portion 10(n) is defined, wherein objects that directly affect a portion of object 10 outside of the diffusion range D are not considered with respect to portion 10(n). Thus, processing portion 10(n) can be performed using only a particular portion of the other objects 15.
It is noted that the diffusion area D for portion 10(n) overlaps portion 10(n+1). Accordingly, some of the portions of other objects 15 that directly affect portion 10(n) and the diffusion area D also affect portion 10(n+1). Therefore, while processing portion 10(n), portions of other objects 15 that are found to affect portion 10(n+1) are stored in the fast memory 25 to reduce access time.
The foregoing can be used in a variety of applications where objects are processed in portions. For example, object 10 and other objects 15 can comprise video frames. The system can be used to process frames by either deinterlacing or increasing frame resolution. An exemplary embodiment wherein the present invention is used to increase the resolution of video frames will now be described.
Referring now to
It is noted that the position x, y is a discrete variable that actually corresponds to a range [xΔx−0.5Δx, xΔx+0.5Δx], [yΔy−0.5Δy, yΔy+0.5Δy], in both the scene and the picture, where Δx×Δy are the dimensions of the pixel. An exemplary standard for frame dimensions is ITU-R Recommendation BT.656, which provides for 30 frames of 720×480 pixels per second. Additionally, the pixel value of 125(x, y) is also a discrete value. For example, 24-bit color uses 256 red, 256 blue, and 256 green color values to represent the range of colors that are visible to the human eye. It is noted, however, that a variety of different color standards can be used.
While the video frames 120 comprise discrete pixels at discrete locations, a real-life scene that is captured is continuous in color and space. Thus, the position in a scene corresponding to pixel 125(x, y), [xΔx−0.5Δx, xΔx+0.5Δx], [yΔy−0.5Δy, yΔy+0.5Δy], is a range that may include several colors. The colors themselves may not necessarily match exactly with any one of the 24-bit colors.
However, the actual color that is recorded by the camera can be modeled as some type of statistical averaging of the colors that appear within [xΔx−0.5Δx, xΔx+0.5Δx], [yΔy−0.5Δy, yΔy+0.5Δy]. The averaging can be a simple averaging of the colors or a weighted averaging based on the distance of each point and color from the center x, y. A particular one of the 24-bit colors is selected that most closely approximates the actual color.
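As a simple illustration of this model, the following Python sketch averages the scene colors sampled within a pixel's footprint, optionally weighted by distance from the pixel center, and quantizes the result to the nearest 24-bit color. The function name and the sampled-grid representation of the continuous scene are illustrative assumptions, not taken from the text.

    import numpy as np

    def capture_pixel(scene_patch, weights=None):
        # scene_patch: (H, W, 3) float array of scene colors sampled within
        # the pixel's range [x*dx - 0.5*dx, x*dx + 0.5*dx] x
        # [y*dy - 0.5*dy, y*dy + 0.5*dy].
        # weights: optional (H, W) array weighting samples by distance from
        # the pixel center; None gives a simple average.
        if weights is None:
            avg = scene_patch.reshape(-1, 3).mean(axis=0)
        else:
            w = weights / weights.sum()
            avg = (scene_patch * w[..., None]).reshape(-1, 3).sum(axis=0)
        # Select the nearest representable 24-bit color
        # (256 levels per channel).
        return np.clip(np.round(avg), 0, 255).astype(np.uint8)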
The differences between adjacent colors in 24-bit color are indistinguishable to the human eye, so adjacent colors appear continuous. An exemplary standard for display of the video sequence 105 is ITU-R Recommendation BT.656, which provides for 30 frames of 720×480 pixels per second. Although the 720×480 pixels appear spatially continuous to the viewer, information is lost from the original scene, resulting in a loss of detail. For example, fine texture in the scene may be lost.
Referring now to
Thus the pixels 225(x, y) are discrete variables that actually correspond to a range [0.5xΔx−0.25Δx, 0.5xΔx+0.25Δx], [0.5yΔy−0.25Δy, 0.5yΔy+0.25Δy], in both the scene and the picture, where 0.5Δx×0.5Δy are the dimensions of the pixel. As in the case of the lower resolution, the pixel value of 225(x, y) is also a discrete value. For example, 24-bit color uses 256 red, 256 blue, and 256 green color values to represent the range of colors that are visible to the human eye.
The position in a scene corresponding to pixel 225(x, y), [0.5xΔx−0.25Δx, 0.5xΔx+0.25Δx], [0.5yΔy−0.25Δy, 0.5yΔy+0.25Δy], is also a range that may include several colors. The colors themselves may not necessarily match exactly with any one of the 24-bit colors. The actual color that is recorded by the camera can be modeled as some type of statistical averaging of the colors that appear within [0.5xΔx−0.25Δx, 0.5xΔx+0.25Δx], [0.5yΔy−0.25Δy, 0.5yΔy+0.25Δy]. The averaging can be a simple averaging of the colors or a weighted averaging based on the distance of each point and color from the center x, y. A particular one of the 24-bit colors is selected that most closely approximates the actual color.
The foregoing higher resolution picture more accurately captures the scene and provides greater detail, including finer texture, than the lower resolution picture. However, much media, such as older movies and shows, was captured in Standard Definition (SD), while high definition displays are becoming increasingly common. It is noted that other resolution changes are also possible.
When a scene is captured in lower resolution, although the continuous detail of the scene is not known, information about the scene as a series of ranges [xΔx−0.5Δx, xΔx+0.5Δx], [yΔy−0.5Δy, yΔy+0.5Δy] is known. The image of
Nevertheless, the foregoing information can be estimated by up-sampling the low resolution frame using any one of a variety of techniques, such as spatial interpolation or filtering. The foregoing results in an estimated higher resolution frame. An exemplary upsampled frame 320 that estimates the higher resolution frame is shown in
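As an illustration of one such technique, the following Python sketch up-samples a frame by bilinear spatial interpolation. It is a generic example of up-sampling, not the specific filter used in the embodiment.

    import numpy as np

    def upsample_bilinear(frame, factor=2):
        # frame: (H, W) array of pixel values; returns (H*factor, W*factor).
        h, w = frame.shape
        out_h, out_w = h * factor, w * factor
        # Position of each output pixel center in input pixel coordinates.
        ys = (np.arange(out_h) + 0.5) / factor - 0.5
        xs = (np.arange(out_w) + 0.5) / factor - 0.5
        y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
        x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
        y1 = np.minimum(y0 + 1, h - 1)
        x1 = np.minimum(x0 + 1, w - 1)
        wy = np.clip(ys - y0, 0.0, 1.0)[:, None]
        wx = np.clip(xs - x0, 0.0, 1.0)[None, :]
        f = frame.astype(float)
        top = f[y0][:, x0] * (1 - wx) + f[y0][:, x1] * wx
        bot = f[y1][:, x0] * (1 - wx) + f[y1][:, x1] * wx
        return top * (1 - wy) + bot * wy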
The foregoing can be done with each of the low resolution frames that are captured at other times, e.g., t−3, t−2, t−1, t, t+1, t+2, t+3 . . . , resulting in upsampled frames 320t−3, 320t−2, 320t−1, 320t, 320t+1, 320t+2, 320t+3. It should be noted, however, that with recursion, the processing of the higher resolution frames prior to 320t was completed prior to the processing of frame 320t. Accordingly, these frames are now designated 320t−3′, 320t−2′, 320t−1′. Frames 320t+1, 320t+2, 320t+3 are not yet completely processed.
Information from proximate time periods can be used to improve the quality of frame 320t. The foregoing will now be described with reference to
ME stage 1: In the first stage, details of which are shown in 410, motion estimation is performed between pairs of neighboring frames: 320t−3′ and 320t−2′; 320t−2′ and 320t−1′; 320t−1′ and 320t; 320t and 320t+1; 320t+1 and 320t+2; and 320t+2 and 320t+3. For each pair of neighboring frames, two motion estimations are performed.
In the first motion estimation, the earlier frame, e.g., 320t−1′, is the reference frame and is divided into blocks of a predetermined size. The later frame, e.g., 320t, is the target frame and is searched for blocks that match the blocks of 320t−1′. In the second motion estimation, the later frame, e.g., 320t, is the reference frame and is divided into blocks of a predetermined size. The earlier frame, e.g., 320t−1′, is the target frame and is searched for blocks that match the blocks of 320t.
Motion estimation in this stage is based on full-search block matching, with (0, 0) as the search center and a rectangular search area with horizontal dimension search_range_H and vertical dimension search_range_V. The reference frame is partitioned into non-overlapping blocks of size block_size_H×block_size_V. Next, for a block R in the reference frame with its top-left pixel at (x, y), the corresponding search area is defined as the rectangular area in the target frame delimited by the top-left position (x−0.5*search_range_H, y−0.5*search_range_V) and the bottom-right position (x+0.5*search_range_H−1, y+0.5*search_range_V−1), where search_range_H and search_range_V are programmable integers. Thereafter, in searching for the best-matching block in the target frame for the block R in the reference frame, R is compared with each of the blocks in the target frame whose top-left pixel is included in the search area. The matching metric used in the comparison is the sum of absolute differences (SAD) between the pixels of block R and the pixels of each candidate block in the target frame. If, among all the candidate blocks in the search area, the block at the position (x′, y′) has the minimal SAD, then the motion vector (MV) for the block R is given by (MVx, MVy), where MVx=x−x′ and MVy=y−y′.
As noted above, with recursion, the processing of frames 320t−3′, 320t−2′, 320t−1′ is completed, and frames 320t−3′ . . . 320t+3 form a window for 320t. During the processing of 320t−1′, the upsampling was performed for all of the time periods except t+3, and motion estimation was performed for all of the foregoing pairs except 320t+2 and 320t+3. All of the other motion estimation results are available from previous processing due to the pipelined processing of consecutive images. Thus, only the foregoing motion estimation needs to be computed at this stage, provided the previous motion estimation results are properly buffered and ready to be used in the next two stages of motion estimation.
After the first stage of motion estimation, the next two stages are preferably performed in the following order at the frame level: first, stages 2 and 3 for 320t−2′ and 320t+2, then stages 2 and 3 for 320t−3′ and 320t+3.
ME stage 2: In this stage, details of which are shown in 420, the motion vectors between non-adjacent frames are predicted based on the available motion estimation results. The predicted motion vectors will be used as search centers in stage 3. For example, the predicted motion vectors between 320t+2 as the reference frame and 320t as the target frame can be represented as C_MV(t+2, t). To determine C_MV(t+2, t), MV(t+2, t+1) and MV(t+1, t) are combined, both being available from the previous stage of motion estimation processing.
For example, as shown in
The predicted motion vector for R from 320t+2 to 320t may be set as the sum of the motion vector for the block R from 320t+2 to 320t+1 and the median of the motion vectors, from 320t+1 to 320t, of the blocks overlapping the block T, as shown in Equation 1:
C_MV(t+2, t, x, y)=MV(t+2, t+1, x, y)+median(MV(t+1, t, xi, yi), i=0, 1, 2, 3)  (1)
where the median of a set of motion vectors may be the motion vector with the lowest sum of distances to the other motion vectors in the set. For example, consider each motion vector in the set as a point in two-dimensional space, and calculate the distance between each pair of motion vectors in the set. The median of the set may then be the motion vector whose summation of distances to the other motion vectors is minimal among the motion vectors in the set. Note that in other embodiments, the distance between two motion vectors may be calculated as the Cartesian distance between the two points corresponding to the two motion vectors, or it may be approximated as the sum of the horizontal distance and the vertical distance between the two motion vectors to reduce computing complexity.
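As an illustration, a minimal Python sketch of this median, using the Manhattan (horizontal plus vertical) distance approximation mentioned above:

    def median_mv(mvs):
        # Median of a set of motion vectors: the vector whose summed
        # distance to the other vectors in the set is minimal. The
        # Manhattan approximation of the distance is used to reduce
        # computing complexity, as described above.
        def dist(a, b):
            return abs(a[0] - b[0]) + abs(a[1] - b[1])
        return min(mvs, key=lambda v: sum(dist(v, u) for u in mvs))

    # Example: the outlier (9, 9) is never selected as the median.
    print(median_mv([(1, 0), (1, 1), (9, 9), (0, 1)]))  # -> (1, 1)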
Similarly, the predicted motion vectors from 320t+3 as the reference frame to 320t as the target frame are obtained by cascading the motion vectors from 320t+3 to 320t+2 with the motion vectors from 320t+2 to 320t. The predicted motion vectors from 320t−3′ to 320t can be obtained in a similar manner.
In another embodiment of this invention, in predicting the motion vector for R from non-adjacent frames, the median operator in Equation 1 may be replaced with the arithmetic average of the four motion vectors. In another embodiment, in predicting the motion vector for R, the motion vector of the block Si (i=0, 1, 2, 3) having the minimal SAD with respect to the block T may be used in Equation 1 in place of the median of the four motion vectors. In yet another embodiment of this invention, in predicting the motion vector, one may calculate the SAD corresponding to each of the following four motion vectors: MV(t+2, t+1, x, y)+MV(t+1, t, xi, yi) (i=0, 1, 2, 3), and choose the one with the minimal SAD.
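A minimal sketch of the last alternative predictor, assuming a caller-supplied helper sad_fn that evaluates the SAD of a candidate motion vector (the helper and function names are hypothetical, not from the text):

    def predict_mv_min_sad(mv_r, neighbor_mvs, sad_fn):
        # Cascade MV(t+2, t+1) with each MV(t+1, t, xi, yi) and keep the
        # candidate whose SAD, evaluated by sad_fn, is minimal.
        candidates = [(mv_r[0] + m[0], mv_r[1] + m[1]) for m in neighbor_mvs]
        return min(candidates, key=sad_fn)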
ME stage 3: In the last stage, 430 of
Subsequent to motion estimation processing, the image 320t′ is subjected to processing for motion-compensated back projection (MCBP). The inputs to this block are the frames and motion estimation results from 320t+k (k=−3, −2, −1, 1, 2, 3), and frame 320t.
MCBP favors frames that are temporally close to 320t over frames further away, because motion estimation is generally more reliable for a pair of frames with a smaller temporal distance than for a pair with a larger temporal distance. This ordering also favors the motion estimation results of prior frames over those of later frames. Thus, MCBP follows the order t−3, t+3, t−2, t+2, t−1, t+1. It is noted, however, that other orders can be used.
Referring now to
In a first step, for each block-grid-aligned block R in 320t+3, the corresponding motion-compensated block T in 320t is found using the motion estimation results. For example, if block R is at the position (x, y) in 320t+3 and its motion vector is (mvx, mvy), the corresponding motion-compensated block T is the block at the position (x−mvx, y−mvy) in 320t.
In a second step, for each pixel z in the low resolution frame LR(t+3) within the spatial location of block R, the corresponding pixels in block R of 320t+3 are identified based on a pre-determined spatial window, for example, a00 . . . a55, and consequently the corresponding pixels in block T of 320t, for example, a′00 . . . a′55. From the identified pixels in 320t, a simulated pixel z′ corresponding to z is generated.
In the second step above, to identify the pixels in 320t corresponding to the pixel z in LR(t+3) and to simulate the pixel z′ from these pixels, the point spread function (PSF) of the image acquisition process is ideally required. Since the PSF is generally not available to high-resolution processing and often varies among video sources, an assumption may be made with regard to the PSF, considering both the required robustness and the computational complexity.
For example, a poly-phase down-sampling filter may be used as the PSF. The filter may consist, for example, of a 6-tap vertical poly-phase filter and a subsequent 6-tap horizontal poly-phase filter. As shown in
the pixel z′ may then be simulated from the pixels a′ij as in Equation 2:
z′=Σij PSFij*a′ij  (2)
where PSFij is the coefficient in the PSF corresponding to a′ij. In another embodiment of this invention, a bi-cubic filter may be used as the PSF.
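As a hedged illustration of the second step, the following Python sketch simulates z′ as a PSF-weighted sum over a 6×6 window of pixels a′00 . . . a′55 in 320t. The separable 6-tap kernel coefficients are placeholders, not values from the text.

    import numpy as np

    # A separable PSF built from a hypothetical 6-tap 1-D kernel; the
    # coefficients are placeholders, not taken from the text.
    taps = np.array([1.0, 4.0, 11.0, 11.0, 4.0, 1.0])
    taps /= taps.sum()
    PSF = np.outer(taps, taps)  # 6x6 kernel, sums to 1

    def simulate_pixel(block_t, psf=PSF):
        # block_t: (6, 6) array of pixels a'00 .. a'55 in 320t.
        # Returns the simulated pixel z' = sum_ij PSFij * a'ij (Equation 2).
        return float((block_t * psf).sum())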
In a third step, the residue error between the simulated pixel z′ and the observed pixel z is computed, as residue_error=z−z′.
In a fourth step, the pixels in 320t can be updated, for example, from pixels a′00 . . . a′55 in 320t to pixels a″00 . . . a″55, according to the calculated residue error, as shown at the bottom right in
In the fourth step above, the residue error is scaled by λ*PSFij and added back to the pixel a′ij in 320t to generate the pixel a″ij. The purpose of PSFij is to distribute the residue error to the pixels a′ij in 320t according to their respective contributions to the pixel z′. The purpose of the scaling factor λ, as proposed herein, is to increase the robustness of the algorithm to motion estimation inaccuracy and noise. λ may be determined according to the reliability of the motion estimation results for the block R. The motion estimation results can include (mvx, mvy, sad, nact). Among the eight immediate neighboring blocks of R in 320t+3, let sp be the number of blocks whose motion vectors do not differ from (mvx, mvy) by more than 1 pixel (in terms of the high resolution), both horizontally and vertically. In an embodiment of this invention, λ may be determined according to the following formula:
if sp≥1 && sad<nact*4/4, λ=1;
else if sp≥2 && sad<nact*6/4, λ=1/2;
else if sp≥3 && sad<nact*8/4, λ=1/4;
else if sp≥4 && sad<nact*10/4, λ=1/8;
else if sp≥5 && sad<nact*12/4, λ=1/16;
else λ=0;  (3)
conveying that the contribution of the residue error to updating the pixels in 320t should be proportional to the reliability of the motion estimation results. This reliability is measured in terms of the motion field smoothness in the neighborhood of R, represented by the variable sp, and how good the match is between R and T, represented, for example, by the comparison of sad and nact.
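The following Python sketch combines formula (3) with the fourth-step update. It assumes the Equation 2 form of the simulated pixel and uses illustrative function names; it is a sketch of the described computation under those assumptions, not a definitive implementation.

    import numpy as np

    def compute_lambda(sp, sad, nact):
        # Formula (3): larger sp (smoother motion field) and smaller sad
        # relative to nact (better match) give a larger weight.
        if sp >= 1 and sad < nact * 4 / 4:
            return 1.0
        if sp >= 2 and sad < nact * 6 / 4:
            return 1.0 / 2
        if sp >= 3 and sad < nact * 8 / 4:
            return 1.0 / 4
        if sp >= 4 and sad < nact * 10 / 4:
            return 1.0 / 8
        if sp >= 5 and sad < nact * 12 / 4:
            return 1.0 / 16
        return 0.0

    def mcbp_update(block_t, psf, z, sp, sad, nact):
        # Steps 2-4 for one low resolution pixel z: simulate z', compute
        # the residue error, and add lam * PSFij * residue_error back to
        # each pixel a'ij, yielding the updated pixels a''ij.
        z_sim = float((block_t * psf).sum())        # step 2 (Equation 2)
        residue_error = z - z_sim                   # step 3
        lam = compute_lambda(sp, sad, nact)
        return block_t + lam * psf * residue_error  # step 4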
In certain embodiments of the present invention, in the event of a motion vector with integer motion, λ may be reduced by half, as the pixel adds less detail to the image but may still be useful for reducing noise.
In another embodiment of this invention, in calculating the scaling factor λ, the reliability of the motion estimation results may be measured using the pixels in 320t+3 and 320t corresponding to the pixel z, i.e., a00 . . . a55 in 320t+3 and a′00 . . . a′55 in 320t. For example, sad and nact may be computed from these pixels only, instead of from all the pixels in R and T.
For example, if the block size is 4×4 pixels, the sad between R and T may be defined as in Equation 4:
sad=Σi=0 . . . 3 Σj=0 . . . 3 |Ri,j−Ti,j|  (4)
and act of R may be defined as in Equation 5:
where Ri,j refers to the i,j pixel of R, and likewise Ti,j refers to the i,j pixel of T. Block R is a rectangular area with a top-left pixel of R0,0 and a bottom-right pixel of R3,3; likewise, block T is a rectangular area with a top-left pixel of T0,0 and a bottom-right pixel of T3,3. Equations (4) and (5) indicate that the pixels surrounding R and T may also be used in the computation of sad and act. The activity of a block may be used to evaluate the reliability of the corresponding motion estimation results. To accurately reflect reliability, act may have to be normalized against the corresponding SAD in terms of the number of absolute pixel differences, as shown below in Equation 6:
nact=act*(num_pixels_in_sad/num_pixels_in_act)  (6)
where num_pixels_in_sad is the number of absolute pixel differences in the calculation of sad, and num_pixels_in_act is that of act. The term nact is the normalized activity of the block. Note that the surrounding pixels of R and T may be used in calculating sad and act as well.
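As an illustration, the following Python sketch computes sad, act, and nact for a 4×4 block. Since Equation 5 is not reproduced in the text, the activity is assumed here to be the sum of absolute differences between adjacent pixels of R; this assumption is also noted in the code comments.

    import numpy as np

    def sad_act_nact(R, T):
        # R, T: 4x4 blocks. The sad follows Equation 4. The activity act
        # is assumed to be the sum of absolute differences between
        # horizontally and vertically adjacent pixels of R, since
        # Equation 5 itself is not reproduced in the text.
        R = R.astype(int)
        T = T.astype(int)
        sad = int(np.abs(R - T).sum())
        act = int(np.abs(np.diff(R, axis=0)).sum()
                  + np.abs(np.diff(R, axis=1)).sum())
        h, w = R.shape
        num_pixels_in_sad = h * w                      # 16 for a 4x4 block
        num_pixels_in_act = h * (w - 1) + (h - 1) * w  # 24 for a 4x4 block
        nact = act * num_pixels_in_sad / num_pixels_in_act  # Equation 6
        return sad, act, nact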
The foregoing can be repeated for the frames for each time period t−3, t−2, t−1, t+1, t+2, and t+3, resulting in a motion-compensated back projected higher resolution frame 320t.
Referring now to
Motion-free back projection between frame 320n′ and frame 320n″ is performed similarly to motion-compensated back projection, except that all motion vectors are set to zero and the weighting factor λ is a constant.
The modules 820-840 can be implemented in software, firmware, hardware (such as processors or ASICs, which may be manufactured from or using hardware description language coding that has been synthesized), or using any combination thereof. The embodiment may further include a processor 870, an input interface 810 through which the lower resolution images are received, and an output interface 860 through which the higher resolution images are transmitted.
An off-chip memory 880 stores the source LR pictures. On-chip memory 851 stores portions of the higher resolution frames that are being updated. Program memory 852 stores instructions for execution by the processor 870.
It is noted that the foregoing image processing involves the transfer and processing of large amounts of data. Storing larger amounts of the data within the integrated circuit 802 increases the cost and consumes more area on the integrated circuit 802. Storing larger amounts of data in the off-chip memory 880 results in higher access times, and consequently, lower throughput.
It is noted that in certain embodiments of the present invention, the higher resolution pictures do not have to be stored in a frame buffer and can be output directly to the display. This can save considerable bandwidth and memory footprint.
Referring now to
For a given lower resolution pixel 905, higher resolution pixel 910 represents the higher resolution pixel for no motion. Box 915 represents an exemplary maximum search range for pixel 910 during motion estimation. Box 925 represents the extent of pixels that can be updated by the point spread function kernel for any of the pixels in box 915. Box 920 represents the pixels within the maximum search range and the point spread function.
However, although the motion estimation search range and the point spread function each have a limited extent, it is possible for a given pixel to affect all of the pixels in the higher resolution frame.
Referring now to
This implies that considerably more storage of the higher resolution frame is required than simply the output patch A0, or block 920, when attempting to generate A0. This approach also leads to substantial additional computation cycles, as many operations will need to be repeated in the Z0 region.
However, it is possible to reduce the additional storage and computation cycles by clipping this diffusion. The limitation can be applied both vertically and horizontally, vertically only (processing the picture one stripe at a time), or horizontally only.
Referring now to
Although low resolution blocks that map outside the diffusion limit can affect the region A0, both the likelihood of affecting the region A0 and the impact decrease as the diffusion limit is increased. Accordingly, a diffusion ring should be selected such that the probability and impact of a pixel outside the diffusion ring are acceptably small.
Referring now to
Referring now to
As can be seen from the foregoing, the diffusion ring for patch 0 overlaps with the core of patches 1, 4, and 5. Likewise, the core of patch 0 overlaps with the diffusion ring of patches 1, 4, and 5.
Therefore, while processing patch 0, certain blocks from the lower resolution frame will also be needed for processing patches 1, 4, and 5. If the blocks mapped to patch 0 from the lower resolution frame are fetched from an off-chip memory and discarded after processing patch 0, some of the blocks would have to be fetched again during the processing of any or all of patches 1, 4, and 5.
To reduce the number of off-chip fetches, blocks that are found to be in the overlapping regions are stored in the on-chip cache 890. During processing of patches 1, 4, and/or 5, the blocks can be found in the cache, thereby reducing fetch cycles.
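The following is a minimal Python sketch of this overlap-aware caching idea. The class and method names are illustrative, and the fetch_off_chip callback stands in for a DMA transfer from the off-chip memory 880.

    class OverlapBlockCache:
        # Sketch of the on-chip cache 890 usage: blocks fetched for the
        # current patch that also fall in an overlapping region are kept
        # on-chip so later patches avoid a second off-chip fetch.
        def __init__(self):
            self.store = {}

        def fetch(self, coord, fetch_off_chip):
            if coord in self.store:        # cache hit: no off-chip access
                return self.store[coord]
            return fetch_off_chip(coord)   # cache miss: off-chip fetch

        def retain(self, coord, block, overlaps_other_patch):
            if overlaps_other_patch:       # e.g., needed by patch 1, 4, or 5
                self.store[coord] = block  # otherwise the block is discarded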
Referring now to
If room still exists in the FIFO, it indicates that the particular patch did not consume all of the bandwidth allocated to it, and the system attempts to fill the caching FIFO using blocks that straddle the patch boundaries. The coordinates of these blocks come from a bandwidth surplus cache FIFO that caches the coordinates of blocks that straddle the line Y2 or Y5. This structure is particularly important as the patch size is vertically decreased.
In certain embodiments of the present invention, the bandwidth surplus cache stores block coordinates, as opposed to the blocks themselves. In the event that surplus bandwidth is available, the blocks (or portions of the blocks) in the bandwidth surplus cache are fetched and placed in the source data FIFO.
Referring now to
Referring now to
Referring now to
Referring now to
Each block in the cache is read out and processed into P(k). The coordinates are read out from the block coordinate bandwidth surplus cache FIFO, and these blocks are retrieved and processed at 1615. These fetches count against the "num_fetch_blocks" count for P(k). The motion vectors are scanned at 1620, looking for blocks that map into P(k) whose top line of effect on the patch P(k) is below Y2 (blocks above this have already been processed). If the motion vectors indicate at 1625 that a block is fully contained within the region between P(k) and P(k+1), Y3-Y5, the block is cached in the source data caching FIFO at 1630, again up to the limit "num_fetch_blocks".
The motion vectors are scanned to identify all possible blocks whose top line of destination influence is Y5 or above at 1635. Any block found to match this criterion has its coordinates stored in the block coordinate bandwidth surplus cache FIFO at 1640. Once all the motion vectors that may map into P(k) have been scanned, the source data caching FIFO is checked to see if it is full at 1645.
If not, block coordinates are read out from the bandwidth surplus cache FIFO at 1650, and the resultant fetched blocks are stored in the source data cache FIFO at 1655, until the source data cache FIFO is full, the surplus bandwidth cache FIFO is exhausted, or the fetch count has reached "num_fetch_blocks".
Note that the first vertical patch begins with the source data cache FIFO empty. Therefore, it is necessary to allow the first vertical position additional fetches beyond "num_fetch_blocks"; let this be called "num_fetch_blocks_P0". As a direct result, more processing time needs to be given to P0 to avoid spiking the bandwidth. This directly adds to the delay through the video processing block.
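The following Python sketch illustrates, under stated assumptions, the per-patch flow described above (1615-1655): drain the source data cache FIFO, spend fetch budget on coordinate-cached boundary blocks, then scan the motion vectors, caching blocks or coordinates for later patches. All names (process_patch, fetch_block, the info keys) are illustrative, and the control flow is a simplification of the flowchart, not a definitive implementation.

    from collections import deque

    def process_patch(k, src_fifo, surplus_coord_fifo, blocks_for_patch,
                      fetch_block, process, num_fetch_blocks):
        # src_fifo: deque of block data cached during earlier patches.
        # surplus_coord_fifo: deque of coordinates of boundary-straddling
        # blocks. Off-chip fetches for P(k) are capped at num_fetch_blocks.
        fetches = 0
        # Drain the blocks already cached on-chip for this patch.
        while src_fifo:
            process(k, src_fifo.popleft())
        # Fetch blocks whose coordinates were cached; these fetches count
        # against the num_fetch_blocks budget.
        while surplus_coord_fifo and fetches < num_fetch_blocks:
            process(k, fetch_block(surplus_coord_fifo.popleft()))
            fetches += 1
        # Scan the motion vectors for remaining blocks mapping into P(k).
        for info in blocks_for_patch:
            if fetches >= num_fetch_blocks:
                break
            block = fetch_block(info["coord"])
            fetches += 1
            process(k, block)
            if info["fully_in_next_patch_region"]:
                src_fifo.append(block)     # reuse while processing P(k+1)
            elif info["straddles_boundary"]:
                surplus_coord_fifo.append(info["coord"])  # coordinates only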
The use of the block coordinate bandwidth surplus cache FIFO is important for several reasons. First, it ensures that the source data cache FIFO will virtually always be full when beginning to process a patch P(k), and as a result it ensures that each patch P(k) will be allowed to process num_cache_block+num_fetch_blocks blocks, allowing for excellent picture quality evenly across the entire destination picture. It also ensures that each patch will use a bandwidth equivalent to up to num_fetch_blocks (and rarely less). Predictable and constant bandwidth is desirable in many video processing applications; of note, it is desirable in video processing to be able to accept and generate pixels at a steady and predictable rate. The bandwidth surplus cache FIFO also ensures that blocks straddling the boundary between cacheable and non-cacheable regions are only processed once.
Although the diffusion rings abut in the illustrated embodiments, it is noted that the diffusion rings need not abut and may instead overlap.
It is noted that additional, more complex bandwidth sharing schemes are possible in certain embodiments of the invention. For example, in certain embodiments, if another part of the chip is not using its full bandwidth allocation over a particular window of time, the caching structure may go beyond the num_fetch_blocks limit, or vice versa.
Referring now to
At 1730, the motion estimation module 830 performs motion estimation. At 1740, the motion compensated back projection module 840 performs motion compensated back projection for blocks that are stored in the cache 890 and that map to a particular destination domain patch. At 1745, the DMA 885 fetches blocks from the off-chip memory that map to the destination domain patch, and the higher resolution image is updated by projecting the blocks onto the destination domain patch.
However, should a block lie in a region that is overlapped by another destination domain patch at 1750, the block is written to the cache 890 and is available for later use. Otherwise, the block is discarded. The foregoing is repeated until the entire higher resolution frame is updated. At 1760, the higher resolution picture is updated by the motion-free back projection module 850 using motion-free back projection. The foregoing is repeated for each destination patch in the higher resolution frame.
Example embodiments of the present invention may include such systems as personal computers, personal digital assistants (PDAs), mobile devices (e.g., multimedia handheld or portable devices), digital televisions, set top boxes, video editing and displaying equipment and the like.
The embodiments described herein may be implemented as a board level product, as a single chip, as an application specific integrated circuit (ASIC), or with varying levels of the system integrated with other portions of the system as separate components. Alternatively, certain aspects of the present invention may be implemented as firmware. The degree of integration may primarily be determined by speed and cost considerations.
While the present invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention.
Therefore, it is intended that the present invention not be limited to the particular embodiment disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims and equivalents thereof.