The present invention is directed to systems and methods for processing a video captured using a low resolution, low speed video camera to generate a high resolution, high speed video as output.
Image enhancement refers to constructing a high resolution image from a low resolution image, i.e., producing an image which has been magnified by some factor. Prior art methods use an interpolation technique to fill in areas of the image where no data exists. However, the quality of the magnified image is inherently limited because, as the magnification factor increases, image detail degrades since larger areas of missing data need to be generated. For video, the problem requires not only magnifying each frame in the x and y directions, but also increasing the frame rate of the video sequence such that the video is enhanced both spatially and temporally.
Accordingly, what is needed in this art is a sophisticated video processing system and method which receives a lower resolution, lower frame rate video and processes that video to generate a higher resolution, higher frame rate video as output while minimizing image degradation.
What is disclosed is a video processing system and method which receives a lower resolution, lower frame rate video and processes that video to generate a higher resolution, higher frame rate video with minimized image degradation. The present method reconstructs missing image data in the x, y, and t directions using a compressed sensing framework. The teachings hereof enable video magnification in all three directions in real-time as the video is being captured. The present techniques find use in a wide array of diverse applications such as medical imaging, satellite imaging, and military and homeland security applications, to name a few.
In one embodiment, the present method for increasing the temporal and spatial resolution of a video involves the following. First, a plurality of image frames of a video are received for processing. The video has been captured using a video camera with a spatial resolution of (M×N) in the (x, y) direction, respectively, and a temporal resolution (T) in frames per unit of time. First and second magnification factors f1 and f2 are selected for a desired amount of spatial enhancement in (x, y). A third magnification factor f3 is also selected for a desired amount of temporal enhancement in T. The received image frames are processed using a dictionary comprising a set of high and low resolution patch cubes. In a manner more fully disclosed herein, the patch cubes of video data are used to induce spatial and temporal components in the video where no data exists. A high resolution coarse video X0 is generated as a result of processing the received video. The coarse video has a spatial resolution of (f1*M)×(f2*N) and a temporal resolution of (f3*T). The coarse video can be smoothed to produce a smoothed video for subsequent viewing and analysis.
Many features and advantages of the above-described method will become readily apparent from the following detailed description and accompanying drawings.
What is disclosed is a video processing system and method which receives a lower resolution, lower frame rate video and processes that video to generate a higher resolution, higher frame rate video with minimized image degradation.
“Spatial resolution” refers to the number of pixels in a given image frame of a video in the (x, y) direction. Spatial resolution is typically expressed in pixels per image (ppi). A camera with a higher spatial resolution captures a higher density of pixels per image.
“Temporal resolution”, also called “frame rate”, refers to the number of images a video camera is capable of capturing in a pre-defined amount of time. Temporal resolution T is expressed in frames per unit of time, typically frames per second (fps). A video camera with a higher temporal resolution can acquire more images over a given amount of time, i.e., at a higher frame rate.
A “spatial magnification factor” is a multiplier that is used herein to define an amount of video enhancement in one or both of the x and y directions. In various embodiments hereof, first and second spatial magnification factors f1 and f2, where (f1≧1) and (f2≧1), are selected or otherwise received for a desired resolution enhancement in the x and y directions, respectively. For example, given a selected spatial magnification factor of f1=f2=3, a received video that was captured by a video camera with a spatial resolution of 320×320 pixels per image would have a post-processed spatial resolution of (3*320)×(3*320), i.e., 960×960 pixels per image.
A “temporal magnification factor” is a multiplier that is used herein to define an amount of video enhancement in the number of frames per second (the T-direction). In various embodiments hereof, a temporal magnification factor f3, where f3≧2, is selected or otherwise received for a desired temporal resolution enhancement in the T-direction. For example, given a selected temporal magnification factor of f3=5, a received video that was captured by a video camera with a frame rate of 20 fps would have a post-processed temporal resolution of (5*20), i.e., 100 fps.
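The effect of the selected magnification factors on the post-processed dimensions is simple multiplication, as the two examples above illustrate. The short sketch below restates that arithmetic; the function name is purely illustrative and not part of the present disclosure.

```python
def enhanced_dimensions(M, N, T, f1, f2, f3):
    """Return the post-processed (x, y, t) dimensions of a video captured at
    spatial resolution (M x N) and frame rate T, given spatial magnification
    factors f1, f2 (each >= 1) and temporal magnification factor f3 (>= 2)."""
    return (f1 * M, f2 * N, f3 * T)

# Examples from the text: 320x320 pixels with f1 = f2 = 3, and 20 fps with f3 = 5.
print(enhanced_dimensions(320, 320, 20, 3, 3, 5))  # -> (960, 960, 100)
```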
“Selecting”, as used herein, is intended to be widely construed. Selecting a magnification factor means to choose, provide, enter, receive or otherwise obtain information which identifies a desired amount of video enhancement for the purposes hereof. Selections can be made by a user using, for instance, a keyboard of a computer workstation or using a mouse. Making a selection may involve using pull down menus and selecting one or more menu options as is normally understood, using for example a mouse-click operation or making an entry using a keyboard or keypad. Such a selection may be pre-set or otherwise defined. Selectable menu options, values, parameters, and the like, may be retrieved from a memory or storage and provided to a processor executing machine-readable program instructions to effectuate such selections. One or more selections may be obtained from an application or from a remote device over a network.
A “video”, as is generally understood, refers to a time-varying sequence of 2D images which have been captured using a video camera at a given frame rate. The video is received for processing.
“Receiving a video” is intended to be widely construed and means to retrieve, receive, capture with a video camera, or otherwise obtain a video for processing in accordance with the present method. The video can be retrieved from a memory or internal storage of the video camera system, or obtained from a remote device over a network. The video may also be retrieved from media such as a CD-ROM, DVD, or USB drive, for example. The video may be downloaded from a website for processing. The video can be captured and processed using a handheld cellular device or a handheld computing device such as an iPad. The received video is processed using a dictionary.
A “dictionary” contains a set of high resolution patch cubes Ph. In another embodiment, the dictionary further contains a set of low resolution patch cubes Dl. The low and high resolution patch cubes of video data are used to induce spatial and temporal components in the received video where no data exists, to obtain a resulting enhanced video having the user-desired spatial and temporal magnification. It should be understood that the video data comprising any given patch cube can be written as a vector. As such, terms such as “patch cube” and “vector” are used interchangeably. The low and high resolution patch cubes comprising various embodiments of the dictionary are obtained by processing low and high resolution training videos.
A “low resolution training video” is a video captured using a video camera with a resolution of (M×N×T) such that the low resolution training video has the same spatial and temporal resolution as the received video to be processed. For discussion purposes, assume the low resolution training video Xl is related to a corresponding high resolution training video Xh by:
Xl=SXh (1)
where S is a down-sampling operator. The set of low resolution patch cubes Dl is obtained by randomly moving a 3D box of size (S1×S2×S3) through the image frames of one or more low resolution training videos. Each location of the 3D box as it is re-positioned through the low resolution training video data defines a respective low resolution patch cube for the dictionary.
A “high resolution training video” is a video captured using a video camera with a resolution of (f1*M)×(f2*N)×(f3*T) such that the high resolution training video has the same spatial and temporal resolution as the post-processed received video. In one embodiment, the set of high resolution patch cubes is obtained from a high resolution training video. In a similar manner as discussed with respect to obtaining the low resolution patch cubes Dl, the set of high resolution patch cubes Ph is obtained by randomly moving a spatially and temporally magnified 3D box of size (f1*S1)×(f2*S2)×(f3*S3) through the image frames of one or more high resolution training videos. Each location of the 3D box as it is re-positioned through the high resolution training video data defines a respective high resolution patch cube for the dictionary.
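By way of illustration only, the sketch below shows one way paired patch cubes could be harvested from a registered pair of low and high resolution training videos. The array layout (frames first), the helper name, and the use of NumPy are assumptions made for the example, not part of the present disclosure.

```python
import numpy as np

def build_patch_cube_dictionary(low_vid, high_vid, S1, S2, S3, f1, f2, f3,
                                n_cubes, seed=0):
    """Randomly re-position a 3D box of size (S1 x S2 x S3) through a low
    resolution training video of shape (T, M, N), and a magnified box of size
    (f1*S1 x f2*S2 x f3*S3) through the corresponding high resolution training
    video of shape (f3*T, f1*M, f2*N).  Each box location yields one low
    resolution and one high resolution patch cube, written as a vector.
    Returns (Dl, Ph) with one vectorized patch cube per column."""
    rng = np.random.default_rng(seed)
    T, M, N = low_vid.shape
    Dl, Ph = [], []
    for _ in range(n_cubes):
        x = rng.integers(0, M - S1 + 1)   # random box origin in the x-direction
        y = rng.integers(0, N - S2 + 1)   # random box origin in the y-direction
        t = rng.integers(0, T - S3 + 1)   # random box origin in the t-direction
        low_cube = low_vid[t:t + S3, x:x + S1, y:y + S2]
        high_cube = high_vid[f3 * t:f3 * (t + S3),
                             f1 * x:f1 * (x + S1),
                             f2 * y:f2 * (y + S2)]
        Dl.append(low_cube.ravel())
        Ph.append(high_cube.ravel())
    return np.column_stack(Dl), np.column_stack(Ph)
```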
“Processing the video” means using the dictionary to spatially and temporally enhance the received video to obtain an output video having the desired magnification.
A cross-section of received video data is repeatedly identified using a 3D box of size at least (3×3×3), starting from the first pixel of image data. Each cross-section of video data is processed by identifying a high resolution patch cube in the dictionary with which to magnify (patch) that particular 3D piece of video data. The identified cross-section of low resolution video data is patched as follows. If the first magnification factor f1≠1, the identified high resolution patch cube is used to increase the size of the identified cross-section by a factor of f1 in the x-direction. If the selected second magnification factor f2≠1, the identified high resolution patch cube is used to increase the size of the identified cross-section by a factor of f2 in the y-direction. And, the identified high resolution patch cube is used to increase the number of frames of the cross-section of video data by a factor of f3 in the T-direction. Thereafter, the 3D box is shifted to identify the next cross-section of video data for processing, subject to the following constraints. (1) If the first magnification factor f1≠1, then the 3D box is shifted such that the next cross-section of video data overlaps the previous cross-section by at least f1 pixels in the x-direction. (2) If the second magnification factor f2≠1, then the 3D box is shifted such that the next cross-section of video data also overlaps the previous cross-section by at least f2 pixels in the y-direction. (3) The 3D box is shifted such that the next cross-section of video data also overlaps the previous cross-section by at least 1 frame in the T-direction. Once the next cross-section of video data has been identified, it is processed in the same manner. The 3D box is again shifted, with the constraints being met, and that cross-section of video data is processed. The process repeats until all of the video data has been processed.
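A minimal sketch of this traversal appears below. It only enumerates the box origins visited under the overlap constraints; the shift sizes and boundary handling are assumptions for illustration.

```python
def cross_section_origins(M, N, T, box=(3, 3, 3), overlap=(1, 1, 1)):
    """Yield the (x, y, t) origin of each cross-section visited by the 3D box.

    box     : (x, y, t) size of the 3D box, at least (3, 3, 3).
    overlap : minimum overlap with the previous cross-section, e.g. (f1, f2, 1).
    """
    # Shift the box by (size - overlap) in each direction, by at least 1.
    sx, sy, st = (max(b - o, 1) for b, o in zip(box, overlap))
    for t in range(0, max(T - box[2], 0) + 1, st):
        for x in range(0, max(M - box[0], 0) + 1, sx):
            for y in range(0, max(N - box[1], 0) + 1, sy):
                yield x, y, t
```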
A low resolution vector y can be considered a sparse linear combination of low-resolution patch cubes Dl. This relationship is given by:
y = α*Dl  (2)
where the sparse coefficient vector α* is recovered by solving:
α* = argmin_α { λ∥α∥₁ + ∥D̂α − ŷ∥₂ }  (3)
where D̂ = [FlDl PDh]^T (FlDl and PDh are matrices of low and high resolution patch cubes, respectively), T denotes matrix transpose, Fl is a linear feature extraction operator, P is a matrix operator which extracts the region of overlap between the target patch cube and the previously reconstructed patch cube, and ŷ is the concatenation of y and w, where w contains the values of the previously reconstructed high resolution patch cube on the overlap. We use α* and the set of high resolution patch cubes Ph to repeatedly reconstruct high-resolution vectors x using x = α*Dh, and accumulate them to obtain the coarse video X0.
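A minimal sketch of the per-cube recovery of Eqs. (2)-(3) follows. An off-the-shelf ℓ1 solver (scikit-learn's Lasso) stands in for the minimization in Eq. (3), and its objective differs from Eq. (3) only by a constant scaling of the data-fit term; the function name and arguments are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def reconstruct_high_res_vector(y_hat, D_hat, Dh, lam=0.1):
    """Approximately solve Eq. (3) for the sparse coefficients alpha* and
    return the high resolution vector x = Dh @ alpha*.

    y_hat : concatenation of the (feature-extracted) low resolution vector y
            and the overlap values w.
    D_hat : stacked dictionary [Fl*Dl ; P*Dh], one column per patch cube.
    Dh    : matrix of vectorized high resolution patch cubes.
    """
    solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    solver.fit(D_hat, y_hat)   # minimizes ||D_hat a - y_hat||^2 + lam*||a||_1 (up to scaling)
    alpha_star = solver.coef_
    return Dh @ alpha_star
```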
Using steepest descent, we find the image closest to X0 which satisfies the constraint:
X* = argmin_X { ∥Y − SHX∥₂² + c∥X − X0∥₂² }  (4)
where Y is the received video, H is a blurring filter, and S is a down-sampling operator.
The coarse video is smoothed by iterating:
X(k+1) = X(k) − υH^T S^T [SHX(k) − Y] − υc[X(k) − X0]  (5)
where X(k) is the estimate of X at the kth iteration, υ is an adaptation constant, Y is the received low resolution video, X0 is the coarse high resolution video, and c is a weighting factor.
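The iteration of Eq. (5) can be sketched as below. The blurring filter H, the down-sampling operator S, and their transposes are passed in as functions, since their exact forms are not fixed here; the parameter values are placeholders.

```python
def smooth_coarse_video(Y, X0, S, St, H, Ht, upsilon=0.1, c=1.0, n_iter=50):
    """Steepest descent iteration of Eq. (5):
        X(k+1) = X(k) - upsilon*Ht(St(S(H(X(k))) - Y)) - upsilon*c*(X(k) - X0)
    Y  : received low resolution video.
    X0 : coarse high resolution video.
    S, H : down-sampling and blurring operators; St, Ht : their transposes."""
    X = X0.copy()
    for _ in range(n_iter):
        residual = S(H(X)) - Y                              # data-fit term of Eq. (4)
        X = X - upsilon * Ht(St(residual)) - upsilon * c * (X - X0)
    return X
```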
Reference is now being made to the flow diagram of one example embodiment of the present method for increasing the spatial and temporal resolution of a received video.
At step 702, receive a plurality of image frames of video data for processing. The video data has been captured using a video camera having a resolution of (M×N×T), where (M×N) is a spatial resolution of the camera in (x, y), respectively, and T is a temporal resolution of the camera in frames per unit of time.
At step 704, select a first magnification factor f1 where f1≧1 for a desired spatial enhancement in the x-direction.
At step 706, select a second magnification factor f2 where f2≧1 for a desired spatial enhancement in the y-direction.
At step 708, select a third magnification factor f3, where f3≧2, for a desired temporal enhancement in T.
At step 710, process the image frames of video data using a dictionary of high resolution patch cubes Ph and low resolution patch cubes Dl to induce spatial and temporal components in the video data where no data exists. The processing generates, as output, a high resolution coarse video X0 having a spatial resolution of (f1*M)×(f2*N) and a temporal resolution of (f3*T).
At step 712, communicate the high resolution coarse video to a storage device. The high resolution coarse video can be communicated to a memory or a graphical display device for viewing. The generated coarse video can be communicated to a workstation for further processing. In this embodiment, further processing stops. In other embodiments, the high resolution coarse video is smoothed to generate a smoothed high resolution video.
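Stitching steps 702 through 712 together, a high-level driver could look like the sketch below. It reuses the cross_section_origins and reconstruct_high_res_vector helpers sketched earlier, passes the raw low resolution dictionary Dl in place of the stacked D̂ of Eq. (3) (i.e., the feature extraction Fl and the overlap terms P, w are omitted), and simply averages overlapping reconstructions; all of these are simplifying assumptions rather than the disclosed processing.

```python
import numpy as np

def enhance_video(frames, f1, f2, f3, Dl, Dh, box=(3, 3, 3)):
    """Produce a coarse high resolution video X0 of spatial resolution
    (f1*M) x (f2*N) and temporal resolution f3*T from the received frames
    (a (T, M, N) array), using a dictionary of low / high resolution patch
    cubes (Dl, Dh)."""
    T, M, N = frames.shape
    X0 = np.zeros((f3 * T, f1 * M, f2 * N))
    hits = np.zeros_like(X0)                        # count of reconstructions per voxel
    for x, y, t in cross_section_origins(M, N, T, box, overlap=(f1, f2, 1)):
        y_vec = frames[t:t + box[2], x:x + box[0], y:y + box[1]].ravel()
        x_vec = reconstruct_high_res_vector(y_vec, Dl, Dh)     # Eqs. (2)-(3)
        cube = x_vec.reshape(f3 * box[2], f1 * box[0], f2 * box[1])
        sl = (slice(f3 * t, f3 * (t + box[2])),
              slice(f1 * x, f1 * (x + box[0])),
              slice(f2 * y, f2 * (y + box[1])))
        X0[sl] += cube
        hits[sl] += 1.0
    return X0 / np.maximum(hits, 1.0)               # average overlapping regions
```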
It should be appreciated that the flow diagrams hereof are illustrative. One or more of the operative steps illustrated in any of the flow diagrams may be performed in a differing order. Other operations, for example, may be added, modified, enhanced, condensed, integrated, or consolidated with the steps thereof. Such variations are intended to fall within the scope of the appended claims. All or portions of the flow diagrams may be implemented partially or fully in hardware in conjunction with machine executable instructions.
It will be appreciated that the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may become apparent and/or subsequently made by those skilled in the art which are also intended to be encompassed by the following claims. Accordingly, the embodiments set forth above are considered to be illustrative and not limiting. Various changes to the above-described embodiments may be made without departing from the spirit and scope of the invention. The teachings hereof can be implemented in hardware or software using any known or later developed systems, structures, devices, and/or software by those skilled in the applicable art without undue experimentation from the functional description provided herein with a general knowledge of the relevant arts. Moreover, the methods hereof can be implemented as a routine embedded on a personal computer or as a resource residing on a server or workstation, such as a routine embedded in a plug-in, a driver, or the like. The methods provided herein can also be implemented by physical incorporation into an image processing or color management system. Furthermore, the teachings hereof may be partially or fully implemented in software using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer, workstation, server, network, or other hardware platforms. One or more of the capabilities hereof can be emulated in a virtual environment as provided by an operating system or specialized programs, or can leverage off-the-shelf computer graphics software such as that available in Windows or Java, or can be executed on a server, hardware accelerator, or other image processing device.
One or more aspects of the methods described herein are intended to be incorporated in an article of manufacture, including one or more computer program products, having computer usable or machine readable media. The article of manufacture may be included on at least one storage device readable by a machine architecture embodying executable program instructions capable of performing the methodology described herein. The article of manufacture may be included as part of an operating system, a plug-in, or may be shipped, sold, leased, or otherwise provided separately either alone or as part of an add-on, update, upgrade, or product suite. The teachings of any printed publications, including patents and patent applications, are each separately hereby incorporated by reference in their entirety.