The present application claims priority to Chinese Patent Application No. 202410055130.9, filed Jan. 12, 2024, and entitled “Method, Device, and Computer Program Product for Generating Super-Resolution Image Model,” which is incorporated by reference herein in its entirety.
Embodiments of the present disclosure relate to the field of computers, and more particularly, to a method, an electronic device, and a computer program product for generating a super-resolution image model.
Image/video super-resolution is a basic signal processing technology in computer vision. In particular, image/video super-resolution is foundational for digitalization and communication. The goal of this technology is to compress rich spatial/temporal information into a denser space without losing the original quality. In the era of big data, the amount of data is growing exponentially, especially as more high-definition devices become available. At the same time, it is expensive to transmit high-definition images/videos with megapixels over the Internet. Applying super resolution only at the edge can therefore significantly reduce the burden of data transmission and compression/decompression.
Embodiments of the present disclosure provide a solution for generating or training a super-resolution image model.
In a first aspect of the present disclosure, a method for generating a super-resolution image model is provided. The method includes acquiring a first image with a first resolution and a second image with a second resolution. Here, the first image corresponds to the second image. The method further includes generating a first super-resolution image with a first super resolution and a second super-resolution image with a second super resolution based on the first image according to an initial super-resolution image model. The method further includes transforming the first super-resolution image into a first frequency-domain representation. The method further includes transforming the second super-resolution image into a second frequency-domain representation. The method further includes generating a trained super-resolution image model based on the first frequency-domain representation, the second frequency-domain representation, and a reference frequency-domain representation of the second image.
In a second aspect of the present disclosure, an electronic device is provided. The electronic device includes at least one processor, and a memory coupled to the at least one processor, the memory having instructions stored therein that, when executed by the at least one processor, cause the electronic device to perform actions. The actions include acquiring a first image with a first resolution and a second image with a second resolution. Here, the first image corresponds to the second image. The actions further include generating a first super-resolution image with a first super resolution and a second super-resolution image with a second super resolution based on the first image according to an initial super-resolution image model. The actions further include transforming the first super-resolution image into a first frequency-domain representation. The actions further include transforming the second super-resolution image into a second frequency-domain representation. The actions further include generating a trained super-resolution image model based on the first frequency-domain representation, the second frequency-domain representation, and a reference frequency-domain representation of the second image.
In a third aspect of the present disclosure, a computer program product is provided. The computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions that, when executed by a machine, cause the machine to perform the method according to the first aspect.
This Summary is provided to introduce a selection of concepts in a simplified form, which are further described in the Detailed Description below. The Summary is not intended to identify key features or major features of the present disclosure, nor is it intended to limit the scope of the present disclosure.
The above and other objects, features, and advantages of the present disclosure will become more apparent from description of exemplary embodiments of the present disclosure provided in more detail herein with reference to the accompanying drawings. In exemplary embodiments of the present disclosure, the same reference numerals generally represent the same components. In the accompanying drawings:
Principles of the present disclosure will be described below with reference to several example embodiments shown in the drawings. Although illustrative embodiments of the present disclosure are shown in the drawings, it should be understood that these embodiments are described only to enable those skilled in the art to better understand and implement the present disclosure, and are not intended to limit the scope of the present disclosure in any way.
As used herein, the term “include” and its variants mean open inclusion, that is, “include but not limited to.” Unless otherwise stated, the term “or” means “and/or.” The term “based on” means “at least partially based on.” The terms “an exemplary embodiment” and “an embodiment” mean “at least one exemplary embodiment.” The term “another embodiment” means “at least one additional embodiment.” The terms “first,” “second,” and the like may refer to different objects or the same object. Other explicit and implicit definitions may be included below.
In the related art, there are many machine learning solutions based on a neural network for image/video super resolution. Using these machine learning solutions, the details of an image can be reconstructed to a certain extent, so that the generated video can have sharp visual quality. In some solutions, high-frequency information of successive images or videos is used to improve prediction results. Among those, a model based on a band-limited network can project the coordinates into the high-frequency domain, and then reconstruct the image in consideration of the high-frequency information.
However, the ability of a deep neural network to learn high-frequency information is limited, so a model based on high-frequency information cannot guarantee that the high-frequency reconstruction contributes to the final video, which in turn affects the performance of the resulting high-resolution model based on a deep neural network.
In view of this, an embodiment of the present disclosure provides a solution that generates a super-resolution image model by using a frequency-domain loss, to solve one or more of the above problems and other potential problems. In this solution, during training, the initial model generates multiple super-resolution images with different super resolutions based on a training image. The generated super-resolution images are then transformed into the frequency domain to obtain frequency-domain representations corresponding to the spatial representations of the multiple super-resolution images. At the same time, the real value image corresponding to the training image is also transformed into a frequency-domain representation. Finally, the initial model is optimized or trained through the frequency-domain loss between the generated super-resolution images and the real value image, so as to obtain a trained super-resolution image model configured to generate images with different super resolutions based on an input image.
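To make this training flow concrete, the following is a minimal sketch of one training step, assuming PyTorch. The model call signature, the choice of bicubic resampling for matching resolutions, and the helper frequency_loss (sketched later alongside the loss description) are hypothetical placeholders rather than the claimed implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, lr_image, hr_image, scales=(2, 4)):
    # Generate super-resolution images at several different super
    # resolutions from the same training image (hypothetical model API).
    sr_images = [model(lr_image, scale=s) for s in scales]

    loss = 0.0
    for sr in sr_images:
        # Resize the real value image to match each output so the two
        # frequency-domain representations are comparable; bicubic
        # resampling here is an assumption, not fixed by the disclosure.
        ref = F.interpolate(hr_image, size=sr.shape[-2:],
                            mode="bicubic", align_corners=False)
        # Transform the generated image and the real value image into
        # frequency-domain representations and accumulate the loss.
        loss = loss + frequency_loss(torch.fft.fft2(sr), torch.fft.fft2(ref))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```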
In an embodiment of the present disclosure, by minimizing the frequency-domain loss with respect to the real image and supervising the optimization of the network through the frequency enhancement process, the high-frequency components of the reconstructed super-resolution image and the real value image are aligned, and the missing information is recovered, thereby improving the prediction effect and giving the created super-resolution image a sharper visual effect.
On the accessor side, the environment 100 includes a desktop computer 122, a laptop computer 124, a smartphone 126, and a smart projection device 128, each illustratively an example of a terminal device, which may comprise or be otherwise associated with a computing device on the accessor side. It should be understood that a given computing device on the accessor side as shown in the figure is only an example, and such a computing device may include any computing device suitable for deploying the spatial-temporal super-resolution image model 106 according to an embodiment of the present disclosure. In some embodiments, the computing device on the accessor side can communicate with the server 108 and send an access request for stored data to the server 108. For example, the desktop computer 122, the laptop computer 124, the smartphone 126, and the smart projection device 128 may each send a request for the original video 102 to the server 108. In some embodiments, the request sent by the computing device on the accessor side may include the device information of the device. For example, the device information may include the resolution and frame rate supported by the computing device. After the server 108 receives the request from the computing device on the accessor side, the server 108 can determine the specification of the video that should be generated by, for example, examining the device information of the computing device that made the request. For example, the desktop computer 122 can support a first resolution and a first frame rate. The laptop computer 124 can support a second resolution and a second frame rate. The smartphone 126 can support a third resolution and a third frame rate. The smart projection device 128 can support a fourth resolution and a fourth frame rate.
In the embodiment shown in
In the illustrated embodiment, the high-resolution video 114 received by the smartphone 126 has three frames, the high-resolution video 110 received by the desktop computer 122 has four frames, the high-resolution video 112 received by the laptop computer 124 has five frames, and the high-resolution video 116 received by the smart projection device 128 has six frames. In the process of generating the spatial-temporal super-resolution image model 106 according to embodiments of the present disclosure, the performance of the spatial-temporal super-resolution image model 106 is improved by considering the frequency-domain loss between the generated image and the real image, thus making the generated super-resolution image sharper.
As shown in
In some embodiments, in order to obtain the training data and the real value corresponding to it, the server 108 can first obtain a second image with a super resolution as the real value, also referred to herein as a real value image or simply a real image. After that, the server 108 can reduce the resolution of the second image to obtain the first image with a lower resolution as the training image.
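As an illustration of this pairing step, one common way to derive the lower-resolution first image from the second image is bicubic downscaling; the following is a sketch under that assumption, since the disclosure does not prescribe a particular degradation.

```python
import torch.nn.functional as F

def make_training_pair(hr_image, scale=4):
    # The second image (the real value image) keeps its super resolution;
    # the first image is produced by reducing its resolution.
    lr_image = F.interpolate(hr_image, scale_factor=1.0 / scale,
                             mode="bicubic", align_corners=False)
    return lr_image, hr_image
```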
At 204, the method 200 includes generating a first super-resolution image with a first super resolution and a second super-resolution image with a second super resolution based on the first image according to an initial super-resolution image model. For example, in the embodiment shown in
At 206, the method 200 includes transforming the first super-resolution image into a first frequency-domain representation. For example, in the embodiment shown in
In some embodiments, the server 108 can extract a feature map of the generated super-resolution image to obtain a quantized representation of the super-resolution image. The feature map can be extracted by any suitable encoder, for example. After that, the server 108 can determine the feature vector of each pixel in the super-resolution image in the feature map. Thus, a set of feature vectors including the feature vector of each pixel in the super-resolution image can be obtained. Finally, the server 108 can perform a Fourier transform on the obtained set of feature vectors to obtain a set of sub-frequency-domain representations as the first frequency-domain representation.
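A minimal sketch of this transformation, assuming PyTorch; here `encoder` stands in for the "any suitable encoder" mentioned above and is not part of the disclosure.

```python
import torch

def to_frequency_representation(sr_image, encoder):
    # Extract a feature map of the generated super-resolution image;
    # each spatial position of the map holds the feature vector of the
    # corresponding pixel.
    feat = encoder(sr_image)                     # shape (B, C, H, W)
    # A 2-D Fourier transform over the spatial axes yields one
    # sub-frequency-domain representation per feature channel; together
    # they form the frequency-domain representation of the image.
    return torch.fft.fft2(feat, dim=(-2, -1))
```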
In this embodiment, the frequency-domain representation can be obtained by discrete Fourier transform. For example, the frequency-domain representation can be expressed as:
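The specific equation is not reproduced in this text. As a point of reference, the standard two-dimensional discrete Fourier transform, which matches the surrounding description, has the form

$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi \left( \frac{u x}{M} + \frac{v y}{N} \right)},$$

where f(x, y) is the (feature) value at pixel (x, y), M × N is the spatial size of the image or feature map, and (u, v) indexes the spatial frequencies; whether the disclosure's own equation takes exactly this form is an assumption.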
At 210, the method 200 includes training the initial super-resolution image model based on the loss between, on the one hand, the first frequency-domain representation and the second frequency-domain representation and, on the other hand, the reference frequency-domain representation of the second image, to generate a trained super-resolution image model. For example, in the embodiment shown in
In the embodiment shown in
As shown in
At 306, the server 108 determines a frequency-domain error based on the square of the first frequency-domain difference and the square of the second frequency-domain difference. Here, for example, the frequency-domain error can be expressed as the square of the frequency-domain difference as follows:
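The equation itself is omitted from this text. A plausible form, consistent with the surrounding description, with F_1 and F_2 denoting the first and second frequency-domain representations and F_r the reference frequency-domain representation, is

$$e(u, v) = \left| F_1(u, v) - F_r(u, v) \right|^2 + \left| F_2(u, v) - F_r(u, v) \right|^2,$$

where the squared magnitude of each complex-valued frequency-domain difference contributes to the error; this reconstruction is an assumption, not the disclosure's exact equation.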
At 308, the server 108 determines an error weight based on the first frequency-domain difference and the second frequency-domain difference. Here, the error weight can for example be expressed as:
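This equation is likewise omitted. In frequency-domain losses of this general kind (for example, the focal frequency loss), the weight is typically a power of the same frequency-domain difference, such as

$$w(u, v) = \left| F(u, v) - F_r(u, v) \right|^{\alpha},$$

where α is a focusing hyperparameter; this specific form is an assumption.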
At 310, the server 108 determines a frequency-domain loss based on the frequency-domain error and the error weight. In some embodiments, the frequency-domain loss, that is, the loss function in the frequency domain, may be a weighted average of the frequency-domain errors based on the obtained error weights. The frequency-domain loss can, for example, be expressed as:
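Read together with the description above, the omitted loss is plausibly the weighted average

$$\mathcal{L}_{\mathrm{freq}} = \frac{1}{U V} \sum_{u=0}^{U-1} \sum_{v=0}^{V-1} w(u, v)\, e(u, v),$$

where U × V is the size of the frequency-domain representation. A minimal PyTorch sketch of such a loss, mirroring the focal-frequency-loss pattern rather than the disclosure's exact equations, is:

```python
import torch

def frequency_loss(sr_freq, ref_freq, alpha=1.0):
    # Frequency-domain difference between the generated image and the
    # real value image (complex-valued).
    diff = sr_freq - ref_freq
    # Frequency-domain error: squared magnitude of the difference.
    error = diff.abs() ** 2
    # Error weight derived from the same difference; normalized and
    # detached so it only re-scales the error and does not itself
    # receive gradients. `alpha` is a hypothetical focusing parameter.
    weight = diff.abs() ** alpha
    weight = (weight / weight.max().clamp(min=1e-8)).detach()
    # Average weighted sum of the frequency-domain errors.
    return (weight * error).mean()
```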
In the embodiment shown in
It should be understood that although in the embodiment shown in
As shown in
In the process of generating the super-resolution image, the super-resolution image model will predict the pixel value of the pixel indicated by each spatial-temporal coordinate in an input video. First, the spatial-temporal coordinate 504 is randomly selected from the query coordinate map 502. The spatial-temporal coordinate 504 corresponds to the spatial-temporal coordinate (u, v, t) in the set consisting of the image 506 and the image 508. After obtaining the spliced feature map 512, a feature vector 514 corresponding to the spatial-temporal coordinate (u, v, t) is selected. Then, the feature vector 514 along with the spatial coordinate (u, v) in the spatial-temporal coordinate (u, v, t) are input to the spatial super-resolution sub-module 518 in the spatial-temporal super-resolution module of the super-resolution image model. In addition, the scaling factor 516 is also input to the spatial super-resolution sub-module 518. The spatial super-resolution sub-module 518 reconstructs the feature vector 514 into a set of features 520 with the number and super resolution indicated by the scaling factor 516. The set of features 520 along with the temporal coordinate (t) in the spatial-temporal coordinate (u, v, t) are input to the temporal super-resolution sub-module 524 in the spatial-temporal super-resolution module of the super-resolution image model. The temporal super-resolution sub-module 524 determines the spatial-temporal representation in the optical flow 526 corresponding to the pixel indicated by the spatial-temporal coordinate (u, v, t). After that, the spatial-temporal representation is input to the decoding sub-module 528 of the spatial-temporal super-resolution module, and the decoding sub-module 528 selects the feature vector 530 corresponding to the spatial-temporal coordinate from the feature map obtained based on the spatial-temporal representation. After that, the feature vector 530 along with the spatial-temporal coordinate are input to the spatial super-resolution sub-module 532 of the spatial super-resolution module to obtain the recovered features. Finally, the obtained features are input to the decoder 534 to obtain the pixel values corresponding to the spatial-temporal coordinate (u, v, t) at all the resolutions indicated by the scaling factor. After traversing all the pixel coordinates, super-resolution images 536 with different resolutions are obtained.
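The query flow just described can be summarized in a schematic sketch. Every sub-module below is an illustrative stand-in (simple MLPs, batch size 1, CPU tensors), not the patented architecture, and the coordinate handling is simplified accordingly.

```python
import torch
import torch.nn as nn

class SpatialTemporalSRQuery(nn.Module):
    # Illustrative stand-ins for the spatial super-resolution sub-module,
    # the temporal super-resolution sub-module, and the decoder.
    def __init__(self, feat_dim=64):
        super().__init__()
        self.spatial_sr = nn.Sequential(
            nn.Linear(feat_dim + 3, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.temporal_sr = nn.Sequential(
            nn.Linear(feat_dim + 1, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 3))

    def query_pixel(self, feat_map, u, v, t, scale):
        # Select the feature vector corresponding to the spatial-temporal
        # coordinate (u, v, t) from the spliced feature map (1, C, H, W).
        vec = feat_map[:, :, v, u]
        # Spatial sub-module: reconstruct features conditioned on the
        # spatial coordinate (u, v) and the scaling factor.
        cond = torch.tensor([[float(u), float(v), float(scale)]])
        h = self.spatial_sr(torch.cat([vec, cond], dim=-1))
        # Temporal sub-module: condition on the temporal coordinate (t).
        h = self.temporal_sr(
            torch.cat([h, torch.tensor([[float(t)]])], dim=-1))
        # Decode the recovered features into the pixel value at (u, v, t).
        return self.decoder(h)
```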
After that, the frequencies ω_i in Equations (6) and (7) are set within a frequency band [−B_i, B_i] to limit the output frequencies. In this embodiment, the spatial super-resolution model used is thus limited to specific frequencies. However, as shown above, information at some frequencies may be lost during the operation of the model, which degrades the results output by the model. Therefore, the optimization method based on the frequency-domain loss is especially suitable for a spatial super-resolution image model based on a band-limited network.
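Since Equations (6) and (7) themselves are not reproduced in this text, the following illustrates a typical band-limited coordinate projection of the kind such a network uses; the exact form is an assumption.

$$\gamma(x) = \big[ \sin(\omega_1 x + \phi_1), \ldots, \sin(\omega_n x + \phi_n) \big], \qquad \omega_i \in [-B_i, B_i],$$

so that the reconstruction is a combination of sinusoids whose frequencies are confined to the band [−B_i, B_i].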
Multiple components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard, a mouse, and the like; an output unit 707, such as various types of displays, speakers, and the like; the storage unit 708, such as a magnetic disk, an optical disk, and the like; and a communication unit 709, such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
Various procedures and processes described above, such as the methods 200 and 300, can be performed by the CPU 701. For example, in some embodiments, the methods 200 and 300 can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program can be loaded and/or installed on the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the CPU 701, one or more actions of the methods 200 and 300 described above can be performed.
Illustrative embodiments of the present disclosure include a method, apparatus, system, and/or computer program product. The computer program product may include a computer-readable storage medium on which computer-readable program instructions for performing various aspects of the present disclosure are loaded.
The computer-readable storage medium may be a tangible device that can maintain and store instructions to be used by an instruction execution device. For example, the computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, a RAM, a ROM, an Erasable Programmable Read-Only Memory (EPROM or flash memory), a static random access memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination thereof. The computer-readable storage medium used herein is not to be interpreted as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through optical fiber cables), or electrical signals transmitted through wires.
The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device through a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
Computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as “C” language or the like. Computer-readable program instructions may be completely executed on a user computer, partially executed on a user computer, executed as a stand-alone software package, partially executed on a user computer and partially executed on a remote computer, or completely executed on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to a user computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider). In some embodiments, by personalizing and customizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), using the state information of computer-readable program instructions, the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of the method, apparatus (system), and computer program product according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams and combinations of blocks in the flowcharts and/or block diagrams can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that these instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce means for implementing the functions/acts specified in one or more blocks in the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium, and cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium having the instructions stored thereon includes an article of manufacture including instructions for implementing various aspects of the functions/acts specified in one or more blocks in the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatuses, or other devices, such that a series of operational steps are performed on the computer, other programmable data processing apparatuses, or other devices to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatuses, or other devices implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the drawings show the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of an instruction, which contains one or more executable instructions for implementing specified logical functions. In some alternative implementations, the functions noted in the blocks may also occur in a different order than that noted in the drawings. For example, two successive blocks may actually be executed substantially in parallel, and they may sometimes be executed in a reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts and combination of blocks in the block diagrams and/or flowcharts can be implemented by a dedicated hardware-based system that performs specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.
Illustrative embodiments of the present disclosure have been described above, and the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Numerous modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terminology used herein is chosen to best explain the principles, practical applications and associated technical improvements of the various embodiments disclosed herein, so as to enable those of ordinary skill in the art to understand the various embodiments disclosed herein.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202410055130.9 | Jan. 12, 2024 | CN | national |