This application is based on and claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2022-0178683, filed on Dec. 19, 2022, in the Korean Intellectual Property Office, the contents of which are incorporated by reference herein in their entirety.
The inventive concepts relate to an interpolation model learning method and a device for learning an interpolation frame generating module.
An artificial intelligence (AI) system is a computer system which implements human-like learning; it is a system in which a machine autonomously makes determinations, performs learning, and improves. The recognition rate of an AI system is enhanced as its use increases, and recently, AI systems are being applied to various kinds of electronic devices and data processing systems.
Various kinds of neural network models based on machine learning or deep learning are being applied to the AI system. As neural network technology advances and hardware for reproducing and storing a high-resolution/high-quality image or a slow motion image is developed and supplied, the needs for methods and apparatuses for effectively generating an interpolation frame of an image by using a neural network are increasing.
The inventive concepts provide an interpolation model learning method and a device for learning an interpolation frame generating module, which may extract and compare temporal-spatial features of continuous frames to enhance the performance of an interpolation model, and a computer-readable non-transitory storage medium storing the interpolation model learning method.
According to an aspect of the inventive concepts, there is provided an interpolation model learning method including extracting, from a video including a plurality of continuous frames, a first ground truth (GT) frame and a plurality of frames; generating a first interpolation frame based on the plurality of frames by using an interpolation model; extracting a first temporal-spatial feature of a first frame group including at least three frames including the first GT frame, based on a first feature extraction model, the first feature extraction model being based on a neural network; extracting a second temporal-spatial feature of a second frame group including at least three frames including the first interpolation frame, based on the first feature extraction model; and training the interpolation model, based on the first temporal-spatial feature and the second temporal-spatial feature.
According to another aspect of the inventive concepts, there is provided a device configured to learn an interpolation frame generating model, the device including a memory storing a program configured to train the interpolation frame generating model; and a processor configured to execute the program stored in the memory, wherein the processor is configured to, by executing the program, extract a first frame, a second frame, and a first ground truth (GT) frame, temporally arranged between the first frame and the second frame, from a plurality of continuous frames and generate a first interpolation frame between the first frame and the second frame using the interpolation frame generating model, extract a first complex feature between a plurality of frames included in a first frame group, based on a first feature extraction model, the first frame group comprising the first GT frame and at least two frames, extract a second complex feature between a plurality of frames included in a second frame group, based on the first feature extraction model, the second frame group comprising the first interpolation frame and the at least two frames, and train the interpolation frame generating model, based on the first complex feature and the second complex feature.
According to another aspect of the inventive concepts, there is provided a non-transitory computer-readable storage medium storing instructions, which when executed by a processor, cause the processor to perform learning of an interpolation model using a plurality of continuous frames, the plurality of continuous frames comprise a first frame, a second frame, and a first ground truth (GT) frame temporally arranged between the first frame and the second frame, the instructions including generate a first interpolation frame between the first frame and the second frame using the interpolation model; extract a first temporal-spatial feature of a first frame group, including at least three frames including the first GT frame, using a neural network model for extracting a temporal-spatial feature between frames; extract a second temporal-spatial feature of a second frame group, including at least three frames including the first interpolation frame, using the neural network model; determine temporal-spatial loss information based on a difference between the first temporal-spatial feature and the second temporal-spatial feature; and train the interpolation model by using the temporal-spatial loss information.
Embodiments will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.
In the drawings, functional blocks denote elements that process (and/or perform) at least one function or operation and may be included in and/or implemented as processing circuitry such as hardware, software, or a combination of hardware and software. For example, the processing circuitry more specifically may include (and/or be included in), but is not limited to, a processor, a central processing unit (CPU), a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a System-on-Chip (SoC), a programmable logic unit, a microprocessor, an application-specific integrated circuit (ASIC), semiconductor elements in an integrated circuit, circuits enrolled as intellectual property (IP), etc.
The electronic system 100 of
The neural network may include various kinds of neural network models, such as GoogleNet, AlexNet, a convolutional neural network (CNN) such as a visual geometry group (VGG) network, a region with convolutional neural network (R-CNN), a region proposal network (RPN), a recurrent neural network (RNN), a stacking-based deep neural network (S-DNN), a state-space dynamic neural network (S-SDNN), a deconvolution network, a deep belief network (DBN), a restricted Boltzmann machine (RBM), a fully convolutional network, a long short-term memory (LSTM) network, a generative adversarial network (GAN), and/or a classification network; and/or may include linear and/or logistic regression, statistical clustering, Bayesian classification, decision trees, and/or the like, but is not limited thereto. Also, the neural network may include sub neural networks, and the sub neural networks may be implemented as heterogeneous neural networks.
The electronic system 100 of
Referring to
Video frame interpolation may be for generating an accurate intermediate frame (or interpolation frame) between two input frames. The performance of a video frame interpolation algorithm may be based on high-level inference quality corresponding to occlusion and an operation on two frames.
To accomplish high-level inference quality, the learning module 110 may train the neural network (for example, a deep learning model). In at least one embodiment, the learning module 110 may train the neural network model used by the interpolation frame generating module 120. Hereinafter, the neural network model used to generate the interpolation frame by using the interpolation frame generating module 120 may be referred to as an interpolation model (or an interpolation frame generating model).
The learning module 110 may include feature extraction models. For example, in at least one embodiment, the feature extraction models may include at least two of a three-dimensional (3D) feature extraction module (510 of
The learning module 110 may receive a frame from the training database 130 and the interpolation frame generating module 120 and may train the interpolation model by using the modules (510, 520, 610, and 620) described above, based on a temporal-spatial feature of a frame group including a plurality of frames or a spatial feature of the received frame. For example, various parameters (for example, bias, weight, etc.) of the neural network (see
The interpolation frame generating module 120 may generate the interpolation frame based on the interpolation model, and thus, an operation of training the interpolation model may be referred to as an operation of training the interpolation frame generating module 120.
The interpolation frame generating module 120 may receive a plurality of frames from the training database 130 to generate an interpolation frame between the plurality of frames, based on the interpolation model trained by the learning module 110.
Here, an interpolation frame may be a frame which is generated based on at least two continuous frames and may be temporally arranged between two frames. Because the interpolation frame is generated, the number of frames of a real-time rendering image or a conventional video (continuous frames) may increase, and a reduction in image quality such as shaking of a video may be prevented, thereby naturally displaying an image.
The interpolation frame is not a real (or actually) photographed frame; because the interpolation frame is generated based on real photographed frames, the interpolation frame may differ from a ground truth (GT) frame. In this case, as a spatial and/or temporal-spatial feature of the GT frame is better reflected in the interpolation frame (that is, as the spatial and/or temporal-spatial feature difference is reduced), the quality of an image based on the interpolation frame may increase.
Here, the GT frame may be a frame temporally corresponding to an interpolation frame and may be a right answer frame. Therefore, an interpolation model may be trained based on a comparison result obtained by comparing the interpolation frame with the GT frame, and thus, the performance of the interpolation frame generating module 120 may be enhanced.
Real photographed frames may be stored in the training database 130. The real photographed frame may include at least two frames for generating the GT frame and the interpolation frame corresponding to the GT frame.
In other embodiments, a plurality of interpolation frames and a plurality of GT frames may be used. In this case, the plurality of interpolation frames and the plurality of GT frames may each be frames temporally corresponding to each other.
The interpolation frame generating module 120 may extract characteristic information about an input frame, based on a trained neural network (for example, a neural network trained or retrained by the learning module 110), and may obtain an interpolation frame based on the extracted characteristic information. For example, in order to perform a task needed for the electronic system 100, the interpolation frame generating module 120 may perform inference on an input image based on the neural network. Operations of the neural network performed in an inference process may be performed by a separate accelerator (for example, a graphics processing unit (GPU), a neural processing unit (NPU), a digital signal processor (DSP), or a field programmable gate array (FPGA)).
The learning module 110 and the interpolation frame generating module 120 may be implemented as software, hardware, or a combination of software and hardware. In at least one embodiment, the learning module 110 and the interpolation frame generating module 120 may be implemented as an operating system (OS) or a software type at a lower end thereof, and moreover, may be implemented as programs capable of being loaded into a memory included in the electronic system 100 and may be executed by at least one processor of the electronic system 100.
Referring to
The neural network NN may be an n-layer neural network or a deep neural network (DNN) including two or more hidden layers. For example, as illustrated in
A plurality of layers may be implemented as a convolution layer, a fully-connected layer, and a softmax layer. For example, the convolution layer may include convolution, pooling, and an activation function operation. Alternatively, each of the convolution, the pooling, and the activation function operation may constitute a separate layer.
An output of each of the plurality of layers 10, 12, 14, and 16 may be referred to as a feature (or a feature map). The plurality of layers 10, 12, 14, and 16 may receive a feature, generated by a previous layer, as an input feature and may perform an operation on the input feature to generate an output feature or an output signal. The feature may denote data expressing various features of input data recognizable by the neural network NN.
The neural network NN may have a DNN structure including a number of layers for extracting valid information and may thus process complicated data sets. Furthermore, although the neural network NN is illustrated as including four layers 10, 12, 14, and 16, this is only an example, and the neural network NN may include fewer or more layers. Also, the neural network NN may include various layers having different structures, unlike the illustration of
Each of the plurality of layers 10, 12, 14, and 16 included in the neural network NN may include a plurality of neurons. A neuron may correspond to a processing element (PE), a unit, or an artificial node, or may be known by similar terms. For example, as illustrated in
Neurons included in each of the plurality of layers 10, 12, 14, and 16 of the neural network NN may be connected to one another and may exchange data. One neuron may receive data from other neurons, perform an operation on the data, and output an operation result to other neurons.
An input and an output of each neuron (node) may be referred to as an input activation and an output activation, respectively. For example, an activation may be a parameter which is an output of one neuron and simultaneously corresponds to an input of neurons included in a next layer. Each neuron may determine an activation thereof, based on weights (for example, w1,12, w1,22, w2,12, w2,22, w3,12, w3,22, etc.), biases (for example, b12, b22, b32, etc.), and activations (for example, a12, a22, a32, etc.) received from neurons included in a previous layer. A weight and a bias may each be referred to as a parameter (or an operation parameter) used to calculate an output activation in each neuron. For example, a weight may represent a weight applied to output activations (for example, a11 and a21) of a previous layer, and a bias may represent a value added before an activation function is applied in each neuron. As described above with reference to
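As a concrete illustration of the activation computation described above, the following sketch computes one neuron's output activation from the previous layer's output activations, a weight vector, and a bias. The ReLU nonlinearity and the specific numbers are assumptions for illustration; the document does not fix a particular activation function.

```python
import numpy as np

def neuron_output(prev_activations, weights, bias):
    # Weighted sum of the previous layer's output activations plus a bias,
    # passed through a nonlinearity (ReLU assumed here), yields this
    # neuron's output activation.
    pre_activation = np.dot(weights, prev_activations) + bias
    return np.maximum(pre_activation, 0.0)

a_prev = np.array([0.5, -1.0, 2.0])  # output activations of the previous layer
w = np.array([0.2, 0.4, 0.1])        # weights into this neuron
b = 0.05                             # bias of this neuron
out = neuron_output(a_prev, w, b)    # weighted sum is -0.05, so ReLU clips to 0.0
```

In a full layer, this computation is repeated for every neuron, with each neuron holding its own weight vector and bias.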
Parameters such as a weight and a bias of an interpolation model of the interpolation frame generating module 120 may be changed through learning for decreasing loss information, based on temporal-spatial loss information (ℒ3D in
Referring to
The video 300 may include a plurality of frames, including a first frame (I0) 310, a second frame (I1) 320, and a first GT frame (I0.5) 330. The first GT frame 330 may be temporally arranged between the first frame 310 and the second frame 320. For example, when the first frame 310 is a frame at t=0 and the second frame 320 is a frame at t=1, the first GT frame 330 may be a frame at t=0.5. In this case, “t” may be for relatively expressing a temporal position of a frame, and t may (or may not) be a positive integer.
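The extraction of such a triplet can be sketched as follows; the video is modeled as an array of continuous frames, and the frame size and starting index are hypothetical.

```python
import numpy as np

# Hedged sketch: a video as an array of continuous frames, from which a
# first frame (relative time t=0), a first GT frame (t=0.5), and a second
# frame (t=1) are extracted. The 4x4 frame size is an assumption.
video = np.arange(10 * 4 * 4, dtype=np.float32).reshape(10, 4, 4)

def extract_triplet(frames, start):
    # Three consecutive frames: the outer two are the interpolation model's
    # inputs, and the middle one serves as the GT frame.
    return frames[start], frames[start + 1], frames[start + 2]

first, gt, second = extract_triplet(video, 0)
```

Here "t" is relative: the GT frame sits at t=0.5 between the first frame (t=0) and the second frame (t=1), even though all three are stored at integer indices in the array.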
The first frame 310, the second frame 320, and the first GT frame 330 from the video 300 may be extracted for generating an interpolation frame and learning of an interpolation model based on the neural network NN of the interpolation frame generating module 120.
As described above, the first GT frame 330 may be a frame corresponding to a first interpolation frame (340 of
An interpolation frame generating module 120 may generate a first interpolation frame (Î0.5) 340 through an interpolation model, based on a first frame 310 and a second frame 320.
The first interpolation frame 340 may be a frame temporally corresponding to the first GT frame 330. For example, when the first frame 310 is a frame at t=0 and the second frame 320 is a frame at t=1, the first interpolation frame 340 and the first GT frame 330 may each be a frame at t=0.5. Therefore, the first interpolation frame 340 may be a frame corresponding to the first GT frame 330.
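For intuition only, the simplest conceivable stand-in for an interpolation model is a linear blend of the two input frames at relative time t. The trained neural-network interpolation model of this document is far more capable; this baseline is purely a hypothetical point of reference.

```python
import numpy as np

def naive_interpolate(frame0, frame1, t=0.5):
    # Linear blend at relative time t: t=0 returns frame0, t=1 returns
    # frame1, and t=0.5 yields a frame temporally corresponding to the
    # GT frame, as described above.
    return (1.0 - t) * frame0 + t * frame1

f0 = np.zeros((2, 2), dtype=np.float32)
f1 = np.ones((2, 2), dtype=np.float32)
mid = naive_interpolate(f0, f1)  # hypothetical frame at t = 0.5
```

Such a blend handles no occlusion or motion, which is precisely why a learned model, trained against the GT frame, is used instead.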
The number of frames input to the interpolation frame generating module 120 for generating an interpolation frame may be two or more. Referring to
Although not shown in
The interpolation frame generating module 120 may generate the interpolation frame based on an interpolation model having a structure of the neural network NN described above with reference to
An interpolation model of the interpolation frame generating module 120 may be trained by extracting and comparing features of the first interpolation frame 340, generated by the interpolation frame generating module 120, and the first GT frame 330. For example, weights and/or biases of a neural network model (an interpolation model) of the interpolation frame generating module 120 may be adjusted based on a result of comparing a temporal-spatial feature of a first frame group 350 including the first GT frame 330 with that of a second frame group 360 including the first interpolation frame 340, and/or a result of comparing a spatial feature of the first interpolation frame 340 with that of the first GT frame 330. Detailed description thereof is given with reference to
The learning module 110 of
Here, the first frame group 350 may include at least three frames including at least one GT frame, and the second frame group 360 may include at least three frames including at least one interpolation frame. For example, the first frame group 350 may include a first frame 310, a second frame 320, and a first GT frame 330, and the second frame group 360 may include the first frame 310, the second frame 320, and a first interpolation frame 340.
The 3D feature extraction module 510 may extract a temporal-spatial feature of continuous frames by using a neural network model (a first feature extraction model) which extracts the temporal-spatial feature. Here, the temporal-spatial feature may be a feature map of the first feature extraction model. The first feature extraction model may be a trained 3D CNN. For example, the neural network model (the trained 3D CNN) extracting the temporal-spatial feature may include a plurality of convolution layers, and the feature map may correspond to a result of a 3D convolution between input data of a convolution layer and a filter. The 3D convolution may have a structure where adjacent frames (for example, three continuous frames) are combined by a convolution operation, and thus, a feature map (or a 3D feature map) including the temporal-spatial feature based on the continuous frames may be output. Therefore, an operation of extracting the 3D feature map may be referred to as an operation of extracting the temporal-spatial feature.
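The way a 3D convolution combines adjacent frames can be sketched as follows: the filter spans three continuous frames, so each output value mixes temporal and spatial neighborhoods. The filter values, frame sizes, and valid (no) padding are assumptions for illustration.

```python
import numpy as np

def conv3d_valid(frames, kernel):
    # Naive 3D convolution (valid padding): the kernel slides over the
    # time, height, and width axes simultaneously, so every output value
    # depends on several adjacent frames - a temporal-spatial feature.
    T, H, W = frames.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1), dtype=np.float32)
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = frames[t:t + kt, i:i + kh, j:j + kw]
                out[t, i, j] = np.sum(patch * kernel)
    return out

clip = np.ones((3, 4, 4), dtype=np.float32)            # three continuous frames
k = np.full((3, 3, 3), 1.0 / 27.0, dtype=np.float32)   # averaging filter over time and space
feature_map = conv3d_valid(clip, k)                    # 1 x 2 x 2 temporal-spatial feature map
```

A trained 3D CNN stacks many such layers with learned filters; this single hand-written filter only shows why the output depends on time as well as space.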
Referring to
For example, in a case where the 3D feature extraction module 510 extracts the first temporal-spatial feature (F3D) and the second temporal-spatial feature ({circumflex over (F)}3D) based on ResNet3D (R3D), which is a neural network model for extracting a temporal-spatial feature, the 3D loss calculation module 520 may calculate temporal-spatial loss information (ℒ3D) by using the following first loss calculation function (Equation 1).
Equation 1 may be the first loss calculation function for calculating temporal-spatial loss information (ℒ3D) based on a temporal-spatial feature difference (a difference between F3D and {circumflex over (F)}3D) in a case where the 3D feature extraction module 510 extracts a temporal-spatial feature of a frame group by using the R3D (the first feature extraction model). The first loss calculation function based on Equation 1 may represent a 3D perceptual loss function. The 3D perceptual loss function may denote an operation of comparing feature maps which are obtained by allowing an image to pass through a CNN model.
Here, Sgt represents a frame group including at least three frames including a GT frame, and Ŝt represents a frame group including at least three frames including an interpolation frame. Therefore, Sgt may be the first frame group 350, and Ŝt may be the second frame group 360.
Also, σ(Sgt) represents a 3D feature map including a temporal-spatial feature extracted from the second block of the R3D when a frame group (the first frame group 350) including a GT frame is input to the R3D. That is, σ(Sgt) may correspond to the first temporal-spatial feature (F3D).
Also, σ(Ŝt) may be a 3D feature map including a temporal-spatial feature extracted from the second block of the R3D when a frame group (the second frame group 360) including an interpolation frame is input to the R3D. That is, σ(Ŝt) may correspond to the second temporal-spatial feature ({circumflex over (F)}3D). Therefore, the 3D feature extraction module 510 may extract the first temporal-spatial feature (σ(Sgt) and F3D of
The 3D loss calculation module 520 may calculate temporal-spatial loss information (ℒ3D) based on the first loss calculation function (Equation 1). Referring to Equation 1, ℒ3D represents a norm for a difference between the first temporal-spatial feature (σ(Sgt)) and the second temporal-spatial feature (σ(Ŝt)). That is, ℒ3D may be temporal-spatial loss information based on a difference between the first temporal-spatial feature (F3D) and the second temporal-spatial feature ({circumflex over (F)}3D), and the learning module 110 may change a weight and/or a bias of a neural network of the interpolation frame generating module 120 so that the temporal-spatial loss information (ℒ3D) becomes 0 (e.g., so that the first temporal-spatial feature (σ(Sgt)) is the same as the second temporal-spatial feature (σ(Ŝt))).
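The body of Equation 1 does not survive in this text; from the surrounding description it can be reconstructed, under stated assumptions, as a norm of the difference between the two 3D feature maps, ℒ3D = ‖σ(Sgt) − σ(Ŝt)‖. The sketch below uses an L1 norm, which is an assumption; the text only specifies a norm for a difference.

```python
import numpy as np

def temporal_spatial_loss(feat_gt, feat_interp):
    # Hedged reconstruction of Equation 1: a norm (L1 assumed) of the
    # difference between the GT frame group's 3D feature map and the
    # interpolation frame group's 3D feature map.
    return np.sum(np.abs(feat_gt - feat_interp))

f_gt = np.array([[1.0, 2.0], [3.0, 4.0]])    # stand-in for sigma(S_gt), i.e. F3D
f_hat = np.array([[1.0, 2.5], [2.0, 4.0]])   # stand-in for sigma(S_hat_t), i.e. F3D-hat
loss_3d = temporal_spatial_loss(f_gt, f_hat)  # |0| + |0.5| + |1.0| + |0| = 1.5
```

The loss is zero exactly when the two feature maps agree, which is the condition toward which the learning module drives the interpolation model.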
In describing
As described above, the learning module 110 may calculate the temporal-spatial loss information (ℒ3D) between the first frame group 350 and the second frame group 360, based on the extracted temporal-spatial features. For example, because the first frame group 350 includes real photographed frames (the first frame 310, the second frame 320, and the first GT frame 330), the first temporal-spatial feature (F3D) may be a real value (a right answer), and because the second frame group 360 includes the first interpolation frame 340 generated by the interpolation frame generating module 120, the second temporal-spatial feature ({circumflex over (F)}3D) may be a prediction value. Therefore, the temporal-spatial loss information (ℒ3D) may be an indicator representing the degree to which an interpolation frame generated by the interpolation frame generating module 120 expresses a temporal-spatial feature of a real frame, and the learning module 110 may correct a weight and/or a bias of a neural network NN of the interpolation frame generating module 120 so that the temporal-spatial loss information (ℒ3D) converges to 0.
The 3D feature extraction module 510 may extract a complex feature including a temporal-spatial feature of a frame group. For example, the 3D feature extraction module 510 may additionally extract a reconstruction feature, a warping feature, and a distillation feature along with a temporal-spatial feature by using the first feature extraction model.
Also, the 3D loss calculation module 520 may further reflect losses ℒrec, ℒwarp, and ℒdis to calculate complex loss information including the temporal-spatial loss information, by using a first loss calculation function such as Equation 2.
ℒrec represents reconstruction loss based on a difference between a reconstruction feature obtained by inputting the first frame group 350 to the first feature extraction model and a reconstruction feature obtained by inputting the second frame group 360 to the first feature extraction model. ℒwarp represents warping loss based on a difference between a warping feature obtained by inputting the first frame group 350 to the first feature extraction model and a warping feature obtained by inputting the second frame group 360 to the first feature extraction model. ℒdis represents distillation loss based on a difference between a distillation feature obtained by inputting the first frame group 350 to the first feature extraction model and a distillation feature obtained by inputting the second frame group 360 to the first feature extraction model. λ1, λ2, λ3, and λ4 may each be a relative priority of each loss, e.g., defined by a user. For example, loss information calculated by the first loss calculation function (Equation 2) may be complex loss information including temporal-spatial loss information. In at least one example, the first loss calculation function may denote a function which calculates a difference between complex features including a temporal-spatial feature.
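The body of Equation 2 is likewise absent from this text; from the description it can be reconstructed, hypothetically, as a weighted sum ℒ = λ1·ℒ3D + λ2·ℒrec + λ3·ℒwarp + λ4·ℒdis. The lambda values below are placeholder user-defined priorities, not values given by the document.

```python
def complex_loss(l_3d, l_rec, l_warp, l_dis, lambdas=(1.0, 1.0, 0.5, 0.5)):
    # Hedged reconstruction of Equation 2: each individual loss term is
    # scaled by its user-defined priority lambda and the results are summed
    # into one complex loss value.
    l1, l2, l3, l4 = lambdas
    return l1 * l_3d + l2 * l_rec + l3 * l_warp + l4 * l_dis

total = complex_loss(1.5, 0.2, 0.4, 0.8)  # 1.5 + 0.2 + 0.2 + 0.4 = 2.3
```

Raising one lambda relative to the others makes the corresponding feature difference dominate the gradient during training, which is how the relative priorities take effect.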
The first loss calculation function based on Equation 1 or Equation 2 is only an example, and embodiments are not limited thereto; another loss function for calculating a difference between complex features including a temporal-spatial feature, a difference between temporal-spatial features, or a combination thereof may be used as the first loss calculation function.
Referring to
The first frame group 350 may include at least three frames. As described above, this may be because at least three frames are needed for extracting a temporal-spatial feature. Likewise, the second frame group 360 described below may include at least three frames.
Referring to
Referring to
The learning module 110 of
As described above with reference to
The 2D feature extraction module 610 may extract a first spatial feature (F2D) of the first GT frame 330 and a second spatial feature ({circumflex over (F)}2D) of the first interpolation frame 340 by using a model (a second feature extraction model) for extracting a spatial feature. Here, the second feature extraction model may be a model which calculates a difference between pixel values of frames or may be a neural network model described below, but is not limited thereto.
For example, in a case where the 2D feature extraction module 610 uses a neural network model as the second feature extraction model so as to extract a spatial feature, the spatial feature may be a feature map of the neural network model which extracts the spatial feature. For example, the neural network model which extracts the spatial feature may be a visual geometry group (VGG) network, and the VGG may include a plurality of convolution layers. In this case, the feature map may correspond to a result of a 2D convolution between input data of a convolution layer and a filter. In a 2D convolution, a temporal feature between frames may be lost, but a spatial feature may be maintained. Therefore, a feature map (or a 2D feature map) including a spatial feature may be output by the 2D convolution, and an operation of extracting the 2D feature map may be referred to as an operation of extracting the spatial feature.
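For contrast with the 3D case, the following sketch applies a 2D convolution to a single frame; with no temporal axis in the input, the resulting feature map can carry only spatial information. Sizes and filter values are again assumptions.

```python
import numpy as np

def conv2d_valid(frame, kernel):
    # Naive 2D convolution (valid padding) over one frame: the kernel
    # slides over height and width only, so the output feature map
    # retains spatial structure but carries no temporal information.
    H, W = frame.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i + kh, j:j + kw] * kernel)
    return out

frame = np.ones((4, 4), dtype=np.float32)            # a single frame: no time axis
k = np.full((3, 3), 1.0 / 9.0, dtype=np.float32)     # averaging filter over space only
spatial_map = conv2d_valid(frame, k)                 # 2 x 2 spatial feature map
```

This is why the 2D path of the learning module compares individual frames (GT versus interpolation), while the 3D path compares whole frame groups.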
The 2D feature extraction module 610 may extract a first spatial feature (F2D) of the first GT frame 330 and a second spatial feature ({circumflex over (F)}2D) of the first interpolation frame 340 by using the second feature extraction model. Also, the 2D loss calculation module 620 may calculate spatial loss information (ℒvgg) based on a difference between the first spatial feature (F2D) and the second spatial feature ({circumflex over (F)}2D).
For example, in a case where the 2D feature extraction module 610 extracts the first spatial feature (F2D) and the second spatial feature ({circumflex over (F)}2D) by using the VGG, which is a neural network model, as the second feature extraction model, the 2D loss calculation module 620 may calculate spatial loss information (ℒvgg) based on the following second loss calculation function (Equation 3).
In a case where the 2D feature extraction module 610 extracts a spatial feature of a frame by using the VGG, the second loss calculation function (Equation 3) may be a 2D perceptual loss function for calculating spatial loss information (ℒvgg) based on a spatial feature difference (a difference between F2D and {circumflex over (F)}2D).
Here, Igt represents a GT frame, and Ît may represent an interpolation frame. Therefore, Igt may be a first GT frame 330, and Ît may be a first interpolation frame 340.
ϕ(Igt) represents a 2D feature map including a spatial feature extracted from a fourth convolution layer of a fifth block of the VGG when the GT frame is input to the VGG. Here, the VGG may be a neural network including a plurality of blocks. Each of the plurality of blocks may include at least one convolution layer and/or max pooling layer. The fourth convolution layer of the fifth block of the VGG described herein may correspond to a final convolution layer of the neural network. That is, ϕ(Igt) corresponds to a first spatial feature (F2D) of
ϕ(Ît) represents a 2D feature map including a spatial feature extracted from the fourth convolution layer of the fifth block of the VGG when the interpolation frame is input to the VGG. That is, ϕ(Ît) may correspond to the second spatial feature ({circumflex over (F)}2D) of
The 2D loss calculation module 620 is configured to calculate spatial loss information (ℒvgg) based on the second loss calculation function (Equation 3). Referring to Equation 3, ℒvgg may be a norm for a difference between the first spatial feature (ϕ(Igt)) and the second spatial feature (ϕ(Ît)). That is, ℒvgg may be spatial loss information based on a difference between the first spatial feature (F2D) and the second spatial feature ({circumflex over (F)}2D), and the learning module 110 may change a weight and/or a bias of a neural network of the interpolation frame generating module 120 so that the spatial loss information (ℒvgg) is 0.
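As with Equation 1, the body of Equation 3 is not present in this text; from the description it can be sketched as ℒvgg = ‖ϕ(Igt) − ϕ(Ît)‖, again with the L1 norm as an assumption.

```python
import numpy as np

def spatial_loss(phi_gt, phi_interp):
    # Hedged reconstruction of Equation 3: a norm (L1 assumed) of the
    # difference between the GT frame's 2D feature map and the
    # interpolation frame's 2D feature map.
    return np.sum(np.abs(phi_gt - phi_interp))

phi_gt = np.array([0.2, 0.8, 0.5])   # stand-in for phi(I_gt), the first spatial feature
phi_hat = np.array([0.1, 0.8, 0.9])  # stand-in for phi(I_hat_t), the second spatial feature
loss_2d = spatial_loss(phi_gt, phi_hat)  # 0.1 + 0.0 + 0.4 = 0.5
```

Note the symmetry with the temporal-spatial loss: the only structural difference is that the features here come from single frames through a 2D network rather than from frame groups through a 3D network.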
To summarize the above description, the learning module 110 may calculate spatial loss information (ℒvgg) between the first GT frame 330 and the first interpolation frame 340, based on the extracted spatial features. For example, because the first GT frame 330 is a real photographed frame, the first spatial feature (F2D) may be a real value (a right answer), and because the first interpolation frame 340 is a frame generated by the interpolation frame generating module 120, the second spatial feature ({circumflex over (F)}2D) may be a prediction value. Therefore, the spatial loss information (ℒvgg) may be an indicator representing the degree to which an interpolation frame generated by the interpolation frame generating module 120 expresses a spatial feature of a real frame, and the learning module 110 may correct a weight and/or a bias of a neural network NN of the interpolation frame generating module 120 so that the spatial loss information (ℒvgg) converges to 0.
Although
An interpolation frame generating module 120 may generate a first interpolation frame 340 by using an interpolation model based on a neural network, based on a first frame 310 and a second frame 320.
The first interpolation frame 340 may be included in a second frame group 360 along with a first frame 310 and a second frame 320, and the first frame 310, the second frame 320, and a first GT frame 330 may be included in a first frame group 350.
The learning module 110 may calculate temporal-spatial loss information (ℒ3D) between the first frame group 350 and the second frame group 360 by using a neural network model (for example, a first feature extraction model) which extracts a temporal-spatial feature. The 3D feature extraction module 510 may extract each of a first temporal-spatial feature (F3D) of the first frame group 350 and a second temporal-spatial feature ({circumflex over (F)}3D) of the second frame group 360 by using the first feature extraction model (for example, the trained 3D CNN). The 3D loss calculation module 520 may calculate the temporal-spatial loss information (ℒ3D) by using a first loss calculation function (for example, Equation 1), based on a difference between the first temporal-spatial feature (F3D) and the second temporal-spatial feature ({circumflex over (F)}3D).
The learning module 110 may train the interpolation frame generating module 120 so as to decrease the calculated temporal-spatial loss information (L3D). In other words, a weight and/or a bias of a neural network of the interpolation model may be changed so that the temporal-spatial loss information (L3D) is reduced.
Also, the learning module 110 may calculate spatial loss information (L2D) between the first GT frame 330 and the first interpolation frame 340 by using the 2D feature extraction module 610 and the 2D loss calculation module 620. That is, the 2D feature extraction module 610 may extract each of a first spatial feature (F2D) of the first GT frame 330 and a second spatial feature (F̂2D) of the first interpolation frame 340 by using a second feature extraction model (for example, the VGG), and the 2D loss calculation module 620 may calculate the spatial loss information (L2D) by using a second loss calculation function (for example, Equation 3), based on a difference between the first spatial feature (F2D) and the second spatial feature (F̂2D).
The learning module 110 may train the interpolation frame generating module 120 so as to decrease the spatial loss information (L2D) as well as the temporal-spatial loss information (L3D). For example, a weight and/or a bias of a neural network of the interpolation model may be changed so that both losses are reduced.
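A joint training step over both losses can be sketched on a toy model. Everything below is an assumption for illustration: the single blending weight stands in for the neural interpolation model, the pixel-level and motion-level terms stand in for the 2D and 3D feature losses, and the equal weighting of the two terms is a guess; the document only states that both losses are decreased together.

```python
import numpy as np

def total_loss(w, f1, f2, gt, lam=1.0):
    """Combined objective: a 'spatial' term (interpolation vs. GT frame)
    plus a 'temporal' term (motion to each neighbor vs. real motion).
    The weighting lam is an assumed hyperparameter."""
    interp = w * f1 + (1 - w) * f2          # toy interpolation model
    l2d = np.mean((interp - gt) ** 2)       # spatial-loss stand-in
    l3d = (np.mean(((gt - f1) - (interp - f1)) ** 2)
           + np.mean(((f2 - gt) - (f2 - interp)) ** 2))  # temporal stand-in
    return l2d + lam * l3d

def train_step(w, f1, f2, gt, lr=0.05, eps=1e-4):
    """One gradient-descent step on the blend weight, using a central
    numerical gradient so the sketch needs no autograd framework."""
    g = (total_loss(w + eps, f1, f2, gt)
         - total_loss(w - eps, f1, f2, gt)) / (2 * eps)
    return w - lr * g
```

Repeating `train_step` drives the combined loss down, mirroring how the learning module adjusts the weights and/or biases of the interpolation model.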
Referring to
An interpolation frame generating module (120 of
A 3D feature extraction module (510 of
The 3D feature extraction module (510 of
The interpolation model may be trained based on the first temporal-spatial feature (F3D) and the second temporal-spatial feature (F̂3D) in operation S140. In at least one embodiment, operations S100 to S140 may be repeated at different times of the video such that the interpolation model is trained on a plurality of different frames representing different times during the video.
Referring to
The 2D feature extraction module (610 of
The learning module 110 may train the interpolation model, based on the first spatial
feature (F2D of
Referring to
The processor 1100 may include a connection path (for example, a bus) which transfers or receives a signal to or from other elements such as one or more cores (not shown) and a GPU (not shown) and/or a training database (130 of
The processor 1100 may perform an operation of a learning module (110 of
The processor 1100 may compare a temporal-spatial feature of a first frame group (350 of
In at least one embodiment, the processor 1100 may compare a spatial feature of a first interpolation frame (340 of
Furthermore, the processor 1100 may further include random access memory (RAM) (not shown) and read-only memory (ROM) (not shown), which temporarily and/or permanently store a signal (or data) processed by the processor 1100. Also, the processor 1100 may be implemented as a system on chip (SoC) type including at least one of a GPU, RAM, and ROM.
The memory 1200 may store programs (one or more instructions) for processing and control by the processor 1100. The memory 1200 may include the learning module 110 described above with reference to
The device 2000 may include an integrated circuit 2100 and elements (for example, a sensor 2200, a display device 2300, and a memory 2400) connected with the integrated circuit 2100. The device 2000 may be a device which processes data based on a neural network. For example, the device 2000 may be a data server or a mobile device such as a smartphone, a game machine, an advanced driver assistance system (ADAS), or a wearable device.
The integrated circuit 2100 according to at least one embodiment may include a CPU 2110, random access memory (RAM) 2120, a GPU 2130, a computing device 2140, a sensor interface 2150, a display interface 2160, and a memory interface 2170. Furthermore, the integrated circuit 2100 may further include other general-use elements such as a communication module, a digital signal processor (DSP), and/or a video module; and the CPU 2110, the RAM 2120, the GPU 2130, the computing device 2140, the sensor interface 2150, the display interface 2160, and the memory interface 2170 of the integrated circuit 2100 may transfer or receive data therebetween through a bus 2180. In at least one embodiment, the integrated circuit 2100 may be an AP. In at least one embodiment, the integrated circuit 2100 may be implemented as an SoC (system-on-a-chip).
The CPU 2110 may control the overall operation of the integrated circuit 2100. The CPU 2110 may include one processor core (single core), or may include a plurality of processor cores (multi-core). The CPU 2110 may process or execute data and/or instructions stored in the memory 2400. In at least one embodiment, the CPU 2110 may execute programs stored in the memory 2400 and may thus perform an interpolation model learning method according to embodiments.
The RAM 2120 may temporarily store programs, data, and/or instructions. According to at least one embodiment, the RAM 2120 may be implemented as dynamic RAM (DRAM) or static RAM (SRAM). The RAM 2120 may temporarily store data (for example, image data) which is input/output through the sensor interface 2150 and the display interface 2160, or is generated by the GPU 2130 or the CPU 2110.
In at least one embodiment, the integrated circuit 2100 may further include read-only memory (ROM). The ROM may store programs and/or data which are/is used continuously. The ROM may be implemented as erasable programmable ROM (EPROM) or electrically erasable programmable ROM (EEPROM).
The GPU 2130 may perform image processing on image data. For example, the GPU 2130 may perform image processing on the image data received through the sensor interface 2150. Image data obtained through processing by the GPU 2130 may be stored in the memory 2400, or may be provided to the display device 2300 through the display interface 2160.
The computing device 2140 may include an accelerator for performing an operation of a neural network. For example, the computing device 2140 may include an NPU. In at least one embodiment, the GPU 2130 or the computing device 2140 may perform an operation of the neural network in a learning process or a data recognition process of the neural network.
The sensor interface 2150 may receive data (for example, image data, sound data, etc.) input from the sensor 2200 connected with the integrated circuit 2100.
The display interface 2160 may output data (for example, an image) to the display device 2300. The display device 2300 may output image data or video data through a display such as a liquid crystal display (LCD) or active matrix organic light emitting diodes (AMOLED).
The memory interface 2170 may interface data input from the memory 2400 outside the integrated circuit 2100 or data output to the memory 2400. According to at least one embodiment, the memory 2400 may be implemented as a volatile memory, such as DRAM or SRAM, or a non-volatile memory such as resistive random access memory (ReRAM), phase-change random access memory (PRAM), or NAND flash. The memory 2400 may be implemented as a memory card (a multimedia card (MMC), an embedded multi-media card (eMMC), a secure digital (SD) card, or a micro SD card).
The integrated circuit 2100 may be an AP, the memory 2400 may include a trained interpolation model (or an interpolation frame generating model) according to at least one embodiment described above with reference to
In at least one embodiment, the integrated circuit 2100 may generate training data and may learn the interpolation model based on the training data. For example, the integrated circuit 2100 may learn (or relearn) the interpolation model according to at least one embodiment described above with reference to
Embodiments have been described above in the drawings and the specification. The terms used herein are merely for describing the inventive concepts and are not intended to limit the meaning or the scope of the inventive concepts defined in the following claims. Therefore, it will be understood by those of ordinary skill in the art that various modifications and other equivalent embodiments may be implemented from the inventive concepts. Accordingly, the spirit and scope of the inventive concepts should be defined based on the following claims.
While the inventive concepts have been particularly shown and described with reference to embodiments thereof, it will be understood that various changes in form and details may be made therein without departing from the spirit and scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
10-2022-0178683 | Dec 2022 | KR | national |