Aspects of the present disclosure relate to super sampling; more specifically, aspects of the present disclosure relate to frame generation super sampling.
Computer graphics systems can generate images from geometric data. Computer graphics systems are commonly used in graphics-intensive applications such as video games. Recently, artificial intelligence (AI) has been applied to real-time rendering of graphics to construct sharp, higher resolution images. For example, Nvidia's Deep Learning Super Sampling (DLSS) 3 implementation inserts a synthetic frame generated by a machine learning model between the last GPU-rendered frame and the frame before it prior to display. This adds considerable latency because the device must render both the last frame and the prior frame, and it still results in a variable frame rate even though it produces higher frame rates. Changes in the variable frame rate are very noticeable to users and are sometimes referred to as “hitching.” These changes in frame rate degrade the user experience, as the user may experience frame rate slow-down in processing-intensive application states while experiencing higher frame rates during less processing-intensive tasks.
It is within this context that aspects of the present disclosure arise.
The teachings of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
Although the following detailed description contains many specific details for the purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to the following details are within the scope of the disclosure. Accordingly, examples of embodiments of the disclosure described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed disclosure.
A computer graphics system typically renders a frame as fast as computing power and software allow. With a variable frame rate, the monitor and the computer can get out of sync, causing a visual artifact known as “tearing.” In monitors with V-sync or a similar capability connected to a computer, the monitor can tell the computer when it will render a frame or vice versa. Alternatively, the computer can render two real frames and artificially render an intermediate frame between them, e.g., using motion vectors. This is called variable render rate (VRR) rendering. VRR still has an issue with application latency, as the render rate is limited by the creation speed of the last real frame. VRR does avoid screen tearing, but the result is not as good as rendering with a fixed frame rate, and render rate changes are quite noticeable to the user. To reach a fixed frame rate, systems are configured to reduce the overall rendering quality until the target fixed rate can be reached for both more processing-intensive frames and less processing-intensive frames.
According to aspects of the present disclosure, a computer graphics system may operate at a real fixed frame rate and generate one or more synthetic frames after each real frame or after each two or more real frames according to a synthetic frame insertion interval. Synthetic frames may be generated from the prior two real frames using algorithmic frame generation or neural network models trained with a machine learning algorithm to predict synthetic frames. A subsequent frame may be generated after the synthetic frame and then displayed. In some alternative implementations the synthetic frame may be created using a last frame and motion vectors generated from images in an image stream. This improved frame generation may be unconstrained by the generation speed of the last real frame as the synthetic frames are generated after the last real frame and before creation of a subsequent real frame.
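By way of illustration and not by way of limitation, the fixed-rate presentation scheme described above may be sketched in Python roughly as follows. The callables render_real_frame, generate_synthetic_frames, and present are hypothetical placeholders supplied by the application; they are not part of any particular API.

```python
# A minimal sketch of a fixed frame rate loop with synthetic frame insertion.
# The callables passed in are hypothetical placeholders, not an actual API.
def run_fixed_rate_loop(render_real_frame, generate_synthetic_frames, present,
                        num_frames, insertion_interval=1):
    """Present real frames at a fixed rate and, after every
    `insertion_interval` real frames, display one or more synthetic frames
    generated from the prior two real frames before the next real frame."""
    previous_frame = None
    for i in range(num_frames):
        frame = render_real_frame()      # real frame from the graphics pipeline
        present(frame)                   # display the real frame
        if previous_frame is not None and (i + 1) % insertion_interval == 0:
            # Synthetic frames are generated after the last real frame and
            # before creation of the subsequent real frame.
            for synthetic in generate_synthetic_frames(previous_frame, frame):
                present(synthetic)
        previous_frame = frame
```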
According to aspects of the present disclosure, improved frame generation for super sampling may be implemented with an autoencoder-type neural network layout having one or more encoder networks that perform dimensional reduction and output image frame embeddings, and a decoder that predicts a synthetic subsequent frame using the image frame embeddings; the encoder and decoder are collectively referred to as an auto-encoder neural network. The auto-encoder neural network outputs feature-length image embeddings, and the decoder includes a neural network that uses those feature-length image embeddings to generate one or more synthetic next frames. The auto-encoder may also be configured to take motion vector and/or user input information. The motion vector and/or user input information may be used in conjunction with the frame data to generate the image embeddings.
Each neural network used in the auto-encoder may be of any type known in the art, but preferably the neural network is a Convolutional Neural Network (CNN). In an alternative embodiment, the CNN is a Convolutional Recurrent Neural Network (CRNN) of any type.
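By way of example and not by way of limitation, a convolutional encoder-decoder of the general type described above might be sketched as follows, assuming the PyTorch library, a 3-channel 64×64 input frame, a 128-dimensional image frame embedding, and two predicted output frames; none of these sizes is required, and each is an illustrative assumption only.

```python
# A minimal sketch, assuming PyTorch; layer sizes, the 128-length embedding,
# and the two predicted output frames are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Dimensional reduction of an input frame to an image frame embedding."""
    def __init__(self, in_channels=3, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),           # 32 -> 16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 16 -> 8
        )
        self.fc = nn.Linear(64 * 8 * 8, embed_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))       # image frame embedding

class Decoder(nn.Module):
    """Predicts one or more synthetic next frames from the embedding."""
    def __init__(self, embed_dim=128, out_frames=2):
        super().__init__()
        self.out_frames = out_frames
        self.fc = nn.Linear(embed_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(16, 3 * out_frames, 4, stride=2, padding=1), # 32 -> 64
        )

    def forward(self, phi):
        x = self.fc(phi).view(-1, 64, 8, 8)
        out = self.deconv(x)                           # (B, 3*out_frames, 64, 64)
        return out.view(-1, self.out_frames, 3, 64, 64)
```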
As shown in the accompanying drawings, training a neural network (NN) begins with initialization 201 of the weights of the NN (shown as U and V in the figures). Generally, the initial weights should be distributed randomly; for example, the initial weight values may be randomly distributed between −1/√n and 1/√n, where n is the number of inputs to the node.
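For example, and without limitation, such an initialization could be sketched as follows, assuming the PyTorch library; the uniform distribution over [−1/√n, 1/√n] is one common choice rather than a requirement.

```python
# A small sketch, assuming PyTorch: each weight of a fully connected layer is
# drawn uniformly from [-1/sqrt(n), 1/sqrt(n)], where n is the number of
# inputs to the node. This is one common choice, not a requirement.
import math
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        n = module.in_features                 # number of inputs to the node
        bound = 1.0 / math.sqrt(n)
        nn.init.uniform_(module.weight, -bound, bound)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: encoder.apply(init_weights); decoder.apply(init_weights)
```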
After initialization, an image frame from an image stream is provided to an encoder neural network 202. Exemplary image stream types include, without limitation, video streams, time lapse photography, slow motion photography, etc. In some implementations, motion vectors may be provided with the image frame. Motion vector data may be concatenated to the end of the image frames or provided to a separate network which generates a motion embedding. The motion embeddings may be concatenated to the image frame embeddings. Alternatively, two or more images may be provided to the encoder neural network. These images may be concatenated together and used as a single array in the input space of the encoder. The two or more images may include a last image frame generated by the renderer and one or more image frames previous to the last image frame generated by the device.
In some optional implementations, user inputs are also provided to the encoder neural network 208. The user inputs may be concatenated to the image frames in the encoder input space. Alternatively, a separate user input network may generate user input embeddings, and the user input embeddings may be concatenated with the image frame embeddings and used as inputs to the decoder.
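By way of example and not by way of limitation, the concatenation of frames, motion vectors, and user input embeddings described above might be sketched as follows, assuming PyTorch tensors; all shapes, including the 2-channel motion vector field and the 16-length user input embedding, are illustrative assumptions.

```python
# A minimal sketch, assuming PyTorch; all tensor shapes are illustrative.
import torch

batch, height, width = 1, 64, 64
prev_frame = torch.rand(batch, 3, height, width)       # previous real frame
last_frame = torch.rand(batch, 3, height, width)       # last real frame
motion_vectors = torch.rand(batch, 2, height, width)   # per-pixel (dx, dy)

# Two or more images (and optional motion vector data) concatenated into a
# single array in the encoder input space: 3 + 3 + 2 = 8 channels.
encoder_input = torch.cat([prev_frame, last_frame, motion_vectors], dim=1)

# Alternatively, separate networks produce embeddings that are concatenated
# before being provided to the decoder. Placeholder values stand in for the
# outputs of the encoder network and a user input network.
image_embedding = torch.rand(batch, 128)
user_input_embedding = torch.rand(batch, 16)
decoder_input = torch.cat([image_embedding, user_input_embedding], dim=1)
```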
An auto-encoder includes a neural network trained using a method called unsupervised learning. In unsupervised learning, an encoder NN is provided with a decoder NN counterpart, and the encoder and decoder are trained together as a single unit. The basic function of an auto-encoder is to take an input x, which is an element of Rd, and map it to a representation h, which is an element of Rd′; this mapped representation may also be referred to as the image vector. A deterministic function of the type h=ƒθ(x)=σ(Wx+b) with the parameters θ={W, b} is used to create the image vector. A decoder NN is then employed to reconstruct the input from the representative image vector by a reverse of ƒ: y=ƒθ′(h)=σ(W′h+b′) with θ′={W′, b′}. The two parameter sets may be constrained to the form W′=WT, using the same weights for encoding the input and decoding the representation. Each training input xi is mapped to its image vector hi and its reconstruction yi. These parameters are trained by minimizing an appropriate cost function over a training set. A convolutional auto-encoder works similarly to a basic auto-encoder except that the weights are shared across all of the locations of the inputs. Thus, for a mono-channel input (such as a black and white image) x, the representation of the k-th feature map is given by hk=σ(x*Wk+bk), where the bias is broadcast to the whole map. Variable σ represents an activation function, b represents a single bias which is used per latent map, W represents a weight shared across the map, and * is a 2D convolution operator. To reconstruct the input, the formula is
y=σ(Σk∈H hk*W̃k+c)
where there is one bias c per input channel, H identifies the group of latent feature maps, and W̃ identifies the flip operation over both dimensions of the weights. Further information about training and weighting of a convolutional auto-encoder can be found in Masci et al., “Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction,” in ICANN, pages 52-59, 2011.
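A small numerical sketch of the encoding and reconstruction mappings above, assuming NumPy and SciPy, one mono-channel input, two latent feature maps, and a sigmoid activation (all of which are illustrative assumptions), is given below.

```python
# A toy sketch of the convolutional auto-encoder mappings, assuming NumPy and
# SciPy; the input size, number of feature maps, and kernel size are arbitrary.
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((8, 8))                    # mono-channel input image
W = rng.standard_normal((2, 3, 3)) * 0.1  # shared weights for k = 1, 2
b = np.zeros(2)                           # one bias per latent map
c = 0.0                                   # one bias per input channel

# Encoding: h_k = sigma(x * W_k + b_k), where '*' is a 2D convolution.
h = np.stack([sigmoid(convolve2d(x, W[k], mode="same") + b[k]) for k in range(2)])

# Reconstruction: y = sigma(sum_k h_k * W~_k + c), with W~ flipped over both
# dimensions of the weights.
W_flip = W[:, ::-1, ::-1]
y = sigmoid(sum(convolve2d(h[k], W_flip[k], mode="same") for k in range(2)) + c)
```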
As discussed above, the auto-encoder maps an input xi to its corresponding representative image embedding hi 203, and those image embeddings are then provided to a decoder 204. According to aspects of the present disclosure, the encoder and decoder are configured to predict the one or more next frames in the input image stream 205 instead of reconstructing the original input. The output of the decoder according to aspects of the present disclosure is modified from the traditional auto-encoder case by having the network output more channels than were input. By way of example and not by way of limitation, if an RGB 3-channel image of size 100×100 were input, the output would be b×3×100×100, where b is the number of frames corresponding to all the timestamps in the interval {t+w}. That is, given an input image Ft at time t, the encoder and decoder are trained to predict the images in the interval F{t+w}, where w is a prediction interval. In other words, the training ideal would be to minimize the mean squared error between a synthetic subsequent image F′t+w and the actual subsequent image Ft+w in an image stream {F1, F2, F3 . . . FT}, respectively giving the equation:
min Σt∥F′t+w−Ft+w∥2
Thus, according to aspects of the present disclosure, the encoder NN, ε, generates a k-dimensional vector Φ such that Φ=ε(Ft), and the decoder D generates predicted subsequent images F′t+w from the k-dimensional vector Φ, giving:
F′t+w=D(Φ)=D(ε(Ft))
where the decoder is a convolutional network with up-sampling layers that converts the k-dimensional vector Φ to the output image sequence. Training and optimization of this decoder and encoder neural network system takes advantage of the fact that the training set is an image stream and, as such, the previous and subsequent images are readily available. The outputs of the decoder, i.e., the predicted subsequent images, are compared with the corresponding actual subsequent images from the image stream 206. In other words, the predicted image F′t+w is compared to the actual image at time t+w. The difference between the actual image and the predicted image is applied to a loss function such as a mean squared error function, cross entropy loss function, etc. The NN is then optimized and trained, as indicated at 207, using the result of the loss function and using known methods of training for neural networks such as backpropagation with stochastic gradient descent.
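A minimal training-step sketch consistent with the description above, assuming PyTorch, a mean squared error loss, and stochastic gradient descent, is shown below; the function and variable names are illustrative only.

```python
# A minimal sketch, assuming PyTorch; the encoder and decoder may be any
# modules of the type described above, and the names here are illustrative.
import torch

def train_step(encoder, decoder, optimizer, frame_t, actual_next_frames):
    """frame_t: (B, 3, H, W) input image at time t.
    actual_next_frames: (B, w, 3, H, W) actual images in the interval {t+w}."""
    optimizer.zero_grad()
    phi = encoder(frame_t)                 # k-dimensional image embedding
    predicted = decoder(phi)               # predicted subsequent images F't+w
    loss = torch.nn.functional.mse_loss(predicted, actual_next_frames)
    loss.backward()                        # backpropagation
    optimizer.step()                       # stochastic gradient descent update
    return loss.item()

# Usage sketch:
# params = list(encoder.parameters()) + list(decoder.parameters())
# optimizer = torch.optim.SGD(params, lr=1e-3)
```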
After many rounds of training, the output of the encoder and decoder neural networks correctly predicts subsequent images in an image stream and the loss function has stabilized.
In some embodiments, the predictions are not limited to an immediate next image in the image stream. The decoder system may be trained to predict any number of images before or after the input image. In some alternative embodiments, the input to the encoder and decoder NN may be optimized by selecting certain image points for the NNs to work on or by encoding the video using conventional video encoding methods such as MPEG-2 or H.264 encoding. According to additional aspects of the present disclosure, additional tuning of the encoder and decoder system may be carried out by hand manipulation of the number of nodes in a layer or by changing the activation function. Tuning of a convolutional NN may also be performed by changing the weights of certain masks, the sizes of the masks, and the number of channels. In fully connected networks, the number of hidden units could be changed to tune the network.
As discussed above, the encoder 301 generates a representative image embedding 302, which is provided to the decoder 303. The decoder 303 then constructs a synthetic next image Ft+1 305 from the representative image embedding 302, and in some implementations the decoder may also predict a synthetic second next image Ft+2 304. The predictions of a synthetic next image 305 and (optionally) a synthetic second next image 304 are checked through comparison 310 with the actual next image 307 and (optionally) the original second next image 309, respectively. As shown, the predicted subsequent image 305 differs from the actual subsequent image 307 because the face is sticking out its tongue in the actual image but not in the predicted image; therefore, more training would be needed to reach a correct result. The result of the comparison is then used for training and optimization of the encoder-decoder system as described above. The encoder-decoder system is considered to be fully trained when the loss function does not change appreciably with variation in its parameters.
Once trained, the encoder-decoder pair may be used to output a synthetic image frame that may be displayed after a last image frame generated by the system in an image stream. For example, and without limitation, the system may generate a previous real image frame and a first real image frame and then implement the encoder-decoder pair to generate a synthetic second image and (optionally) a synthetic third image. The synthetic second image may be displayed after the first real image and before generation of a second real image by the system. Here, generation of the real images includes the use of one or more stages of the traditional graphics processing pipeline, such as primitive generation, input assembly, vertex shading, hull shading, tessellation, domain shading, geometry shading, rasterization, and pixel shading. By contrast, in aspects of the present disclosure, synthetic images are generated by machine learning or algorithmically, without the use of stages of the traditional graphics processing pipeline.
Finally, the last user input to an input device 404 may (optionally) be provided to the encoder NN 405. As discussed above, the last user input to the input device may be made contemporaneously with the generation of the last input image frame or shortly before the last image frame is generated. The encoder NN 405 is trained with a machine learning algorithm as discussed above.
Algorithmic frame generation uses an algorithm instead of machine learning and neural networks to generate a synthetic frame. According to aspects of the present disclosure, algorithmic frame generation may use either motion vectors or a real previous frame together with the real last frame. Using the real previous frame, the system may identify blocks or macroblocks within the image that have not changed between the real previous frame and the real last frame. The unchanged macroblocks may be passed to the synthetic image without change.
Next, the system may identify motion between the real previous frame and the real last frame by searching the real last frame for blocks, macroblocks, or sub-macroblocks that match the real previous frame; this is sometimes referred to as motion search. The magnitude and direction of the shift in the matching block, macroblock, or sub-macroblock may then be used to predict the location of the matching block, macroblock, or sub-macroblock in the synthetic next image frame, based on the assumption that movement will generally continue in the direction and magnitude of the motion vector in the short time between frames. In some implementations the motion vectors may already have been computed, in which case no additional motion search is required.
Once the location of the matching block, macroblock, or sub-macroblock has been determined, pixel values in the determined location may be overwritten with values of the matching block, macroblock, or sub-macroblock. In areas where the matching block, macroblock, or sub-macroblock leaves a space, values may be copied from the real previous image if the matching block, macroblock, or sub-macroblock was not in that area, or the values may be copied from the nearest neighboring pixels.
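A greatly simplified sketch of this algorithmic frame generation, assuming NumPy, single-channel (grayscale) frames, a fixed block size, and a small exhaustive motion search, is given below; the block size and search range are illustrative assumptions, and uncovered areas here simply default to the last real frame rather than the nearest-neighbor filling described above.

```python
# A simplified sketch, assuming NumPy and grayscale frames; block size and
# search range are illustrative assumptions.
import numpy as np

def generate_synthetic_frame(prev_frame, last_frame, block=8, search=4):
    h, w = last_frame.shape
    last = last_frame.astype(np.float32)
    prev = prev_frame.astype(np.float32)
    synthetic = last.copy()              # unchanged areas pass through as-is
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            target = last[y:y + block, x:x + block]
            best_dy, best_dx, best_err = 0, 0, np.inf
            # Motion search: find where this block was in the previous frame.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = y + dy, x + dx
                    if 0 <= py <= h - block and 0 <= px <= w - block:
                        err = np.sum((prev[py:py + block, px:px + block] - target) ** 2)
                        if err < best_err:
                            best_err, best_dy, best_dx = err, dy, dx
            # The block moved by (-best_dy, -best_dx) from previous to last;
            # assume the motion continues for the synthetic next frame.
            ny, nx = y - best_dy, x - best_dx
            if 0 <= ny <= h - block and 0 <= nx <= w - block:
                synthetic[ny:ny + block, nx:nx + block] = target
    return synthetic
```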
The computing device 600 may include one or more processor units 603, which may be configured according to well-known architectures, such as, e.g., single-core, dual-core, quad-core, multi-core, processor-coprocessor, cell processor, and the like. The computing device may also include one or more memory units 604 (e.g., random access memory (RAM), dynamic random-access memory (DRAM), read-only memory (ROM), and the like).
The processor unit 603 may execute one or more programs, portions of which may be stored in the memory 604, and the processor 603 may be operatively coupled to the memory, e.g., by accessing the memory via a data bus 605. The programs may be configured to implement training of an Encoder 608 and a Decoder 622. The Memory 604 may also contain software modules such as an Encoder Module 608, a Decoder Module 622, and/or an Algorithmic Frame Generation Module 609. The Algorithmic Frame Generation Module may generate frames using motion vectors, which may be generated from the image stream 621 as discussed above. The overall structure and probabilities of the NNs may also be stored as data 618 in the Mass Store 615. The processor unit 603 is further configured to execute one or more programs 617 stored in the mass store 615 or in memory 604 which cause the processor to carry out the method 200 of training the Encoder 608 and Decoder 622 from an image stream 621 and/or the training of an image recognition NN from image embeddings 610. The system may generate Neural Networks as part of the NN training process. These Neural Networks may be stored in memory 604 as part of the Encoder Module 608 and Decoder Module 622. Completed NNs may be stored in memory 604 or as data 618 in the mass store 615. The programs 617 (or portions thereof) may also be configured, e.g., by appropriate programming, to encode un-encoded video or manipulate one or more images in an image stream stored in a buffer in the memory 604.
The computing device 600 may also include well-known support circuits, such as input/output (I/O) circuits 607, power supplies (P/S) 611, a clock (CLK) 612, and cache 613, which may communicate with other components of the system, e.g., via the data bus 605. The computing device may include a network interface 614. The processor unit 603 and network interface 614 may be configured to implement a local area network (LAN) or personal area network (PAN), via a suitable network protocol, e.g., Bluetooth, for a PAN. The computing device may optionally include a mass storage device 615 such as a disk drive, CD-ROM drive, tape drive, flash memory, or the like, and the mass storage device may store programs and/or data. The computing device may also include a user interface 616 to facilitate interaction between the system and a user. The user interface may include a display device, e.g., a monitor, flat screen, or other audio-visual device.
The network interface 614 may be configured to facilitate communication via an electronic communications network 620 and to implement wired or wireless communication over local area networks and wide area networks such as the Internet. The device 600 may send and receive data and/or requests for files via one or more message packets over the network 620. Message packets sent over the network 620 may temporarily be stored in a buffer in memory 604.
While the above is a complete description of the preferred embodiment of the present invention, it is possible to use various alternatives, modifications, and equivalents. Therefore, the scope of the present invention should be determined not with reference to the above description but should, instead, be determined with reference to the appended claims, along with their full scope of equivalents. Any feature described herein, whether preferred or not, may be combined with any other feature described herein, whether preferred or not. In the claims that follow, the indefinite article “A”, or “An” refers to a quantity of one or more of the item following the article, except where expressly stated otherwise. The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase “means for.”