Aspects of the present disclosure relate to super sampling; more specifically, aspects of the present disclosure relate to frame generation super sampling.
Computer graphics systems can generate images from geometric data. Computer graphics systems are commonly used in graphics-intensive applications such as video games. Recently, artificial intelligence (AI) has been applied to real-time rendering of graphics to construct sharp, higher resolution images. For example, Nvidia's Deep Learning Super Sampling (DLSS) 3 implementation inserts a synthetic frame generated by a machine learning model between the last GPU-rendered frame and the frame before it prior to display. This adds considerable latency because the device must render both the last frame and the prior frame, and it still results in a variable frame rate even though it produces higher frame rates. Changes in the variable frame rate are very noticeable to users and are sometimes referred to as “hitching.” These changes in frame rate degrade the user experience, as the user may experience frame rate slow-down in processing-intensive application states while experiencing higher frame rates during less processing-intensive tasks.
It is within this context that aspects of the present disclosure arise.
The teachings of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
Although the following detailed description contains many specific details for the purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to the following details are within the scope of the disclosure. Accordingly, examples of embodiments of the disclosure described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed disclosure.
A computer graphics system typically renders a frame as fast as computing power and software allow. With a variable frame rate, the monitor and the computer can get out of sync, causing a visual artifact known as “tearing.” In monitors with V-sync or a similar capability connected to a computer, the monitor can tell the computer when it will render a frame or vice versa. Alternatively, the computer can render two real frames and artificially render an intermediate frame between them, e.g., using motion vectors. This is called variable render rate (VRR) rendering. VRR still has an issue with application latency, as the render rate is limited by the creation speed of the last real frame. VRR does avoid screen tearing, but the result is not as good as rendering with a fixed frame rate, and render rate changes are quite noticeable to the user. To reach a fixed frame rate, systems are configured to reduce the overall rendering quality until the target fixed rate can be reached for both more processing-intensive frames and less processing-intensive frames.
According to aspects of the present disclosure, a computer graphics system may operate at a real fixed frame rate and generate one or more synthetic frames after each real frame or after each two or more real frames according to a synthetic frame insertion interval. Synthetic frames may be generated from the prior two real frames using algorithmic frame generation or neural network models trained with a machine learning algorithm to predict synthetic frames. A subsequent frame may be generated after the synthetic frame and then displayed. In some alternative implementations the synthetic frame may be created using a last frame and motion vectors generated from images in an image stream. This improved frame generation may be unconstrained by the generation speed of the last real frame as the synthetic frames are generated after the last real frame and before creation of a subsequent real frame.
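By way of illustration and not by way of limitation, the fixed-rate presentation scheme described above may be sketched in Python roughly as follows. The callables render_real_frame, generate_synthetic_frames, and present are hypothetical placeholders supplied by the application; they are not part of any particular API.

```python
# A minimal sketch of a fixed frame rate loop with synthetic frame insertion.
# The callables passed in are hypothetical placeholders, not an actual API.
def run_fixed_rate_loop(render_real_frame, generate_synthetic_frames, present,
                        num_frames, insertion_interval=1):
    """Present real frames at a fixed rate and, after every
    `insertion_interval` real frames, display one or more synthetic frames
    generated from the prior two real frames before the next real frame."""
    previous_frame = None
    for i in range(num_frames):
        frame = render_real_frame()      # real frame from the graphics pipeline
        present(frame)                   # display the real frame
        if previous_frame is not None and (i + 1) % insertion_interval == 0:
            # Synthetic frames are generated after the last real frame and
            # before creation of the subsequent real frame.
            for synthetic in generate_synthetic_frames(previous_frame, frame):
                present(synthetic)
        previous_frame = frame
```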
According to aspects of the present disclosure, improved frame generation for super sampling may be implemented with an autoencoder-type neural network layout having one or more encoder networks that perform dimensional reduction and output image frame embeddings, and a decoder that predicts a synthetic subsequent frame using the image frame embeddings; the encoder and decoder are collectively referred to as an auto-encoder neural network. The auto-encoder neural network outputs feature-length image embeddings, and the decoder includes a neural network that uses those feature-length image embeddings to generate one or more synthetic next frames. The auto-encoder may also be configured to take motion vector and/or user input information. The motion vector and/or user input information may be used in conjunction with the frame data to generate the image embeddings.
Each neural network used in the auto-encoder may be of any type known in the art, but preferably the neural network is a Convolutional Neural Network (CNN). In an alternative embodiment, the CNN is a Convolutional Recurrent Neural Network (CRNN) of any type.
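By way of example and not by way of limitation, a convolutional encoder-decoder of the general type described above might be sketched as follows, assuming the PyTorch library, a 3-channel 64×64 input frame, a 128-dimensional image frame embedding, and two predicted output frames; none of these sizes is required, and each is an illustrative assumption only.

```python
# A minimal sketch, assuming PyTorch; layer sizes, the 128-length embedding,
# and the two predicted output frames are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Dimensional reduction of an input frame to an image frame embedding."""
    def __init__(self, in_channels=3, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),           # 32 -> 16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 16 -> 8
        )
        self.fc = nn.Linear(64 * 8 * 8, embed_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))       # image frame embedding

class Decoder(nn.Module):
    """Predicts one or more synthetic next frames from the embedding."""
    def __init__(self, embed_dim=128, out_frames=2):
        super().__init__()
        self.out_frames = out_frames
        self.fc = nn.Linear(embed_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(16, 3 * out_frames, 4, stride=2, padding=1), # 32 -> 64
        )

    def forward(self, phi):
        x = self.fc(phi).view(-1, 64, 8, 8)
        out = self.deconv(x)                           # (B, 3*out_frames, 64, 64)
        return out.view(-1, self.out_frames, 3, 64, 64)
```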
As shown in the accompanying drawings, training a neural network (NN) begins with initialization 201 of the weights of the NN (shown as U and V in the figures). Generally, the initial weights should be distributed randomly; for example, the initial weight values may be randomly distributed between −1/√n and 1/√n, where n is the number of inputs to the node.
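For example, and without limitation, such an initialization could be sketched as follows, assuming the PyTorch library; the uniform distribution over [−1/√n, 1/√n] is one common choice rather than a requirement.

```python
# A small sketch, assuming PyTorch: each weight of a fully connected layer is
# drawn uniformly from [-1/sqrt(n), 1/sqrt(n)], where n is the number of
# inputs to the node. This is one common choice, not a requirement.
import math
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        n = module.in_features                 # number of inputs to the node
        bound = 1.0 / math.sqrt(n)
        nn.init.uniform_(module.weight, -bound, bound)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: encoder.apply(init_weights); decoder.apply(init_weights)
```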
After initialization, an image frame from an image stream is provided to an encoder neural network 202. Exemplary image stream types include, without limitation, video streams, time lapse photography, slow motion photography, etc. In some implementations, motion vectors may be provided with the image frame. Motion vector data may be concatenated to the end of the image frames or provided to a separate network which generates a motion embedding. The motion embeddings may be concatenated to the image frame embeddings. Alternatively, two or more images may be provided to the encoder neural network. These images may be concatenated together and used as a single array in the input space of the encoder. The two or more images may include a last image frame generated by the renderer and one or more image frames previous to the last image frame generated by the device.
In some optional implementations, user inputs are also provided to the encoder neural network 208. The user inputs may be concatenated to the image frames in the encoder input space. Alternatively, a separate user input network may generate user input embeddings, and the user input embeddings may be concatenated with the image frame embeddings and used as inputs to the decoder.
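By way of example and not by way of limitation, the concatenation of frames, motion vectors, and user input embeddings described above might be sketched as follows, assuming PyTorch tensors; all shapes, including the 2-channel motion vector field and the 16-length user input embedding, are illustrative assumptions.

```python
# A minimal sketch, assuming PyTorch; all tensor shapes are illustrative.
import torch

batch, height, width = 1, 64, 64
prev_frame = torch.rand(batch, 3, height, width)       # previous real frame
last_frame = torch.rand(batch, 3, height, width)       # last real frame
motion_vectors = torch.rand(batch, 2, height, width)   # per-pixel (dx, dy)

# Two or more images (and optional motion vector data) concatenated into a
# single array in the encoder input space: 3 + 3 + 2 = 8 channels.
encoder_input = torch.cat([prev_frame, last_frame, motion_vectors], dim=1)

# Alternatively, separate networks produce embeddings that are concatenated
# before being provided to the decoder. Placeholder values stand in for the
# outputs of the encoder network and a user input network.
image_embedding = torch.rand(batch, 128)
user_input_embedding = torch.rand(batch, 16)
decoder_input = torch.cat([image_embedding, user_input_embedding], dim=1)
```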
An auto-encoder includes a neural network trained using a method called unsupervised learning. In unsupervised learning, an encoder NN is provided with a decoder NN counterpart, and the encoder and decoder are trained together as a single unit. The basic function of an auto-encoder is to take an input x, which is an element of Rd, and map it to a representation h, which is an element of Rd′; this mapped representation may also be referred to as the image vector. A deterministic function of the type h=ƒθ(x)=σ(Wx+b) with the parameters θ={W, b} is used to create the image vector. A decoder NN is then employed to reconstruct the input from the representative image vector by a reverse of ƒ: y=ƒθ′(h)=σ(W′h+b′) with θ′={W′, b′}. The two parameter sets may be constrained to the form W′=WT, using the same weights for encoding the input and decoding the representation. Each training input xi is mapped to its image vector hi and its reconstruction yi. These parameters are trained by minimizing an appropriate cost function over a training set. A convolutional auto-encoder works similarly to a basic auto-encoder except that the weights are shared across all of the locations of the inputs. Thus, for a mono-channel input (such as a black and white image) x, the representation of the k-th feature map is given by hk=σ(x*Wk+bk), where the bias is broadcast to the whole map. Variable σ represents an activation function, b represents a single bias which is used per latent map, W represents a weight shared across the map, and * is a 2D convolution operator. To reconstruct the input, the formula is
y=σ(Σk∈H hk*W̃k+c)
where there is one bias c per input channel, H identifies the group of latent feature maps, and W̃ identifies the flip operation over both dimensions of the weights. Further information about training and weighting of a convolutional auto-encoder can be found in Masci et al., “Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction,” in ICANN, pages 52-59, 2011.
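A small numerical sketch of the encoding and reconstruction mappings above, assuming NumPy and SciPy, one mono-channel input, two latent feature maps, and a sigmoid activation (all of which are illustrative assumptions), is given below.

```python
# A toy sketch of the convolutional auto-encoder mappings, assuming NumPy and
# SciPy; the input size, number of feature maps, and kernel size are arbitrary.
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((8, 8))                    # mono-channel input image
W = rng.standard_normal((2, 3, 3)) * 0.1  # shared weights for k = 1, 2
b = np.zeros(2)                           # one bias per latent map
c = 0.0                                   # one bias per input channel

# Encoding: h_k = sigma(x * W_k + b_k), where '*' is a 2D convolution.
h = np.stack([sigmoid(convolve2d(x, W[k], mode="same") + b[k]) for k in range(2)])

# Reconstruction: y = sigma(sum_k h_k * W~_k + c), with W~ flipped over both
# dimensions of the weights.
W_flip = W[:, ::-1, ::-1]
y = sigmoid(sum(convolve2d(h[k], W_flip[k], mode="same") for k in range(2)) + c)
```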
As discussed above, the auto-encoder maps an input xi to its corresponding representative image embedding hi 203, and those image embeddings are then provided to a decoder 204. According to aspects of the present disclosure, the encoder and decoder are configured to predict the one or more next frames in the input image stream 205 instead of reconstructing the original input. The output of the decoder according to aspects of the present disclosure is modified from the traditional auto-encoder case by having the network output more channels than were input. By way of example and not by way of limitation, if an RGB 3-channel image of size 100×100 were input, the output would be b×3×100×100, where b is the number of frames corresponding to all the timestamps in the interval {t+w}. That is, given an input image Ft at time t, the encoder and decoder are trained to predict the images in the interval F{t+w}, where w is a prediction interval. In other words, the training ideal would be to minimize the mean squared error between a synthetic subsequent image F′t+w and the actual subsequent image Ft+w in an image stream {F1, F2, F3 . . . FT}, respectively giving the equation:
min Σt∥F′t+w−Ft+w∥2
Thus, according to aspects of the present disclosure, the encoder NN, ε, generates a k-dimensional vector Φ such that Φ=ε(Ft), and the decoder D generates predicted subsequent images F′t+w from the k-dimensional vector Φ, giving:
F′t+w=D(Φ)=D(ε(Ft))
where the decoder is a convolutional network with up-sampling layers that converts the k-dimensional vector Φ to the output image sequence. Training and optimization of this decoder and encoder neural network system takes advantage of the fact that the training set is an image stream and, as such, the previous and subsequent images are readily available. The outputs of the decoder, i.e., the predicted subsequent images, are compared with the corresponding actual subsequent images from the image stream 206. In other words, the predicted image F′t+w is compared to the actual image at time t+w. The difference between the actual image and the predicted image is applied to a loss function such as a mean squared error function, cross entropy loss function, etc. The NN is then optimized and trained, as indicated at 207, using the result of the loss function and using known methods of training for neural networks such as backpropagation with stochastic gradient descent.
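A minimal training-step sketch consistent with the description above, assuming PyTorch, a mean squared error loss, and stochastic gradient descent, is shown below; the function and variable names are illustrative only.

```python
# A minimal sketch, assuming PyTorch; the encoder and decoder may be any
# modules of the type described above, and the names here are illustrative.
import torch

def train_step(encoder, decoder, optimizer, frame_t, actual_next_frames):
    """frame_t: (B, 3, H, W) input image at time t.
    actual_next_frames: (B, w, 3, H, W) actual images in the interval {t+w}."""
    optimizer.zero_grad()
    phi = encoder(frame_t)                 # k-dimensional image embedding
    predicted = decoder(phi)               # predicted subsequent images F't+w
    loss = torch.nn.functional.mse_loss(predicted, actual_next_frames)
    loss.backward()                        # backpropagation
    optimizer.step()                       # stochastic gradient descent update
    return loss.item()

# Usage sketch:
# params = list(encoder.parameters()) + list(decoder.parameters())
# optimizer = torch.optim.SGD(params, lr=1e-3)
```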
After many rounds of training, the output of the encoder and decoder neural networks correctly predicts subsequent images in an image stream and the loss function has stabilized.
In some embodiments, the predictions are not limited to an immediate next image in the image stream. The decoder system may be trained to predict any number of images before or after the input image. In some alternative embodiments, the input to the encoder and decoder NN may be optimized by selecting certain image points for the NNs to work on or by encoding the video using conventional video encoding methods such as MPEG-2 or H.264 encoding. According to additional aspects of the present disclosure, additional tuning of the encoder and decoder system may be carried out by hand manipulation of the number of nodes in a layer or by changing the activation function. Tuning of a convolutional NN may also be performed by changing the weights of certain masks, the sizes of the masks, and the number of channels. In fully connected networks, the number of hidden units could be changed to tune the network.
As discussed above, the encoder 301 generates a representative image embedding 302, which is provided to the decoder 303. The decoder 303 then constructs a synthetic next image Ft+1 305 from the representative image embedding 302, and in some implementations the decoder may also predict a synthetic second next image Ft+2 304. The predictions of a synthetic next image 305 and (optionally) a synthetic second next image 304 are checked through comparison 310 with the actual next image 307 and (optionally) the original second next image 309, respectively. As shown, the predicted subsequent image 305 differs from the actual subsequent image 307 because the face is sticking out its tongue in the actual image but not in the predicted image; therefore, more training would be needed to reach a correct result. The result of the comparison is then used for training and optimization of the encoder-decoder system as described above. The encoder-decoder system is considered to be fully trained when the loss function does not change appreciably with variation in its parameters.
Once trained, the encoder-decoder pair may be used to output a synthetic image frame that may be displayed after a last image frame generated by the system in an image stream. For example, and without limitation, the system may generate a previous real image frame and a first real image frame and then implement the encoder-decoder pair to generate a synthetic second image and (optionally) a synthetic third image. The synthetic second image may be displayed after the first real image and before generation of a second real image by the system. Here, generation of the real images includes the use of one or more stages of the traditional graphics processing pipeline, such as primitive generation, input assembly, vertex shading, hull shading, tessellation, domain shading, geometry shading, rasterization, and pixel shading. By contrast, in aspects of the present disclosure, synthetic images are generated by machine learning or algorithmically, without the use of stages of the traditional graphics processing pipeline.
Finally, the last user input to an input device 404 may (optionally) be provided to the encoder NN 405. As discussed above, the last user input to the input device may be made contemporaneously with the generation of the last input image frame or shortly before the last image frame is generated. The encoder NN 405 is trained with a machine learning algorithm as discussed above.
Algorithmic frame generation uses an algorithm instead of machine learning and neural networks to generate a synthetic frame. According to aspects of the present disclosure, algorithmic frame generation may use either motion vectors or a real previous frame together with the real last frame. Using the real previous frame, the system may identify blocks or macroblocks within the image that have not changed between the real previous frame and the real last frame. The unchanged macroblocks may be passed to the synthetic image without change.
Next, the system may identify motion between the real previous frame and the real last frame by searching the real last frame for blocks, macroblocks, or sub-macroblocks that match the real previous frame; this is sometimes referred to as motion search. The magnitude and direction of the shift in the matching block, macroblock, or sub-macroblock may then be used to predict the location of the matching block, macroblock, or sub-macroblock in the synthetic next image frame, based on the assumption that movement will generally continue in the direction and magnitude of the motion vector in the short time between frames. In some implementations the motion vectors may already have been computed, in which case no additional motion search is required.
Once the location of the matching block, macroblock, or sub-macroblock has been determined, pixel values in the determined location may be overwritten with values of the matching block, macroblock, or sub-macroblock. In areas where the matching block, macroblock, or sub-macroblock leaves a space, values may be copied from the real previous image if the matching block, macroblock, or sub-macroblock was not in that area, or the values may be copied from the nearest neighboring pixels.
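A greatly simplified sketch of this algorithmic frame generation, assuming NumPy, single-channel (grayscale) frames, a fixed block size, and a small exhaustive motion search, is given below; the block size and search range are illustrative assumptions, and uncovered areas here simply default to the last real frame rather than the nearest-neighbor filling described above.

```python
# A simplified sketch, assuming NumPy and grayscale frames; block size and
# search range are illustrative assumptions.
import numpy as np

def generate_synthetic_frame(prev_frame, last_frame, block=8, search=4):
    h, w = last_frame.shape
    last = last_frame.astype(np.float32)
    prev = prev_frame.astype(np.float32)
    synthetic = last.copy()              # unchanged areas pass through as-is
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            target = last[y:y + block, x:x + block]
            best_dy, best_dx, best_err = 0, 0, np.inf
            # Motion search: find where this block was in the previous frame.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = y + dy, x + dx
                    if 0 <= py <= h - block and 0 <= px <= w - block:
                        err = np.sum((prev[py:py + block, px:px + block] - target) ** 2)
                        if err < best_err:
                            best_err, best_dy, best_dx = err, dy, dx
            # The block moved by (-best_dy, -best_dx) from previous to last;
            # assume the motion continues for the synthetic next frame.
            ny, nx = y - best_dy, x - best_dx
            if 0 <= ny <= h - block and 0 <= nx <= w - block:
                synthetic[ny:ny + block, nx:nx + block] = target
    return synthetic
```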
The computing device 600 may include one or more processor units 603, which may be configured according to well-known architectures, such as, e.g., single-core, dual-core, quad-core, multi-core, processor-coprocessor, cell processor, and the like. The computing device may also include one or more memory units 604 (e.g., random access memory (RAM), dynamic random-access memory (DRAM), read-only memory (ROM), and the like).
The processor unit 603 may execute one or more programs, portions of which may be stored in the memory 604, and the processor 603 may be operatively coupled to the memory, e.g., by accessing the memory via a data bus 605. The programs may be configured to implement training of an Encoder 608 and a Decoder 622. The Memory 604 may also contain software modules such as an Encoder Module 608, a Decoder Module 622, and/or an Algorithmic Frame Generation Module 609. The Algorithmic Frame Generation Module may generate frames using motion vectors, which may be generated from the image stream 621 as discussed above. The overall structure and probabilities of the NNs may also be stored as data 618 in the Mass Store 615. The processor unit 603 is further configured to execute one or more programs 617 stored in the mass store 615 or in memory 604 which cause the processor to carry out the method 200 of training the Encoder 608 and Decoder 622 from an image stream 621 and/or the training of an image recognition NN from image embeddings 610. The system may generate Neural Networks as part of the NN training process. These Neural Networks may be stored in memory 604 as part of the Encoder Module 608 and Decoder Module 622. Completed NNs may be stored in memory 604 or as data 618 in the mass store 615. The programs 617 (or portions thereof) may also be configured, e.g., by appropriate programming, to encode un-encoded video or manipulate one or more images in an image stream stored in a buffer in the memory 604.
The computing device 600 may also include well-known support circuits, such as input/output (I/O) circuits 607, power supplies (P/S) 611, a clock (CLK) 612, and cache 613, which may communicate with other components of the system, e.g., via the data bus 605. The computing device may include a network interface 614. The processor unit 603 and network interface 614 may be configured to implement a local area network (LAN) or personal area network (PAN), via a suitable network protocol, e.g., Bluetooth, for a PAN. The computing device may optionally include a mass storage device 615 such as a disk drive, CD-ROM drive, tape drive, flash memory, or the like, and the mass storage device may store programs and/or data. The computing device may also include a user interface 616 to facilitate interaction between the system and a user. The user interface may include a display device, e.g., a monitor, flat screen, or other audio-visual device.
The network interface 614 may be configured to facilitate communication via an electronic communications network 620 and to implement wired or wireless communication over local area networks and wide area networks such as the Internet. The device 600 may send and receive data and/or requests for files via one or more message packets over the network 620. Message packets sent over the network 620 may temporarily be stored in a buffer in memory 604.
While the above is a complete description of the preferred embodiment of the present invention, it is possible to use various alternatives, modifications, and equivalents. Therefore, the scope of the present invention should be determined not with reference to the above description but should, instead, be determined with reference to the appended claims, along with their full scope of equivalents. Any feature described herein, whether preferred or not, may be combined with any other feature described herein, whether preferred or not. In the claims that follow, the indefinite article “A”, or “An” refers to a quantity of one or more of the item following the article, except where expressly stated otherwise. The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase “means for.”