This application is related to an on-device image signal processor (ISP) pipeline using a deep learning model to perform image processing and restoration.
Some camera ISP pipelines integrate deep learning models to address various image processing and restoration tasks, such as image demosaicing (raw-to-RGB processing) and denoising. Using deep learning architectures with high-resolution cameras presents challenges, because such architectures incur high latency and memory consumption when processing high-resolution images.
In one embodiment, a U-shaped lightweight transformer-based architecture that reduces the inter-block and intra-block system complexity is used to perform image restoration (e.g., demosaicing and denoising) on-device. In some embodiments, transformer blocks of the architecture do not include a self-attention computation.
The proposed model works efficiently with high-resolution cameras, minimizing latency and memory usage while maintaining the quality of the output images. The architecture enables the model to learn effectively from image data, producing high-quality results with faster, more efficient image processing. In contrast, current state-of-the-art image restoration and processing models suffer from high intra-block complexity (within a transformer block) and high inter-block system complexity (across a U-shaped network).
A transformer is one building block used in image restoration architectures. Intra-block complexity refers to the complexity of the transformer block. A conventional transformer block uses self-attention. Self-attention can be regarded as an input/token mixer, with quadratic computational complexity with respect to the input size. A self-attention block is often preceded or followed by downsampling or upsampling convolutional layers, which requires expensive reshaping of inputs and outputs between the 4D shape (batch size B, channels C, height H, width W) used by convolutions and the 3D shape (B, H*W, C) used by self-attention. An optimized on-device implementation for self-attention may not exist, while optimized on-device implementations of convolutional layers do exist. Transformer blocks of the present application do not include the self-attention function.
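For context only, a PyTorch-style sketch of the reshaping that conventional self-attention requires follows; the shapes are small and illustrative, and this sketch is not part of the disclosed architecture:

```python
import torch
import torch.nn as nn

# Context only: the reshaping a conventional self-attention layer needs when it
# operates on a 4D convolutional feature map. Shapes here are small; at camera
# resolutions H*W grows into the millions and the (H*W) x (H*W) attention
# matrix becomes impractical.
B, C, H, W = 1, 64, 64, 64
x = torch.randn(B, C, H, W)

attention = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

tokens = x.flatten(2).transpose(1, 2)        # 4D (B, C, H, W) -> 3D (B, H*W, C)
y, _ = attention(tokens, tokens, tokens)     # cost grows quadratically with H*W
y = y.transpose(1, 2).reshape(B, C, H, W)    # 3D -> 4D for the next convolution
print(y.shape)                               # torch.Size([1, 64, 64, 64])
```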
A U-shaped architecture is useful in image processing and restoration models. The U-shaped architecture generally has high inter-block system complexity due to high runtime memory usage partially caused by the concatenation of skip connections between encoder and decoder blocks. Also, the latency increases as the depth (number of encoder/decoder blocks in each stage) increases.
In some embodiments, a smartphone comprises a camera app; a user interface; and custom hardware and/or a CPU. The custom hardware and/or CPU is configured to receive an input image from the camera app, and operate on the input image using a U-shaped hierarchical network to produce an output image. An encoder of the U-shaped hierarchical network contributes to a decoder of the U-shaped hierarchical network using a skip connection based on element-wise addition. Each encoder and decoder block includes a pooling input mixer followed by a first channel scaler, and a multi-layer perceptron followed by a second channel scaler. The custom hardware and/or the CPU is also configured to transmit the output image to the camera app for display on the user interface.
Provided herein is an apparatus for image restoration, the apparatus including: one or more memories storing instructions; and one or more processors configured to execute the instructions to at least: operate on an input image using a U-shaped network to produce an output image, wherein an encoder of the U-shaped network contributes to a decoder of the U-shaped network using a skip connection based on element-wise addition, the encoder is a first instance of a transformer block, the decoder is a second instance of the transformer block, and the transformer block includes: a pooling input mixer followed by a first channel scaler, and a multi-layer perceptron followed by a second channel scaler.
Also provided herein is a smartphone including: a camera app; a user interface; and an application specific integrated circuit, wherein the application specific integrated circuit is configured to at least: receive an input image from the camera app, operate on the input image using a U-shaped network to produce an output image, wherein an encoder of the U-shaped network contributes to a decoder of the U-shaped network using a skip connection based on element-wise addition, wherein the encoder is a first instance of a transformer, the decoder is a second instance of the transformer and the transformer includes a pooling input mixer followed by a first channel scaler, and a multi-layer perceptron followed by a second channel scaler, and transmit the output image to the camera app for display on the user interface.
Also provided is a method using a U-shaped network, the method including: projecting an input image to obtain a projected input image; processing the projected input image through a plurality of encoders and a plurality of decoders to obtain a last decoder output, wherein the last decoder output is an output of a last decoder; and projecting the last decoder output to obtain an output image, wherein the plurality of encoders includes a first encoder, the plurality of decoders includes the last decoder, the method further includes: processing, by the first encoder, the projected input image in part using a pooling input mixer, and feeding, by the first encoder, an intermediate feature vector to an input of the last decoder using a skip connection, wherein the skip connection is based on element-wise addition using a summation node associated with the skip connection.
Also provided is a second smartphone including: a camera app; a user interface; and the apparatus for image restoration; wherein the instructions are further configured to cause the one or more processors to: receive the input image from the camera app; and transmit the output image to the camera app for display on the user interface.
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
Embodiments provide an efficient transformer U-shaped network architecture. The efficiency is achieved at both the transformer block level and at the system level.
The efficient transformer blocks are used as encoders and decoders in the U-shaped network. The architecture is hierarchical. Feature fusion at the skip connections combines feature maps between the encoders and decoders using element-wise addition; element-wise addition, rather than concatenation, avoids the dimensionality expansion of concatenation used in other ISP architectures and reduces runtime memory requirements. The U-shaped network of embodiments tends to increase the width (number of feature channels) and decrease the depth, which reduces latency while maintaining good accuracy.
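For illustration only, a short PyTorch-style sketch contrasting element-wise addition with concatenation at a skip connection follows; the shapes are illustrative assumptions:

```python
import torch

# Feature fusion at a skip connection: element-wise addition keeps the channel
# count unchanged, whereas concatenation doubles it (and the memory it occupies).
enc_feat = torch.randn(1, 32, 256, 256)   # encoder output carried by the skip connection
dec_feat = torch.randn(1, 32, 256, 256)   # upsampled decoder feature at the same level

fused_add = enc_feat + dec_feat                      # shape stays (1, 32, 256, 256)
fused_cat = torch.cat([enc_feat, dec_feat], dim=1)   # shape becomes (1, 64, 256, 256)

print(fused_add.shape, fused_cat.shape)
```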
The transformer block (the foundation of the encoder blocks, the decoder blocks, and the bottleneck block) uses components in an innovative way to reduce memory, reduce latency, and maintain the image-quality results of the U-shaped network. The components include batch normalization, a pooling input mixer, a channel scaler, and a multi-layer perceptron (MLP).
The batch normalization, during inference, uses moving-average statistics calculated from batches during the training process. The batch normalization acts as a linear transformation which can be folded into other preceding linear transformation layers such as convolutional layers. This structural optimization is known as batch normalization folding. The batch normalization is latency-favorable on GPU hardware compared to a layer normalization coupled to a self-attention.
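For illustration only, a PyTorch-style sketch of batch normalization folding into a preceding convolutional layer follows; it assumes groups=1 and is not taken verbatim from this description:

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an inference-time BatchNorm2d into the preceding Conv2d (groups=1).

    Valid only after training, when the BN layer uses its moving-average
    statistics (bn.eval()).
    """
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # one factor per output channel
    folded = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        folded.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        folded.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return folded

# Quick check that the folded convolution matches conv followed by BN.
conv, bn = nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16).eval()
x = torch.randn(1, 16, 32, 32)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))
```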
The pooling input mixer is implemented as a pooling layer in which the output shape is equal to the input shape. The pooling input mixer has linear complexity with respect to the input size. The pooling input mixer does not require input reshaping (unlike a self-attention layer), which contributes to the latency reduction. The pooling input mixer does not include trainable parameters, which reduces memory usage.
A channel scaler is included in the transformer block in two places: after the pooling input mixer and after the MLP. The channel scaler is a lightweight layer used to scale each channel with learnable parameters. A learnable parameter may also be referred to as a trainable parameter. Without the channel scaler, replacing self-attention with a simple pooling input mixer would cause a performance drop; the channel scaler mitigates this performance drop without a significant increase in latency and without a significant increase in memory.
An MLP is included in the transformer block. The MLP is implemented efficiently with convolutions (with 4D input shape and 4D output shape).
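For illustration only, a PyTorch-style sketch of an MLP built from 1×1 convolutions that keeps 4D shapes throughout follows; the expansion ratio and GELU activation are assumptions:

```python
import torch
import torch.nn as nn

# Sketch of an MLP implemented with 1x1 convolutions so that the input and
# output stay 4D (B, C, H, W) and no reshaping is needed.
class ConvMLP(nn.Module):
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

y = ConvMLP(64)(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32]) -- same 4D shape as the input
```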
An example use case of the U-shaped network is converting raw images captured by high-resolution phone cameras or other devices to RGB images. This conversion, in some embodiments, includes demosaicing.
In performing a benchmark demosaicing operation with an input size of 1024×1024, embodiments reduce the required GMACs (1 GMAC is 1 billion multiply-and-accumulate operations) from an order of 200 to an order of 10, reduce the number of trainable parameters from about 30 million to about 1 million, reduce the memory requirement from an order of 800 million bytes to an order of 100 million bytes, reduce latency from an order of 270 ms (milliseconds) to an order of 30 ms, increase PSNR in a first benchmark from 40 dB to about 42 dB, and increase PSNR in a second benchmark from about 45 dB to about 51 dB.
A second example use of the U-shaped network is denoising of raw or RGB images captured by high-resolution cameras.
In performing a benchmark denoising, embodiments reduce required GMACs from an order of 44 to an order of 15, reduce a number of learnable parameters from an order of 20 to an order of 10, reduce latency on a server GPU from an order of 90 to an order of 11, and increase PSNR from 39.77 to 39.79 (about the same PSNR).
Also provided herein is a non-transitory computer readable medium storing instructions configured to cause one or more processors to: operate on an input image using a U-shaped network to produce an output image, wherein an encoder of the U-shaped network contributes to a decoder of the U-shaped network using a skip connection based on element-wise addition, the encoder is a first instance of a transformer block, the decoder is a second instance of the transformer block, and wherein the transformer block comprises: a pooling input mixer followed by a first channel scaler, and a multi-layer perceptron followed by a second channel scaler.
Embodiments of the application will now be described with the aid of the drawings.
In some embodiments, the image restoration is raw-to-RGB demosaicing. A demosaicing algorithm is a digital image process used to reconstruct a full-color image from the incomplete color samples output by an image sensor. In some embodiments, the image sensor is a Bayer-pixel-layout image sensor. A demosaic error is the difference between a noiseless ground-truth RGB image and the output image.
Description of embodiments is provided with reference to the drawings.
The U-shaped network 400 is formed, in part, of encoders, decoders, and a bottleneck. Each encoder, each decoder, and the bottleneck is an instance of the transformer block 30.
The input 31 of a transformer block 30 is coupled to a batch normalization 32.
The output 41 of the batch normalization 40 is fed to the MLP 42.
Input image 1 is processed by input projection 401 to obtain output 402. Output 402 is processed by an instance of transformer block 30, here denoted as efficient transformer encoder 403. The output 404 passes via a skip connection to summation node 419. The output 404 is also downsampled by downsampler 405. Downsampler 405, in an example, is implemented by a two-dimensional convolutional layer with kernel size=4, stride=2, and a number of output channels equal to 2× the number of input channels. Signals propagate downward in the downsampling direction to further layers of transformer blocks, skip connections, and downsamplers (see items 406 and 407).
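For illustration only, a PyTorch-style sketch of such a downsampler follows; the padding of 1 is an assumption chosen here so that the spatial resolution is exactly halved:

```python
import torch
import torch.nn as nn

# Downsampler 405 sketch: kernel size 4, stride 2, and twice as many output
# channels as input channels, as stated above. The padding of 1 is an assumption.
downsampler = nn.Conv2d(in_channels=32, out_channels=64,
                        kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 32, 256, 256)
print(downsampler(x).shape)   # torch.Size([1, 64, 128, 128]) -- half the resolution, double the channels
```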
After the lowermost downsampler, data is processed by an instance of transformer block 30 serving as the bottleneck block.
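For illustration only, a structural PyTorch-style sketch of this data flow with a single encoder/decoder level follows. The transformer block instances are replaced by identity placeholders, and the transposed-convolution upsampler, kernel sizes, and channel width are assumptions made for the sketch:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Structural sketch of U-shaped network 400 with one encoder/decoder level.

    Transformer block 30 instances are replaced by identity placeholders so that
    only the data flow is illustrated; the upsampler, kernel sizes, and width
    are assumptions.
    """
    def __init__(self, image_channels: int = 3, width: int = 32):
        super().__init__()
        block = nn.Identity                                                      # stand-in for transformer block 30
        self.input_projection = nn.Conv2d(image_channels, width, 3, padding=1)   # item 401
        self.encoder = block()                                                   # item 403
        self.downsampler = nn.Conv2d(width, 2 * width, 4, stride=2, padding=1)   # item 405
        self.bottleneck = block()                                                # bottleneck block
        self.upsampler = nn.ConvTranspose2d(2 * width, width, 2, stride=2)       # item 418 (assumed implementation)
        self.decoder = block()                                                   # item 420
        self.output_projection = nn.Conv2d(width, image_channels, 3, padding=1)  # item 422

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.input_projection(x)
        skip = self.encoder(x)                     # output 404, carried by the skip connection
        x = self.bottleneck(self.downsampler(skip))
        x = self.upsampler(x) + skip               # summation node 419: element-wise addition
        return self.output_projection(self.decoder(x))

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 3, 64, 64])
```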
In some embodiments, the pooling input mixer (item 34) does not comprise learnable parameters.
In some embodiments, an input shape (B,C,H,W) of the pooling input mixer (item 34) is equal to an output shape (B,C,H,W) of the pooling input mixer.
In some embodiments, the U-shaped network (item 400) does not apply concatenation to the skip connection (item 404), thereby reducing latency of the apparatus by avoiding a dimensionality expansion.
In some embodiments, the pooling input mixer (item 34) is preceded in the transformer block (item 30) by a batch normalization (item 32), and the batch normalization is configured to be foldable into a preceding linear transformation.
In some embodiments, the transformer block (item 30) comprises a first stage (items 32, 34, 36 and 38) and a second stage (items 40, 42, 44 and 46), the first stage comprises a first batch normalization (item 32), the pooling input mixer (item 34), the first channel scaler (item 36), and a first summation node (item 38), wherein the first summation node operates on an input to the transformer block (30) and an output of the first channel scaler (36), and the second stage comprises a second batch normalization (item 40), the multi-layer perceptron (item 42) followed by the second channel scaler, and a second summation node (item 46), wherein the second summation node operates on an output of the first summation node and an output of the second channel scaler to produce an output of the transformer block (item 30).
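Collecting the two stages, a self-contained PyTorch-style sketch of transformer block 30 follows; the average-pooling mixer, channel-scaler initialization value, MLP expansion ratio, and GELU activation are assumptions made for the sketch:

```python
import torch
import torch.nn as nn

class ChannelScaler(nn.Module):
    """One learnable scale per channel, multiplied over the channel dimension."""
    def __init__(self, channels: int, init_value: float = 1e-2):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init_value))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale.reshape(1, -1, 1, 1)

class EfficientTransformerBlock(nn.Module):
    """Two-stage transformer block 30 without self-attention (illustrative sketch)."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        # First stage (items 32, 34, 36, 38)
        self.norm1 = nn.BatchNorm2d(channels)
        self.mixer = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)   # pooling input mixer 34
        self.scale1 = ChannelScaler(channels)                           # first channel scaler 36
        # Second stage (items 40, 42, 44, 46)
        self.norm2 = nn.BatchNorm2d(channels)
        self.mlp = nn.Sequential(                                       # MLP 42 (1x1 convolutions)
            nn.Conv2d(channels, channels * expansion, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels * expansion, channels, kernel_size=1),
        )
        self.scale2 = ChannelScaler(channels)                           # second channel scaler 44

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.scale1(self.mixer(self.norm1(x)))   # first summation node 38
        x = x + self.scale2(self.mlp(self.norm2(x)))     # second summation node 46
        return x

block = EfficientTransformerBlock(channels=32)
print(block(torch.randn(1, 32, 128, 128)).shape)   # torch.Size([1, 32, 128, 128])
```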
Based on the configuration and structure of the U-shaped network 400, a latency of the apparatus is reduced because the batch normalization is latency-favorable on GPU hardware, a memory footprint is reduced by avoiding concatenation at the skip connection, and avoiding concatenation also allows a computation of a layer output to have reduced latency because no increase in array dimensions occurs.
In some embodiments, a first decoder (item 420) at a same level of the U-shaped network (400) as a first encoder (item 403) is identical to the first encoder except for its learned parameters.
Also disclosed is a method using the U-shaped network 400.
In some embodiments, the first encoder comprises a transformer, and the method further comprises performing, by the first encoder, a first batch normalization (item 32) of the projected input image to obtain a first batch normalization output.
In some embodiments, the method further includes performing, by the first encoder, a pooling input mixer operation (item 34) on the first batch normalization output to obtain a pooling input mixer output, wherein the intermediate feature vector is based in part on the pooling input mixer output.
In some embodiments, the method further includes performing, by the first encoder, a first channel scaling operation (item 36), wherein the first channel scaling operation is on the pooling input mixer output to obtain a first channel scaler output, wherein the intermediate feature vector is based in part on the first channel scaler output.
In some embodiments, the method further includes providing the first channel scaler output to a first summation node to obtain a first summation value; providing the first summation value to the second summation node (see items 39 and 46); performing, by the second summation node, the element-wise addition to obtain a second summation value (item 47); and providing the second summation value as an input to the last decoder (item 420).
In some embodiments, the method further includes performing, by the first encoder, a second batch normalization (item 40) of the first summation value (item 39) to obtain a second batch normalization output (item 41), wherein the intermediate feature vector is based in part on the second batch normalization output.
In some embodiments, the method further includes performing, by the first encoder, a multilayer perceptron (MLP) operation (item 42) on the second batch normalization output (item 41) to obtain an MLP output, wherein the intermediate feature vector is based in part on the MLP output.
In some embodiments, the method further includes performing, by the first encoder, a second channel scaling operation (item 44), wherein the second channel scaling operation is on the MLP output (item 43) to obtain a second channel scaler output, wherein the intermediate feature vector is based in part on the second channel scaler output.
In some embodiments, performing the element-wise addition (see item 419) comprises summing, in element-wise fashion, the second channel scaler output (see the intermediate feature vector output from item 403) with an upsampler (item 418) output to provide an input to the last decoder (item 420).
In some embodiments, the input image has a resolution of 1024 by 1024 pixels, performing a demosaicing as a restoration of the input image using the method requires approximately 10 billion multiply-and-accumulate operations, a latency of the restoration is approximately 30 milliseconds, and a peak signal-to-noise ratio of the output image is approximately 42 dB.
In some embodiments, a number of trainable parameters in the U-shaped network is about 1 million.
Further details of blocks within the transformer block 30 are now described.
The pooling input mixer 34, as an example, may be implemented by a two-dimensional average pooling layer with kernel size=3 and padding=1, except in the first two encoder blocks and the last two decoder blocks, where it may be implemented by a two-dimensional maximum pooling layer with kernel size=3 and padding=1.
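For illustration only, a short PyTorch-style sketch follows; the stride of 1 is an assumption needed for the output shape to equal the input shape:

```python
import torch
import torch.nn as nn

# Pooling input mixer sketch: kernel size 3 and padding 1 as stated above; the
# stride of 1 is an assumption so that the output shape equals the input shape.
avg_mixer = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
max_mixer = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 32, 128, 128)
print(avg_mixer(x).shape, max_mixer(x).shape)           # both torch.Size([1, 32, 128, 128])
print(sum(p.numel() for p in avg_mixer.parameters()))   # 0 -- no trainable parameters
```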
The channel scaler layer, as an example, may be implemented using a vector of trainable parameters of a size equal to the number of input channels, initialized to a user-defined value before training. The output of the channel scaler layer is the element-wise multiplication of the parameters by the input feature maps over the channel dimension.
The input projection 401 is a convolutional layer. The input projection projects the input features into feature maps with D (a hyper-parameter set by the user) channels. In some embodiments, the input projection layer is implemented by a two-dimensional convolutional layer.
Output projection 422 is realized by a convolutional layer. The output projection projects the feature maps output from the last decoder into the output features, with number of channels equal to the number of input image channels.
The trainable parameters of the U-shaped network 400 are trained offline before deployment in the device, such as a smartphone.
Training is performed, for example, using supervised learning with backpropagation, given batches of raw images and corresponding ground-truth restored images.
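For illustration only, a PyTorch-style training-loop sketch follows; the model and dataset names are placeholders, and the L1 loss, Adam optimizer, batch size, and learning rate are assumptions not taken from this description:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Offline supervised training sketch. `model` stands for the U-shaped network 400
# and `dataset` for pairs of raw images and ground-truth restored images.
def train(model: nn.Module, dataset, epochs: int = 100, lr: float = 2e-4):
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    model.train()
    for _ in range(epochs):
        for raw, ground_truth in loader:
            optimizer.zero_grad()
            loss = criterion(model(raw), ground_truth)
            loss.backward()        # backpropagation
            optimizer.step()
    return model
```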
Trainable parameters exist in the batch normalization instances, the channel scaler instances, the MLP instances, the input projection 401, the output projection 422, the downsampler layers and the upsampler layers.
Training is done by initially estimating the number of layers to use in the U-shaped network 400 and creating the artificial intelligence machine 400 by training over data sets. The performance of the resulting U-shaped network 400 is then compared with goals and constraints. If the goals are met and the constraints are not violated, the design is complete. Otherwise, adjustments are made to the batch size B, the number of channels C, and the number of layers in the U-shaped network 400, and the training is repeated.
Hardware for performing embodiments provided herein is now described.
Embodiments may be deployed on various computers, servers or workstations.
Apparatus 119 also may include a user interface 115 (for example a display screen and/or keyboard and/or pointing device such as a mouse) and an image sensor 111. Apparatus 119 may include one or more volatile memories 112 and one or more non-volatile memories 113. The one or more non-volatile memories 113 may include a non-transitory computer readable medium storing instructions for execution by the one or more hardware processors 118 to cause apparatus 119 to perform any of the methods of embodiments disclosed herein.
In some embodiments, a smartphone including one or more processors includes a camera app, a user interface, and the apparatus described above.
This application claims benefit of priority to U.S. Provisional Application No. 63/471,727 filed in the USPTO on Jun. 7, 2023. The content of the above application is hereby incorporated by reference.