The disclosure herein relates to the field of deep learning networks for resource allocation in video processing.
Machine learning systems provide critical tools for advancing new technologies, including image processing, computer vision, automatic speech recognition, and autonomous vehicles. Video constitutes approximately 70% of Internet traffic in the U.S. New video encoding standards have been developed to reduce the bandwidth requirement. Newer encoding standards, such as H.265 (also known as High Efficiency Video Coding) and the Alliance for Open Media's AV1, reduce the bandwidth requirement relative to previous generations.
However, increases in video resolution and refresh rate increase the bandwidth required to transfer the encoded video streams over the Internet. In addition, the last mile of the Internet can be a mobile network, and even in a 5G network the bandwidth can fluctuate and cannot be guaranteed. Furthermore, there currently exists no guaranteed Quality of Service (QoS) on the Internet, which adds to the video delivery problem.
Among other technical advantages and benefits, solutions herein provide for allocating video scaling computational processing resources in an artificial intelligence deep learning model in a manner that optimizes generation, transmission, and rendering of video content amongst a video generating server device and one or more video rendering devices within a communication network system. In particular, the disclosure herein introduces a new method of device artificial intelligence (AI)-resource aware jointly optimized networks for deep-learning-based downscaling at the video generating server in conjunction with deep-learning-based upscaling at the video rendering device side. AI and deep learning are used interchangeably as referred to herein.
In accordance with a first example embodiment, a method of allocating video scaling resources across devices of a communication network is provided. The method includes estimating a first set of layers of the deep learning model for downscaling of a video content stream generated at a video server of the communication network; estimating a second set of layers of the deep learning model for upscaling of the video content stream at a rendering device of the communication network; and allocating resources amongst the video server and the rendering device at least partly in accordance with the first set and the second set, wherein the allocating minimizes resources allocated for upscaling at the rendering device and optimizes the quality of the video displayed at the rendering device.
In accordance with a second example embodiment, a non-transitory memory including instructions executable in one or more processors is provided. The instructions are executable to estimate a first set of layers of the deep learning model for downscaling of a video content stream generated at a video server of the communication network; estimate a second set of layers of the deep learning model for upscaling of the video content stream at a rendering device of the communication network; and allocate resources amongst the video server and the rendering device at least partly in accordance with the first set and the second set, wherein the allocating minimizes resources allocated for upscaling at the rendering device and optimizes the quality of the video displayed at the rendering device.
One or more embodiments described herein provide that methods, techniques, and actions performed by a computing device are performed programmatically, or as a computer-implemented method. Programmatically, as used herein, means through the use of code or computer-executable instructions. These instructions can be stored in one or more memory resources of the computing device.
Furthermore, one or more embodiments described herein may be implemented through the use of logic instructions that are executable by one or more processors. These instructions may be carried on a computer-readable medium. In particular, machines shown with embodiments herein include processor(s), various forms of memory for storing data and instructions, including interface and associated circuitry. Examples of computer-readable mediums and computer storage mediums include flash memory and portable memory storage units. A processor device as described herein utilizes memory, and logic instructions stored on computer-readable medium. Embodiments described herein may be implemented in the form of computer processor-executable logic instructions in conjunction with programs stored on computer memory mediums, and in varying combinations of hardware in conjunction with the processor-executable instructions or code.
Video resolutions are typically specified as 8K, 4K, 1080p, 720p, 540p, 360p, etc. Higher resolution requires more bandwidth, given the same refresh rate and the same encoding standard. For example, 4K@30 fps (3840×2160) means 3840×2160 pixels/frame at 30 frames per second and needs up to 4 times more bitrate than 1080p@30 fps (1920×1080 pixels/frame at 30 frames per second). As an example, 1080p@30 fps needs 4 Mbps while 4K@30 fps needs up to 16 Mbps.
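The "up to 4 times more bitrate" figure above follows from the pixel-count ratio between the two resolutions, since bitrate scales roughly with pixels per frame at a fixed refresh rate and encoding standard. A minimal check:

```python
# Pixel-count ratio between 4K and 1080p frames; encoded bitrate scales
# roughly with pixel count at a fixed refresh rate and encoding standard.
def pixels_per_frame(width: int, height: int) -> int:
    return width * height

ratio = pixels_per_frame(3840, 2160) / pixels_per_frame(1920, 1080)
print(ratio)  # 4.0, consistent with "up to 4 times more bitrate"
```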
When streaming video of a particular resolution, the streaming service also needs to prepare lower resolutions in case of fluctuating Internet bandwidth. For example, while streaming at 4K@30 fps, the video streaming service also needs to be prepared to stream 1080p@30 fps, 720p@30 fps, or even 360p@30 fps. Downscaling is required to convert the original 4K@30 fps video or image to a smaller-resolution video. In case of Internet bandwidth issues, a downscaled and encoded video is sent over the Internet instead of the encoded version of the original video resolution. The smaller the available Internet bandwidth, the smaller the resolution of the video needed to accommodate it.
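The rendition selection described above can be sketched as a simple bitrate ladder lookup. The 16 Mbps and 4 Mbps figures come from the example earlier in this section; the 720p and 360p bitrates below are illustrative assumptions, not values from the disclosure:

```python
# Illustrative bitrate ladder, ordered highest resolution first.
# 4K and 1080p bitrates are from the example above; 720p and 360p are assumed.
LADDER = [
    ("4K@30fps", 16.0),
    ("1080p@30fps", 4.0),
    ("720p@30fps", 2.0),   # assumed bitrate
    ("360p@30fps", 0.7),   # assumed bitrate
]

def pick_rendition(available_mbps: float) -> str:
    """Return the highest-resolution rendition fitting the available bandwidth."""
    for name, mbps in LADDER:
        if mbps <= available_mbps:
            return name
    return LADDER[-1][0]  # fall back to the smallest rendition

print(pick_rendition(5.0))   # 1080p@30fps
print(pick_rendition(0.5))   # 360p@30fps (smallest available)
```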
While bandwidth is reduced during the downscaling and encoding process, picture or video quality is also compromised as any downscaling typically introduces information loss to some extent.
A good quality video or image upscaler may be used at the device side to recover some of the video quality lost due to the downscaling process at the video server side. A deep learning-based upscaling solution may be used to improve the upscaled video quality. Deep learning-based upscaling is shown in
While deep learning-based upscaling offers very good video quality, it is often very expensive to implement the deep learning-based upscaler in hardware (e.g., CPU, GPU, or hardware accelerators). Furthermore, when upscaling from a lower resolution to a very high resolution, for example 1080p to 8K, or 720p or lower resolutions to 4K, even deep learning-based upscaling can struggle to improve video quality.
As noted, deep learning-based scaling is often very expensive to implement in hardware (e.g., CPU, GPU, or hardware accelerators). In particular, the device side is limited in its ability to support complex deep learning networks due to power and cost constraints.
The disclosure herein introduces a new method of device AI-resource aware jointly optimized networks for both deep-learning-based downscaling and upscaling. AI and deep learning are used interchangeably as referred to herein.
Device AI resources (or deep learning resources) can be specified in MACs/s (multiply-accumulate operations per second). As an example of how MACs/s is calculated for one 3×3 convolution layer of a deep learning network: for a 480×270-pixel layer with 3×3 convolution kernels, 128 input channels, and 128 output channels, the total MACs required per frame is 480×270×3×3×128×128 = 19,110,297,600 MACs per frame, or 573,308,928,000 MACs/s at 30 frames per second (˜0.573 Tera MACs/s). If a device only has 2 Tera MACs/s of AI hardware resources, then it can only run about 3 layers of similar computational complexity. A deep learning network typically has many layers. AI resources are typically very expensive and power hungry in devices, and hence devices cannot run many deep learning network layers.
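The MACs calculation above can be reproduced directly (assuming stride 1 and output resolution equal to input resolution, as the example implies):

```python
def conv_layer_macs_per_frame(width, height, kernel, in_ch, out_ch):
    """MACs for one square-kernel convolution layer over a single frame
    (stride 1 assumed, output resolution equal to input resolution)."""
    return width * height * kernel * kernel * in_ch * out_ch

macs_per_frame = conv_layer_macs_per_frame(480, 270, 3, 128, 128)
print(macs_per_frame)              # 19110297600 MACs per frame
print(macs_per_frame * 30 / 1e12)  # ~0.573 Tera MACs/s at 30 fps
```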
In this novel proposed technique, a minimum AI resource budget (in MACs/s) is assumed on the device side; an example can be 1 or 2 Tera MACs/s. On the video server side, typically more AI resources can be used, and the server also has a larger power budget.
This new technique of deep learning networks is developed with the following constraints and goals: 1) device AI-resource awareness, with the goal of minimizing the device-side AI resources; and 2) joint optimization of both deep learning-based downscaling on the video server side and deep learning-based upscaling on the device side, with the goal of achieving video quality as close to the original video as possible.
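One illustrative way to express this constrained joint optimization (the symbols below are an assumed formulation for exposition, not notation from the disclosure) is:

```latex
\min_{\theta_d,\,\theta_u}\; \mathbb{E}_{x}\!\left[\,\mathcal{L}\big(x,\; U_{\theta_u}(D_{\theta_d}(x))\big)\right]
\quad \text{subject to} \quad \mathrm{MACs}(U_{\theta_u}) \le B_{\mathrm{device}},
```

where $D_{\theta_d}$ is the server-side downscaling network, $U_{\theta_u}$ is the device-side upscaling network, $\mathcal{L}$ is a distortion measure over original frames $x$ (e.g., one related to PSNR or SSIM), and $B_{\mathrm{device}}$ is the device's AI resource budget in MACs/s.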
Subjective video quality metrics and objective quality metrics (such as PSNR or SSIM) are typically used to evaluate the end-to-end video quality.
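For reference, PSNR is computed from the mean squared error between the original and reconstructed frames. A minimal pure-Python sketch over flattened 8-bit pixel values:

```python
import math

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between flattened reference and
    test frames of equal length."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical frames: no distortion
    return 10.0 * math.log10(max_val ** 2 / mse)

print(psnr([100, 100], [100, 100]))  # inf (identical content)
```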
In this case, about 30% of the overall AI computational resources are on the AI-upscaling side in the devices, and 70% of the overall AI computational resources are on the AI-downscaling side in the video server. The video encoder and video decoder are omitted in FIGS. 5 and 6 for simplicity. In practice, a video encoder and decoder are typically required, as shown in
Based on the foregoing examples of
In alternative implementations, at least some hard-wired circuitry may be used in place of, or in combination with, the software logic instructions to implement examples described herein. Thus, the examples described herein are not limited to any particular combination of hardware circuitry and software instructions. Additionally, it is also contemplated that in alternative embodiments, the techniques herein, or portions thereof, may be distributed between several processors working in conjunction.
In
At step 710, estimating a first set of layers of the deep learning model for downscaling of a video content stream generated at a video server of the communication network.
In one aspect, the deep learning model may be a trained deep learning model, and in a further variation, a convolution deep learning model.
The video content, in one embodiment, may be generated at the server device, in accordance with the downscaling and the allocating for transmission to the rendering device.
At step 720, estimating a second set of layers of the deep learning model for upscaling of the video content stream at a rendering device of the communication network.
At step 730, allocating resources amongst the video server and the rendering device at least partly in accordance with the first set and the second set.
In some variations, one or more layers of the first set of layers for downscaling the video content at the video server, and likewise of the second set of layers for upscaling the video content at the rendering device, may respectively be associated with a number of input channels and a number of output channels of the deep learning model.
In a further aspect, the allocating of computational processing resources for the downscaling and the upscaling may be based on the AI resources required for all the layers in an AI network. The AI resource requirement of each layer depends on the convolution kernel size, the resolution of the image at the layer, the number of input channels, and the number of output channels of the layer.
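One way this per-layer costing could feed the allocation step is a greedy split: keep upscaling layers on the device until its AI budget is exhausted, and assign the rest to the server. This is an illustrative heuristic sketch, not the disclosure's specific allocation algorithm; the layer specs are hypothetical:

```python
# Hypothetical layer spec: (width, height, kernel, in_ch, out_ch).
def layer_tmacs_per_s(w, h, k, cin, cout, fps=30):
    """Per-layer cost in Tera MACs/s (stride 1, square kernel assumed)."""
    return w * h * k * k * cin * cout * fps / 1e12

def split_upscaler_layers(layers, device_budget_tmacs, fps=30):
    """Greedy sketch: fill the device's AI budget with upscaling layers;
    remaining layers fall to the server side."""
    device, server = [], []
    used = 0.0
    for spec in layers:
        cost = layer_tmacs_per_s(*spec, fps=fps)
        if used + cost <= device_budget_tmacs:
            device.append(spec)
            used += cost
        else:
            server.append(spec)
    return device, server

# Five layers of ~0.573 TMACs/s each (the worked example from earlier).
layers = [(480, 270, 3, 128, 128)] * 5
device, server = split_upscaler_layers(layers, device_budget_tmacs=2.0)
print(len(device), len(server))  # 3 2 — a 2 TMACs/s device fits ~3 such layers
```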
In response to the allocating, the method may further comprise upscaling the video content at the rendering device based on the allocating for display thereon. In embodiments, the rendering device may comprise any one or more of a television display device, a laptop computer, and a mobile phone, or similar video or image rendering devices.
In this manner, the allocating of video scaling and processing resources minimizes resources allocated for upscaling at the rendering device and optimizes the video quality of the video displayed at the rendering device.
In some embodiments, it is contemplated that the resource allocation techniques disclosed herein may be implemented in one or more of a field-programmable gate array (FPGA) device, a graphics processing unit (GPU) device, a central processing unit (CPU) device, and an application-specific integrated circuit (ASIC).
It is contemplated that embodiments described herein be extended and applicable to individual elements and concepts described herein, independently of other concepts, ideas or system, as well as for embodiments to include combinations of elements in conjunction with combinations of steps recited anywhere in this application. Although embodiments are described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments. As such, many modifications and variations will be apparent to practitioners skilled in this art. Accordingly, it is intended that the scope of the invention be defined by the following claims and their equivalents. Furthermore, it is contemplated that a particular feature described either individually or as part of an embodiment can be combined with other individually described features, or parts of other embodiments, even if the other features and embodiments make no mention of the particular feature. Thus, any absence of describing combinations does not preclude the inventors from claiming rights to such combinations.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CA2020/050791 | 6/10/2020 | WO | 00

Number | Date | Country
---|---|---
62859987 | Jun 2019 | US