The present disclosure relates generally to images. More particularly, an embodiment of the present invention relates to workload allocation and processing in cloud-based coding of high dynamic range (HDR) video.
As used herein, the term ‘dynamic range’ (DR) may relate to a capability of the human visual system (HVS) to perceive a range of intensity (e.g., luminance, luma) in an image, e.g., from darkest grays (blacks) to brightest whites (highlights). In this sense, DR relates to a ‘scene-referred’ intensity. DR may also relate to the ability of a display device to adequately or approximately render an intensity range of a particular breadth. In this sense, DR relates to a ‘display-referred’ intensity. Unless a particular sense is explicitly specified to have particular significance at any point in the description herein, it should be inferred that the term may be used in either sense, e.g. interchangeably.
As used herein, the term high dynamic range (HDR) relates to a DR breadth that spans the 14-15 orders of magnitude of the human visual system (HVS). In practice, the DR over which a human may simultaneously perceive an extensive breadth in intensity range may be somewhat truncated, in relation to HDR. As used herein, the terms visual dynamic range (VDR) or enhanced dynamic range (EDR) may individually or interchangeably relate to the DR that is perceivable within a scene or image by a human visual system (HVS) that includes eye movements, allowing for some light adaptation changes across the scene or image. As used herein, VDR may relate to a DR that spans 5 to 6 orders of magnitude. Thus, while perhaps somewhat narrower in relation to true scene referred HDR, VDR or EDR nonetheless represents a wide DR breadth and may also be referred to as HDR.
In practice, images comprise one or more color components (e.g., luma Y and chroma Cb and Cr) wherein each color component is represented by a precision of n-bits per pixel (e.g., n=8). For example, using gamma luminance coding, images where n≤8 (e.g., color 24-bit JPEG images) are considered images of standard dynamic range, while images where n≥10 may be considered images of enhanced dynamic range. HDR images may also be stored and distributed using high-precision (e.g., 16-bit) floating-point formats, such as the OpenEXR file format developed by Industrial Light and Magic.
Most consumer desktop displays currently support luminance of 200 to 300 cd/m2 or nits. Most consumer HDTVs range from 300 to 500 nits with new models reaching 1,000 nits (cd/m2). Such conventional displays thus typify a lower dynamic range (LDR), also referred to as a standard dynamic range (SDR), in relation to HDR. As the availability of HDR content grows due to advances in both capture equipment (e.g., cameras) and HDR displays (e.g., the PRM-4200 professional reference monitor from Dolby Laboratories), HDR content may be color graded and displayed on HDR displays that support higher dynamic ranges (e.g., from 1,000 nits to 5,000 nits or more).
As used herein, the term “forward reshaping” denotes a process of sample-to-sample or codeword-to-codeword mapping of a digital image from its original bit depth and original codewords distribution or representation (e.g., gamma, PQ, HLG, and the like) to an image of the same or different bit depth and a different codewords distribution or representation. Reshaping allows for improved compressibility or improved image quality at a fixed bit rate. For example, without limitation, reshaping may be applied to 10-bit or 12-bit PQ-coded HDR video to improve coding efficiency in a 10-bit video coding architecture. In a receiver, after decompressing the received signal (which may or may not be reshaped), the receiver may apply an “inverse (or backward) reshaping function” to restore the signal to its original codeword distribution and/or to achieve a higher dynamic range.
In many video-distribution scenarios, HDR video may be coded in a multi-processor environment, typically referred to as a “cloud computing server.” In such an environment, trade-offs among ease of computing, workload balance among the computing nodes, and video quality, may force reshaping-related metadata to be updated on a frame-by-frame basis, which may result in unacceptable overhead, especially when transmitting video at low bit rates. As appreciated by the inventors here, improved techniques for workload allocation and node-based processing to improve the quality of coded video in a cloud-based environment while minimizing the overhead of reshaping-related metadata are desired.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
An embodiment of the present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
Methods for workload allocation and node-based processing in cloud-based video coding of HDR video are described herein. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or obfuscating the present invention.
Example embodiments described herein relate to cloud-based reshaping and coding for HDR images. In an embodiment, in a cloud-based system for encoding HDR video, a node is assigned to be a dispatcher node that segments the input video into scenes and generates a scene-to-segment allocation to be used by other computing nodes. A processor in the dispatcher node:
receives a sequence of scenes, wherein each scene comprises one or more video frames; and
performs one or more assignment iterations to generate a best output assignment, wherein performing the one or more assignment iterations comprises:
In another embodiment, for a node among the M computing nodes, a processor in the node accesses, according to a scene-to-segment assignment, a scene assigned to the node, the scene comprising a sequence of high dynamic range (HDR) frames and a sequence of corresponding standard dynamic range (SDR) frames, and generates an output bitstream and corresponding reshaping metadata using a scene-based forward reshaping function and a scene-based backward reshaping function.
Example HDR Coding System
Under this framework, given reference HDR content (120) and corresponding reference SDR content (125) (that is, content that represents the same images as the HDR content, but color-graded and represented in standard dynamic range), reshaped HDR content (134) is encoded and transmitted as SDR content in a single layer of a coded video signal (144) by an upstream encoding device that implements the encoder architecture. The SDR content is received and decoded, in the single layer of the video signal, by a downstream decoding device that implements the decoder architecture. Backward-reshaping metadata (152) is also encoded and transmitted in the video signal with the reshaped content so that HDR display devices can reconstruct HDR content based on the (reshaped) SDR content and the backward reshaping metadata. Without loss of generality, in some embodiments, as in non-backward-compatible systems, reshaped SDR content may not be watchable on its own, but must be watched in combination with the backward reshaping function, which will generate watchable SDR or HDR content. In other embodiments which support backward compatibility, legacy SDR decoders can still play back the received SDR content without employing the backward reshaping function.
As illustrated in
Examples of backward reshaping metadata representing/specifying the optimal backward reshaping functions may include, but are not necessarily limited to, any of: an inverse tone mapping function, inverse luma mapping functions, inverse chroma mapping functions, lookup tables (LUTs), polynomials, inverse display management coefficients/parameters, etc. In various embodiments, luma backward reshaping functions and chroma backward reshaping functions may be derived/optimized jointly or separately, and may be derived using a variety of techniques, for example, and without limitation, as described later in this disclosure.
The backward reshaping metadata (152), as generated by the backward reshaping function generator (150) based on the reshaped SDR images (134) and the target HDR images (120), may be multiplexed as part of the video signal 144, for example, as supplemental enhancement information (SEI) messaging.
In some embodiments, backward reshaping metadata (152) is carried in the video signal as a part of overall image metadata, which is separately carried in the video signal from the single layer in which the SDR images are encoded in the video signal. For example, the backward reshaping metadata (152) may be encoded in a component stream in the coded bitstream, which component stream may or may not be separate from the single layer (of the coded bitstream) in which the SDR images (134) are encoded.
Thus, the backward reshaping metadata (152) can be generated or pre-generated on the encoder side to take advantage of powerful computing resources and offline encoding flows (including but not limited to content adaptive multiple passes, look ahead operations, inverse luma mapping, inverse chroma mapping, CDF-based histogram approximation and/or transfer, etc.) available on the encoder side.
The encoder architecture of
In some embodiments, as illustrated in
Optionally, alternatively, or in addition, in the same or another embodiment, a backward reshaping block 158 extracts the backward (or forward) reshaping metadata (152) from the input video signal, constructs the backward reshaping functions based on the reshaping metadata (152), and performs backward reshaping operations on the decoded SDR images (156) based on the optimal backward reshaping functions to generate the backward reshaped images (160) (or reconstructed HDR images). In some embodiments, the backward reshaped images represent production-quality or near-production-quality HDR images that are identical to or closely/optimally approximating the reference HDR images (120). The backward reshaped images (160) may be outputted in an output HDR video signal (e.g., over an HDMI interface, over a video link, etc.) to be rendered on an HDR display device.
In some embodiments, display management operations specific to the HDR display device may be performed on the backward reshaped images (160) as a part of HDR image rendering operations that render the backward reshaped images (160) on the HDR display device.
Existing reshaping techniques may be frame-based, that is, new reshaping metadata is transmitted with each new frame, or scene-based, that is, new reshaping metadata is transmitted with each new scene. As used herein, the term “scene” for a video sequence (a sequence of frames/images) may relate to a series of consecutive frames in the video sequence sharing similar luminance, color and dynamic range characteristics. Scene-based methods work well in video-workflow pipelines which have access to the full scene; however, it is not unusual for content providers to use cloud-based multiprocessing, where, after dividing a video stream into segments, each segment is processed independently by a single computing node in the cloud. As used herein, the term “segment” denotes a series of consecutive frames in a video sequence. A segment may be part of a scene or it may include one or more scenes. Thus, processing of a scene may be split across multiple processors.
As discussed in Ref. [1], in certain cloud-based applications, under certain quality constraints, segment-based processing may necessitate generating reshaping metadata on a frame-by-frame basis, resulting in undesirable overhead. This may be an issue in very low bit-rate applications (e.g., lower than 1 Mbit/s).
As depicted in
Given a video source (202) for content distribution, typically referred to as a mezzanine file, the first stage node fetches video metadata (e.g., from an XML file) and: (a) in step 215, it determines the scene boundaries and (b) in step 220, it decides the scene-to-segment assignment list for each worker node. The main goal for the scene boundary determination is to make sure there is no significant luminance or color change during normal playback within one scene, including fade-ins, fade-outs, and dissolves. (A dissolve in video editing refers to a smooth transition from one image to another. Dissolves between a blank (or black) image and another image are also referred to as a fade-in or a fade-out.) A goal of the scene-to-segment assignment unit (220) is to ensure that one scene is not partitioned and encoded in two different computing nodes, which may cause sudden changes near segment boundaries. In addition, the assignment task should strive for uniform workload across all computing nodes (210).
At the second stage (210), each computing node receives its own scene-to-segment list (S2S list) (230) from stage one, and its own partial mezzanine (225) for the corresponding segment from the input video (202). Each node, in parallel, encodes its assigned segments and outputs a coded bitstream. Details for each processing task are discussed next.
Depending on the requirements of workload distribution in each node, there are two main scenarios of interest. In one embodiment, segments may have non-uniform length, which allows for non-uniform workload in different nodes. This is tailored for a scene-based solution where a scene cannot be partitioned to be encoded in more than one node. In another embodiment, segments have a fixed length, thus enforcing uniform workload across all nodes. Under the proposed dispatcher and worker node model, the proposed architecture can address both scenarios.
In the non-uniform segment-length scenario, each worker node may receive a different workload for processing. To enable scene-based encoding, the dispatcher node reads in the XML file (extracted from the mezzanine) and determines the scene boundaries, especially how to partition or merge frames within the fade-in, fade-out, and dissolve scenes into new scene-cut boundaries. Among those newly defined scenes, the dispatcher determines which scene should be encoded by which node. The output of this process will be a scene-to-segment (S2S) list (230). The main goal is to distribute the number of frames in each node as uniformly as possible. In an embodiment, without limitation, a metric to measure the uniformity in this stage is the standard deviation of the number of frames allocated in each node. A lower standard deviation implies a more uniform workload in each node. In an embodiment, the S2S list may be derived as the output of an optimization problem for best uniform load across all nodes under uninterrupted scene-processing constraints.
Scene cuts may be defined in the XML file of the video source (202), but typically such metadata define color-grading boundaries. For example, a scene-cut flag can be inserted during a dissolving scene for the convenience of color grading so that during playback the display management process will not distort the colors. However, these XML data do not take into consideration that the baseline data is reshaped and that reshaping may affect the final look within a dissolve. In an embodiment, to avoid such issues, a dissolve may be partitioned into multiple single-frame scenes to allow for a slow transition along the time domain. Note that this method will increase the bit rate of reshaping-related metadata during those special transition effects. The same techniques can be applied to fade-in and fade-out transitions.
When an XML file is not available, the dispatcher will need to identify scene cuts on its own, using any of the scene-cut detection techniques known in the art. For example, in an embodiment, one can measure the luminance change along the time domain and check whether the change occurs at a constant rate. Once a scene cut is detected, one can partition the entire scene cut into single frames, each frame representing a separate “scene.”
In addition to the above methods, to avoid false scene-cut boundaries, one can also consider soft transitions near the scene-cut boundaries. For example, for a detected scene cut, one can add a small number of single-frame “scenes” before the scene cut and a small number of single-frame “scenes” after the scene cut. Such a method will increase the bit rate of scene-based metadata.
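The soft-transition idea above can be illustrated with a minimal sketch. The function name `soften_scene_cut` and its parameters are hypothetical, and the choice of the same number `n` of single-frame scenes on each side of the cut is an assumption made for simplicity; the text only calls for "a small number" on each side.

```python
def soften_scene_cut(scene_starts, total_frames, cut, n):
    """Add n single-frame scenes before and n after a detected scene cut.

    scene_starts: sorted starting frame indices of the current scenes.
    cut: frame index at which the detected scene cut begins.
    n: number of single-frame scenes on each side (assumed symmetric).
    Returns the new sorted list of scene starting frame indices.
    """
    # Every frame index in [cut - n, cut + n] becomes a scene start, so frames
    # cut-n .. cut+n-1 each form their own scene and frame cut+n starts the rest.
    first = max(0, cut - n)
    last = min(total_frames, cut + n + 1)
    return sorted(set(scene_starts) | set(range(first, last)))


# Example: a 100-frame clip with a cut at frame 50, softened by 2 frames per side.
print(soften_scene_cut([0, 50], 100, 50, 2))
```

Each added scene start implies one more set of reshaping metadata, which is the bit-rate cost noted above.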
Given the scene-boundary decisions (215), the scene-to-segment unit (220) decides which scene should be included in which segment. This kind of assignment will yield a scene-to-segment assignment list (S2S list) (230). The dispatcher node will output one S2S list for each worker node.
Consider a video sequence with J total frames, grouped into K scenes. Denote the corresponding starting frame index for the k-th scene as Sk and denote the number of frames for the k-th scene as Dk, where k=0, 1, . . . , K−1. Thus:
$$D_k = S_{k+1} - S_k, \tag{1}$$

$$J = \sum_{k=0}^{K-1} D_k. \tag{2}$$
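As a quick numerical check of equations (1) and (2), the following sketch derives the scene lengths from the scene starting indices. The helper name `scene_lengths` is hypothetical, and treating the end of the sequence as S_K = J (so that the last scene also has a length) is an assumption.

```python
def scene_lengths(scene_starts, total_frames):
    """Return D_k = S_{k+1} - S_k for k = 0..K-1, assuming S_K = J."""
    bounds = list(scene_starts) + [total_frames]
    return [bounds[k + 1] - bounds[k] for k in range(len(scene_starts))]


starts = [0, 48, 100]           # S_0, S_1, S_2 (illustrative values)
lengths = scene_lengths(starts, 120)
print(lengths, sum(lengths))    # the lengths add up to J, as in equation (2)
```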
Denote the number of worker nodes as M. To distribute the scenes to each node, in an embodiment, the following rules may be imposed:
Denote the collection of scenes assigned to node m as Φm, where m=0, 1, . . . , M−1. Following the aforementioned rules, one can define the first scene index inside Φm as ϕm, where ϕm has value range between 0 and K−1. In an embodiment, to simplify implementation, a monotonically incremental rule may be enforced that is:
ϕm<ϕn when m<n
In an embodiment, ϕ0=0; thus, the first scene is always assigned to the first segment. When the number of scenes K is larger than the number of nodes M, ϕm must be unique, i.e., its value cannot be the same in any of the other nodes. This is to ensure no node has zero workload. When K<M, the simple solution is to assign one scene per node until the scenes are exhausted and to leave the rest of the nodes with zero scenes.
At the beginning of the operation, i.e. t=0, Ω(0) includes all scenes except the first scene (i.e., Ω(0)={ϕm|m=1, . . . , K−1}) and Ψ(0) contains only the first scene ϕ0. In the t-th iteration (t>0), one randomly selects one element from Ω(t-1), removes this element from Ω(t-1), and puts this chosen element to set Ψ(t). This process is repeated M−1 times until Ψ(t) contains M elements which are sorted in ascending order. The sorted Ψ(t) will be the output from this stage. Table 1 expresses this process in pseudocode.
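The random initialization just described can be sketched as follows. This is an illustrative reading of the text rather than the referenced pseudocode; the function name and the optional `rng` argument are hypothetical, and the sketch assumes the number of scenes is at least the number of nodes.

```python
import random


def random_initial_assignment(num_scenes, num_nodes, rng=None):
    """Pick the first-scene index of each node at random.

    Returns a sorted list of num_nodes scene indices that always starts with 0.
    Assumes num_scenes >= num_nodes so that every node gets at least one scene.
    """
    if rng is None:
        rng = random.Random()
    omega = list(range(1, num_scenes))  # candidate scenes, all but the first
    psi = [0]                           # the first scene always goes to node 0
    for _ in range(num_nodes - 1):
        pick = rng.choice(omega)        # select one remaining candidate at random
        omega.remove(pick)              # remove it from the candidate set
        psi.append(pick)                # and add it to the chosen set
    return sorted(psi)


print(random_initial_assignment(10, 3, random.Random(7)))
```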
As an example, consider a list of 10 scenes to be allocated in 3 nodes, with each scene having a variable number of frames as depicted below
Let the output of step 305 be Ψ(2)={0, 3, 8}; then, after this step, scenes are assigned to nodes (or segments) as follows:
Node 0: scenes 0-2
Node 1: scenes 3-7
Node 2: scenes 8-9
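The mapping from the set of first-scene indices to per-node scene ranges can be reproduced with a short sketch. The helper name `scenes_per_node` is hypothetical; the values are those of the example above.

```python
def scenes_per_node(first_scenes, num_scenes):
    """Expand sorted first-scene indices into the scene indices of each node."""
    bounds = list(first_scenes) + [num_scenes]  # append the end of the sequence
    return [list(range(bounds[m], bounds[m + 1])) for m in range(len(first_scenes))]


for node, scenes in enumerate(scenes_per_node([0, 3, 8], 10)):
    print(f"Node {node}: scenes {scenes[0]}-{scenes[-1]}")
```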
In step 310, this initial, random assignment (Ψ(M-1)), is further refined iteratively as shown in
In another embodiment, instead of starting the node iteration (e.g., steps 350, 355, and 360) at node 0 and moving forward, one may also begin the node iteration at node M−1 and move backwards. Alternatively, one may also try iterating among all nodes both ways and select the workload with the minimal cost between the two.
After stage 310, given the refined scene to segment allocation, in step 315, a new best overall assignment cost (and associated S2S assignment) may be computed. In an embodiment, to avoid a bad random initialization step 305, which may lead to sub-optimal allocation, steps 305-315 are repeated L times for L different random initialization steps 305 (e.g., by using a different random seed generator), each one resulting in an overall assignment cost(l), l=1, 2, . . . L (e.g., σf,lopt). Then, in step 315, one selects the assignment with the best overall cost (e.g., the smallest standard deviation σf,lopt). Experimental results showed that L=100 combined with the refined assignment step (310) yields satisfactory results and that larger values of L fail to significantly improve the overall S2S allocation strategy.
Thus, at l=1, σf,1opt simply represents the first refined assignment cost (that is, σf*=σf,1opt=σfopt), where σf* denotes the best overall assignment cost. At subsequent iterations, if σf*<σf,lopt, then this iteration is ignored; otherwise, the best assignment cost is updated (e.g., σf*=σf,lopt) and the corresponding workload for this iteration is considered the best scene-to-segment assignment.
Step 320 checks whether all L iterations are done. If so, then in step 325 the best scene-to-segment allocation, that is, the one with the best cost among all L iterations, is output; otherwise, the process repeats with another initial random assignment (305).
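The outer loop of steps 305 to 325 can be sketched as a best-of-L search. All names here are hypothetical. The cost is taken to be the population standard deviation of the per-node frame counts, which is an assumption about normalization, and the refinement stage is passed in as a callable (defaulting to no refinement) so the sketch stays self-contained. It assumes the number of nodes does not exceed the number of scenes.

```python
import random
import statistics


def frames_per_node(starts, scene_lengths):
    """Sum the scene lengths assigned to each node."""
    bounds = list(starts) + [len(scene_lengths)]
    return [sum(scene_lengths[bounds[m]:bounds[m + 1]]) for m in range(len(starts))]


def best_of_restarts(scene_lengths, num_nodes, num_restarts=100,
                     refine=lambda starts, lengths: starts, seed=0):
    """Repeat random initialization and refinement, keeping the lowest cost."""
    rng = random.Random(seed)
    best_starts, best_cost = None, None
    for _ in range(num_restarts):
        # Step 305: random first-scene indices; scene 0 always goes to node 0.
        picks = rng.sample(range(1, len(scene_lengths)), num_nodes - 1)
        starts = [0] + sorted(picks)
        # Step 310: optional refinement supplied by the caller.
        starts = refine(starts, scene_lengths)
        # Step 315: keep this assignment only if it beats the best one so far.
        cost = statistics.pstdev(frames_per_node(starts, scene_lengths))
        if best_cost is None or cost < best_cost:
            best_starts, best_cost = starts, cost
    return best_starts, best_cost


lengths = [12, 7, 30, 5, 18, 9, 22, 14, 6, 11]  # illustrative frame counts
print(best_of_restarts(lengths, num_nodes=3))
```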
To facilitate the discussion, one more variable ϕM(t)=K is added to indicate the end of the video sequence. Given a set of {ϕm(t)}, m=0, 1, . . . , M, one can compute the number of frames in each node at the t-th iteration as
In an embodiment, the uniformity of workload, or assignment cost, can be defined as the standard deviation of {fm(t)}, where
The lower the value of σf(t), the more uniformly the workload is distributed across the nodes. Example pseudocode for this refine-assignment stage is listed in Table 2.
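A minimal sketch of the assignment cost follows. The function names and the scene lengths are illustrative, and the use of the population standard deviation (dividing by M rather than M−1) is an assumption, since either normalization ranks assignments for a fixed M in the same order.

```python
import statistics


def node_frame_counts(first_scenes, scene_lengths):
    """Number of frames in each node, given each node's first scene index."""
    bounds = list(first_scenes) + [len(scene_lengths)]  # extra entry marks the end
    return [sum(scene_lengths[bounds[m]:bounds[m + 1]])
            for m in range(len(first_scenes))]


def assignment_cost(first_scenes, scene_lengths):
    """Standard deviation of the per-node frame counts (lower is more uniform)."""
    return statistics.pstdev(node_frame_counts(first_scenes, scene_lengths))


lengths = [10, 20, 30, 10, 20, 30, 10, 20, 30, 20]  # hypothetical scene lengths
print(node_frame_counts([0, 3, 8], lengths))
print(round(assignment_cost([0, 3, 8], lengths), 2))
```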
While in an embodiment, and without limitation, the standard deviation of frames being used provides a good cost metric for the refine-assignment stage, alternative cost metrics may also be applied, such as:
Returning to our example, Tables 3 and 4 depict the scene to segment allocation and corresponding S2S parameters after the random initialization stage. The overall cost, measured by the standard deviation among the values in {fm(0)}, can be computed as σf(0)=6.81.
Consider now an example of refined assignment (310). At the first iteration of this stage, where t=0, for the first node, m=0, three different strategies are tried, and the standard deviation, σf(0), for each case is measured. The results are depicted in Tables 5 and 6. As depicted in Table 5, under option A, node 0 is assigned only scenes 0 and 1, at a cost of 10.12, under option B, node 0 is assigned scenes 0-3, at a cost of 5.04, and under option C (no change from before) the cost remains the same (6.81). Thus, option B is selected as the best strategy to continue to refine the assignment of scenes at subsequent nodes, where the same process will be repeated.
At the end of t=0, in this example, the best S2S remained the one depicted in Table 6 with cost σfopt=5.04; that is:
Node 0: scenes 0-3
Node 1: scenes 4-7
Node 2: scenes 8-9
Next, at t=1, steps 350, 355, and 360 are repeated. In this example, at t=1 there is no improvement in the overall cost, thus the process will terminate.
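The refinement walked through in this example can be sketched as below. This is an interpretation inferred from the worked example, not the referenced pseudocode: for each boundary between adjacent nodes, it tries moving the boundary one scene earlier, one scene later, or leaving it unchanged, keeps the cheapest option, and stops when a full pass brings no improvement. All names, the population standard deviation as the cost, and the small tolerance used to ignore ties are assumptions.

```python
import statistics


def cost(starts, lengths):
    """Standard deviation of per-node frame counts."""
    bounds = list(starts) + [len(lengths)]
    loads = [sum(lengths[bounds[m]:bounds[m + 1]]) for m in range(len(starts))]
    return statistics.pstdev(loads)


def refine(starts, lengths, max_passes=100):
    """Greedy boundary refinement; returns (refined starts, cost)."""
    starts = list(starts)
    best = cost(starts, lengths)
    for _ in range(max_passes):
        improved = False
        for m in range(1, len(starts)):  # boundary between node m-1 and node m
            lo = starts[m - 1]
            hi = starts[m + 1] if m + 1 < len(starts) else len(lengths)
            # Shrink node m-1, keep it, or grow it; every node keeps >= 1 scene.
            options = [c for c in (starts[m] - 1, starts[m], starts[m] + 1)
                       if lo < c < hi]
            trials = [starts[:m] + [c] + starts[m + 1:] for c in options]
            winner = min(trials, key=lambda s: cost(s, lengths))
            if cost(winner, lengths) < best - 1e-12:
                starts, best, improved = winner, cost(winner, lengths), True
        if not improved:
            break  # a full pass without improvement ends the refinement
    return starts, best


print(refine([0, 3, 8], [10] * 10))  # ten equal-length scenes, three nodes
```

With ten equal-length scenes, the initial split of 3/5/2 scenes is refined to 4/3/3 scenes, the most even split possible for three nodes.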
In some embodiments, it may be preferred that all segments have the same number of frames. For such a scenario, one can assign the number of frames for the first M−1 nodes as
The remaining frames will be assigned to the last node (node M−1) as
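The fixed-length allocation can be sketched as follows. Because the per-node frame count is given by an expression not reproduced here, the sketch takes it as an input (`frames_per_node`, a hypothetical parameter name) rather than assuming a particular rounding rule; only the rule that the last node receives the remaining frames is taken from the text.

```python
def fixed_length_segments(total_frames, num_nodes, frames_per_node):
    """Return half-open (start, end) frame ranges, one per node.

    The first num_nodes - 1 nodes each get frames_per_node frames;
    the last node gets whatever remains.
    """
    used = frames_per_node * (num_nodes - 1)
    if used >= total_frames:
        raise ValueError("no frames left for the last node")
    segments = [(m * frames_per_node, (m + 1) * frames_per_node)
                for m in range(num_nodes - 1)]
    segments.append((used, total_frames))  # remainder goes to node M-1
    return segments


print(fixed_length_segments(100, 3, 33))
```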
Given a scene-to-segment allocation (230),
From
Scene-based generation of a forward reshaping function (405) consists of two levels of operation. First, statistics are collected for each frame. For example, for luma, one computes the histograms for both SDR (hjs (b)) and HDR (hjv (b)) frames and stores them in the frame buffer for the j-th frame, where b is the bin index. After generating the 3DMT representation for each frame, one generates an “a/B” matrix representation denoted as:
$$B_j^{F} = (S_j^{F})^{T} S_j^{F}, \qquad a_j^{F,ch} = (S_j^{F})^{T} v_j^{F,ch}, \tag{7}$$
where ch refers to a luma or chroma channel (e.g., Y, Cb, or Cr), (SjF)T denotes a transpose matrix based on the reference HDR scene data and a parametric model of the forward reshaping function, and vjF,ch denotes a vector based on the SDR scene data and the parametric model of the forward reshaping function.
Given the statistics of each frame within the current scene, one can apply a scene-level algorithm to compute the optimal forward reshaping coefficients. For example, for luma, one can generate scene-based histograms for SDR (hs(b)) and HDR data (hv (b)) by summing or averaging the frame-based histograms. For example, in an embodiment,
Having both scene-level histograms, one can apply cumulative density function (CDF) matching (Ref. [4-5]) to generate the forward mapping function (FLUT) from HDR to SDR, e.g.,
$$\tilde{T}^{F} = \mathrm{CDF\_MATCHING}\big(h^{v}(b),\, h^{s}(b)\big). \tag{9}$$
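A simplified illustration of CDF matching is sketched below. It is not the method of the cited references: it works directly on bin indices, performs no interpolation or smoothing, and the function name `cdf_matching` is only borrowed from equation (9). For each HDR bin it finds the smallest SDR bin whose cumulative probability is at least that of the HDR bin.

```python
from itertools import accumulate


def cdf_matching(hist_hdr, hist_sdr):
    """Map each HDR bin index to an SDR bin index with matching cumulative mass."""
    cum_v = list(accumulate(hist_hdr))  # unnormalized HDR CDF
    cum_s = list(accumulate(hist_sdr))  # unnormalized SDR CDF
    total_v, total_s = cum_v[-1], cum_s[-1]
    flut, s = [], 0
    for cv in cum_v:
        # Compare cum_s[s] / total_s < cv / total_v by cross-multiplying,
        # which stays exact for integer histogram counts.
        while s < len(cum_s) - 1 and cum_s[s] * total_v < cv * total_s:
            s += 1
        flut.append(s)
    return flut


# Four HDR bins mapped onto two SDR bins.
print(cdf_matching([1, 1, 1, 1], [2, 2]))
```

Because both cumulative sums are non-decreasing, the resulting mapping is monotonically non-decreasing, which is what allows a luma forward mapping of this kind to be inverted later.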
For chroma (e.g., ch=Cb or ch=Cr), one may again average over the a/B frame-based representations in equations (7) to generate a scene-based a/B matrix representation given by
and generate parameters for a multiple-color, multiple-regression (MMR) model of a reshaping function as (Ref. [2-3])
$$m^{F,ch} = (B^{F})^{-1} a^{F,ch}. \tag{11}$$
Then, the reshaped SDR signal (407) can be generated as:
$$\hat{v}_j^{F,ch} = B^{F} m^{F,ch}. \tag{12}$$
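The frame-level accumulation of equation (7) and the scene-level solution of equation (11) can be illustrated with a small pure-Python sketch. The function names are hypothetical, and the two-term design matrix (a constant and one input value) is a deliberately reduced stand-in for a full MMR expansion, which would also carry cross-channel product terms. The solver assumes the averaged matrix is non-singular.

```python
def frame_stats(S, v):
    """Return (B_j, a_j) = (S^T S, S^T v) for one frame."""
    n = len(S[0])
    B = [[sum(r[i] * r[j] for r in S) for j in range(n)] for i in range(n)]
    a = [sum(r[i] * x for r, x in zip(S, v)) for i in range(n)]
    return B, a


def solve(B, a):
    """Solve B m = a by Gaussian elimination with partial pivoting."""
    n = len(a)
    M = [list(map(float, row)) + [float(a[i])] for i, row in enumerate(B)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    m = [0.0] * n
    for i in reversed(range(n)):
        m[i] = (M[i][n] - sum(M[i][j] * m[j] for j in range(i + 1, n))) / M[i][i]
    return m


def scene_coefficients(frames):
    """Average the per-frame (B_j, a_j) over a scene, then solve for m."""
    stats = [frame_stats(S, v) for S, v in frames]
    n, count = len(stats[0][1]), len(stats)
    B = [[sum(Bj[i][j] for Bj, _ in stats) / count for j in range(n)] for i in range(n)]
    a = [sum(aj[i] for _, aj in stats) / count for i in range(n)]
    return solve(B, a)


# Two frames whose targets follow v = 0.5 + 2 * x exactly.
frames = [([[1, 0], [1, 1]], [0.5, 2.5]), ([[1, 2], [1, 3]], [4.5, 6.5])]
print(scene_coefficients(frames))
```

Averaging the per-frame matrices before solving means only the small B_j and a_j need to be kept per frame, not the frame data itself.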
Generating the scene-based backward reshaping function (410) also includes both frame-level and scene-level operations. Since the luma mapping function is a single-channel predictor, one can simply invert the forward reshaping function to obtain the backward reshaping function. For chroma, one forms a 3DMT representation using the reshaped SDR data (407) and the original HDR data (404) and computes a new frame-based a/B representation as:
$$B_j^{B} = (S_j^{B})^{T} S_j^{B}, \qquad a_j^{B,ch} = (S_j^{B})^{T} v_j^{B,ch}. \tag{13}$$
At the scene-level, for luma, one may apply the histogram-weighted BLUT construction in Ref. [3] to generate the backward luma reshaping function. For chroma, one can again average the frame-based a/B representation to compute a scene-based a/B representation
with an MMR model solution for the backward reshaping mapping function given by
$$m^{B,ch} = (B^{B})^{-1} a^{B,ch}. \tag{15}$$
Then, the reconstructed HDR signal (160) can be generated as:
$$\hat{v}_j^{B,ch} = B^{B} m^{B,ch}. \tag{16}$$
Each of these references is incorporated by reference in its entirety.
Example Computer System Implementation
Embodiments of the present invention may be implemented with a computer system, systems configured in electronic circuitry and components, an integrated circuit (IC) device such as a microcontroller, a field programmable gate array (FPGA), or another configurable or programmable logic device (PLD), a discrete time or digital signal processor (DSP), an application specific IC (ASIC), and/or apparatus that includes one or more of such systems, devices or components. The computer and/or IC may perform, control or execute instructions relating to workload allocation and node-based processing in cloud-based video coding of HDR video, such as those described herein. The computer and/or IC may compute any of a variety of parameters or values that relate to workload allocation and node-based processing in cloud-based video coding of HDR video as described herein. The image and video dynamic range extension embodiments may be implemented in hardware, software, firmware and various combinations thereof.
Certain implementations of the invention comprise computer processors which execute software instructions which cause the processors to perform a method of the invention. For example, one or more processors in a display, an encoder, a set top box, a transcoder or the like may implement methods for workload allocation and node-based processing in cloud-based video coding of HDR video as described above by executing software instructions in a program memory accessible to the processors. The invention may also be provided in the form of a program product. The program product may comprise any non-transitory and tangible medium which carries a set of computer-readable signals comprising instructions which, when executed by a data processor, cause the data processor to execute a method of the invention. Program products according to the invention may be in any of a wide variety of non-transitory and tangible forms. The program product may comprise, for example, physical media such as magnetic data storage media including floppy diskettes and hard disk drives, optical data storage media including CD ROMs and DVDs, electronic data storage media including ROMs and flash RAM, or the like. The computer-readable signals on the program product may optionally be compressed or encrypted.
Where a component (e.g. a software module, processor, assembly, device, circuit, etc.) is referred to above, unless otherwise indicated, reference to that component (including a reference to a “means”) should be interpreted as including as equivalents of that component any component which performs the function of the described component (e.g., that is functionally equivalent), including components which are not structurally equivalent to the disclosed structure which performs the function in the illustrated example embodiments of the invention.
Enumerated example embodiments (EEE) of the present invention are defined, without limitation, as follows:
Example embodiments that relate to workload allocation and node-based processing in cloud-based video coding of HDR video are thus described. In the foregoing specification, embodiments of the present invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and what is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Number | Date | Country | Kind |
---|---|---|---|
20184883.5 | Jul 2020 | EP | regional |
This application claims the benefit of priority from U.S. patent application No. 63/049,673 and European patent application EP20184883.5, both filed on 9 Jul. 2020, each of which is incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/040967 | 7/8/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63049673 | Jul 2020 | US |