Video analysis is used in many different types of applications. Some types of video analysis, such as video quality assessment (VQA), may quantify the perceptual quality of videos (as opposed to still images). Other types of video analysis may be used to classify aspects of a video, such as objects appearing therein. Such analysis can include analyzing imagery, including video imagery, using convolutional neural networks (CNNs). However, issues such as unstable or shaky cameras, camera lens flaws, varying resolutions and frame rates, and different algorithms and parameters for processing and compression may all adversely impact video quality assessments.
Various VQA methods focus on full reference (FR) scenarios, in which distorted videos are compared against their corresponding pristine references. In recent years, there has been an explosion of user generated content (UGC) videos on social media platforms. For many UGC videos, a high-quality pristine reference is inaccessible. Thus, no-reference (NR) VQA models can be used for ranking, recommending and optimizing UGC videos. Certain NR-VQA models leverage the power of machine learning to enhance results. Many known deep-learning approaches use CNNs to extract frozen frame-level features and then aggregate them in the temporal domain to predict the video quality. Since frozen frame-level features are not optimized for capturing temporal-spatial distortions, this can be insufficient to catch diverse spatial or temporal impairments in UGC videos. For example, transient glitches or frame drops may not produce large frame-level pixel differences, but they can greatly impact user-perceived video quality. Moreover, predicting UGC video quality often involves long-range spatial-temporal dependencies, such as fast-moving objects or rapid zoom-in views. Since convolutional kernels in CNNs are specifically designed for capturing short-range spatial-temporal information, they cannot capture dependencies that extend beyond the receptive field. This limits CNN models' ability to model complex spatial-temporal dependencies in UGC VQA tasks, and therefore such an approach may not effectively aggregate complex quality information in diverse UGC videos.
Aspects of the technology employ a no-reference VQA framework based on the Transformer architecture, which employs a multi-resolution input representation and a patch sampling mechanism to effectively aggregate information across different granularities in spatial and temporal dimensions. Unlike CNN models that are constrained by limited receptive fields, Transformers utilize the multi-head self-attention operation that allows the model to attend over all elements in the input sequence. As a result, Transformers can capture both local and global long-range dependencies by directly comparing video quality features at all spacetime locations. The approach discussed herein can achieve state-of-the-art performance on multiple UGC VQA datasets including LSVQ, LSVQ-1080p, KoNVID-1k and LIVE-VQC.
According to one aspect, a method for processing videos comprises grouping and rescaling, by one or more processors, neighboring input video frames of a single video into a pyramid of multi-resolution frames including both lower resolution frames and higher resolution frames; sampling, by the one or more processors, the pyramid of multi-resolution frames to obtain a set of patches; encoding, by the one or more processors, the set of patches as a set of multi-resolution input tokens; generating, by a spatial transformer encoder implemented by the one or more processors, a representation per frame group for a plurality of time steps; aggregating, by a temporal transformer encoder implemented by the one or more processors, across the plurality of time steps; and generating, by the one or more processors based on the aggregating, a quality score associated with a parameter of the video.
The method may further comprise prepending a set of classification tokens to the set of multi-resolution input tokens. Alternatively or additionally, generating the quality score may include applying aggregated output from the temporal transformer encoder to a multi-layer perceptron model. The quality score may be a mean opinion score. Alternatively or additionally to the above, the encoding may include capturing both global video composition from the lower resolution frame and local details from the higher resolution frames.
The grouping and rescaling may include: dividing the neighboring input video frames by a group of N; and proportionally resizing the group of N to N different resolutions preserving a same aspect ratio; wherein an i-th frame is resized to shorter-side length i×l, where l is a smallest length. Alternatively or additionally, sampling the pyramid of multi-resolution frames to obtain a set of patches may include aligning patch grid centers for each frame. Here, during model training, the method includes randomly choosing a center for each frame along a middle line for a longer-length side, and for inference, using the center of the video frames.
Alternatively or additionally, sampling the pyramid of multi-resolution frames to obtain a set of patches may include: from a first one of the neighboring input video frames, uniformly sampling grid patches to capture a complete global view; and for following ones of the neighboring input video frames, linearly sampling spaced-out patches to provide local details. Alternatively or additionally, a patch size P is the same for all of the multi-resolution frames in the pyramid. Here, for the i-th frame in the pyramid, the distance between patches is set to (i−1)×P. Alternatively or additionally, sampling the pyramid of multi-resolution frames to obtain a set of patches includes forming a tube of multi-resolution patches, the tube having the same center throughout the pyramid of multi-resolution frames.
According to another aspect, a video processing system comprises memory configured to store imagery and one or more processors operatively coupled to the memory. The one or more processors are configured to: group and rescale neighboring input video frames of a single video into a pyramid of multi-resolution frames including both lower resolution frames and higher resolution frames; sample the pyramid of multi-resolution frames to obtain a set of patches; encode the set of patches as a set of multi-resolution input tokens; generate, by a spatial transformer encoder implemented by the one or more processors, a representation per frame group for a plurality of time steps; aggregate, by a temporal transformer encoder implemented by the one or more processors, across the plurality of time steps; and generate, based on the aggregating, a quality score associated with a parameter of the video.
The one or more processors may be further configured to prepend a set of classification tokens to the set of multi-resolution input tokens. Alternatively or additionally, generation of the quality score includes applying aggregated output from the temporal transformer encoder to a multi-layer perceptron model. Alternatively or additionally, encoding the set of patches includes capturing both global video composition from the lower resolution frame and local details from the higher resolution frames. Alternatively or additionally, grouping and rescaling neighboring input video frames includes: division of the neighboring input video frames by a group of N; and proportionally resizing the group of N to N different resolutions preserving a same aspect ratio; wherein an i-th frame is resized to shorter-side length i×l, where l is a smallest length.
Alternatively or additionally, the one or more processors may be configured to sample the pyramid of multi-resolution frames to obtain a set of patches by alignment of patch grid centers for each frame. Here, during model training, the one or more processors are configured to randomly choose a center for each frame along a middle line for a longer-length side, and for inference the one or more processors are configured to use the center of the video frames.
Alternatively or additionally, the one or more processors may be configured to sample the pyramid of multi-resolution frames to obtain a set of patches as follows: from a first one of the neighboring input video frames, uniformly sample grid patches to capture a complete global view; and for following ones of the neighboring input video frames, linearly sample spaced-out patches to provide local details. The one or more processors may be configured to sample the pyramid of multi-resolution frames to obtain a set of patches by formation of a tube of multi-resolution patches, the tube having the same center throughout the pyramid of multi-resolution frames. A patch size P may be the same for all of the multi-resolution frames in the pyramid.
Alternatively or additionally, the one or more processors may be further configured to assign quality scores to different videos in order to prioritize the different videos for serving.
Aspects of the technology employ a Transformer-type approach on VQA tasks in order to effectively model the complex spacetime distortions common in UGC videos.
As presented in
The segments 104 for each frame 102 are then arranged into a multi-resolution video representation 106, as shown in matrix form on the right side of
As shown in example 120 of
Integrating both global and local video perception can result in a more accurate and comprehensive VQA model. Other vision or image-dependent tasks can also benefit from multi-scale features. This can include video classification, video recommendation, video search, etc. As a result, various systems and applications can be enhanced to address video quality issues. For example, the quality metrics could be used to rank videos according to their quality scores in order to prioritize serving high-quality videos. Alternatively or additionally, the quality scores can be used to filter out poor-quality videos. For video curation, video post-processing operations can be chosen that lead to a high quality score.
Human-perceived video quality is affected by both the global video composition (e.g., content, video structure and smoothness) and local details (e.g., texture and distortion artifacts). It is hard to capture both global and local quality information when using fixed-resolution inputs. Although downsampled video frames provide the global view and are easier to process for deep-learning models, some distortions visible in the original high-resolution videos may disappear when resized to a lower resolution. View 160 of
Incorporating multi-resolution views into the model as discussed herein enables the Transformer's self-attention mechanism to capture diverse quality information on both fine-grained local details and coarse-grained global views. The result is a multi-resolution Transformer suitable for video quality assessment (MRET), as well as other types of vision/image-dependent tasks. This includes a multi-resolution video frame representation and the corresponding multi-resolution patch sampling mechanism that enable the Transformer to capture multi-scale quality information in diverse UGC videos. As discussed further herein, MRET was tested on a number of large-scale UGC VQA datasets. Results show it outperforming conventional approaches by large margins on LSVQ and LSVQ-1080p. It was also able to achieve state-of-the-art performance on KoNVID-1k and LIVE-VQC without fine-tuning, thereby demonstrating its robustness and generalization capability.
The MRET employs a self-attention architecture, e.g., the Transformer neural network encoder-decoder architecture. An exemplary general Transformer-type architecture is shown in
System 200 of
System 200 can perform any of a variety of tasks that require processing sequential inputs to generate sequential outputs. System 200 includes an attention-based sequence transduction neural network 206, which in turn includes an encoder neural network 208 and a decoder neural network 210. The encoder neural network 208 is configured to receive the input sequence 202 and generate a respective encoded representation of each of the network inputs in the input sequence. An encoded representation is a vector or other ordered collection of numeric values. The decoder neural network 210 is then configured to use the encoded representations of the network inputs to generate the output sequence 204. Generally, both the encoder 208 and the decoder 210 are attention-based. In some cases, neither the encoder nor the decoder includes any convolutional layers or any recurrent layers. The encoder neural network 208 includes an embedding layer (input embedding) 212 and a sequence of one or more encoder subnetworks 214. The encoder neural network 208 may include N encoder subnetworks 214.
The embedding layer 212 is configured, for each network input in the input sequence, to map the network input to a numeric representation of the network input in an embedding space, e.g., into a vector in the embedding space. The embedding layer 212 then provides the numeric representations of the network inputs to the first subnetwork in the sequence of encoder subnetworks 214. The embedding layer 212 may be configured to map each network input to an embedded representation of the network input and then combine, e.g., sum or average, the embedded representation of the network input with a positional embedding of the input position of the network input in the input order to generate a combined embedded representation of the network input. In some cases, the positional embeddings are learned. As used herein, “learned” means that an operation or a value has been adjusted during the training of the sequence transduction neural network 206. In other cases, the positional embeddings may be fixed and are different for each position.
The combined embedded representation is then used as the numeric representation of the network input. Each of the encoder subnetworks 214 is configured to receive a respective encoder subnetwork input for each of the plurality of input positions and to generate a respective subnetwork output for each of the plurality of input positions. The encoder subnetwork outputs generated by the last encoder subnetwork in the sequence are then used as the encoded representations of the network inputs. For the first encoder subnetwork in the sequence, the encoder subnetwork input is the numeric representations generated by the embedding layer 212, and, for each encoder subnetwork other than the first encoder subnetwork in the sequence, the encoder subnetwork input is the encoder subnetwork output of the preceding encoder subnetwork in the sequence.
Each encoder subnetwork 214 includes an encoder self-attention sub-layer 216. The encoder self-attention sub-layer 216 is configured to receive the subnetwork input for each of the plurality of input positions and, for each particular input position in the input order, apply an attention mechanism over the encoder subnetwork inputs at the input positions using one or more queries derived from the encoder subnetwork input at the particular input position to generate a respective output for the particular input position. In some cases, the attention mechanism is a multi-head attention mechanism as shown. In some implementations, each of the encoder subnetworks 214 may also include a residual connection layer that combines the outputs of the encoder self-attention sub-layer with the inputs to the encoder self-attention sub-layer to generate an encoder self-attention residual output and a layer normalization layer that applies layer normalization to the encoder self-attention residual output. These two layers are collectively referred to as an “Add & Norm” operation in
Some or all of the encoder subnetworks can also include a position-wise feed-forward layer 218 that is configured to operate on each position in the input sequence separately. In particular, for each input position, the feed-forward layer 218 is configured to receive an input at the input position and apply a sequence of transformations to the input at the input position to generate an output for the input position. The inputs received by the position-wise feed-forward layer 218 can be the outputs of the layer normalization layer when the residual and layer normalization layers are included or the outputs of the encoder self-attention sub-layer 216 when the residual and layer normalization layers are not included. The transformations applied by the layer 218 will generally be the same for each input position (but different feed-forward layers in different subnetworks may apply different transformations).
In cases where an encoder subnetwork 214 includes a position-wise feed-forward layer 218 as shown, the encoder subnetwork can also include a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate an encoder position-wise residual output and a layer normalization layer that applies layer normalization to the encoder position-wise residual output. As noted above, these two layers are also collectively referred to as an “Add & Norm” operation. The outputs of this layer normalization layer can then be used as the outputs of the encoder subnetwork 214.
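For illustration, the encoder subnetwork structure described above (multi-head self-attention followed by an "Add & Norm" operation, then a position-wise feed-forward layer followed by another "Add & Norm" operation) may be sketched as follows. This is a minimal sketch; the dimensions, activation and dropout values are assumptions rather than values taken from the description.

```python
# Minimal sketch of one encoder subnetwork: multi-head self-attention + "Add & Norm",
# then a position-wise feed-forward layer + "Add & Norm" (dimensions are assumed).
import torch
import torch.nn as nn

class EncoderSubnetwork(nn.Module):
    def __init__(self, d_model=768, num_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Position-wise feed-forward: the same transformation applied at every position.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # queries, keys, values from the same input
        x = self.norm1(x + attn_out)           # "Add & Norm" after self-attention
        x = self.norm2(x + self.ffn(x))        # "Add & Norm" after the feed-forward layer
        return x
```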
Once the encoder neural network 208 has generated the encoded representations, the decoder neural network 210 is configured to generate the output sequence in an auto-regressive manner. That is, the decoder neural network 210 generates the output sequence, by at each of a plurality of generation time steps, generating a network output for a corresponding output position conditioned on (i) the encoded representations and (ii) network outputs at output positions preceding the output position in the output order. In particular, for a given output position, the decoder neural network generates an output that defines a probability distribution over possible network outputs at the given output position. The decoder neural network can then select a network output for the output position by sampling from the probability distribution or by selecting the network output with the highest probability.
Because the decoder neural network 210 is auto-regressive, at each generation time step, the decoder network 210 operates on the network outputs that have already been generated before the generation time step, i.e., the network outputs at output positions preceding the corresponding output position in the output order. In some implementations, to ensure this is the case during both inference and training, at each generation time step the decoder neural network 210 shifts the already generated network outputs right by one output order position (i.e., introduces a one position offset into the already generated network output sequence) and (as will be described in more detail below) masks certain operations so that positions can only attend to positions up to and including that position in the output sequence (and not subsequent positions). While the remainder of the description below describes that, when generating a given output at a given output position, various components of the decoder 210 operate on data at output positions preceding the given output positions (and not on data at any other output positions), it will be understood that this type of conditioning can be effectively implemented using shifting.
The decoder neural network 210 includes an embedding layer (output embedding) 220, a sequence of decoder subnetworks 222, a linear layer 224, and a softmax layer 226. In particular, the decoder neural network can include N decoder subnetworks 222. However, while the example of
In some implementations, the embedding layer 220 is configured to map each network output to an embedded representation of the network output and combine the embedded representation of the network output with a positional embedding of the output position of the network output in the output order to generate a combined embedded representation of the network output. The combined embedded representation is then used as the numeric representation of the network output. The embedding layer 220 generates the combined embedded representation in the same manner as described above with reference to the embedding layer 212.
Each decoder subnetwork 222 is configured to, at each generation time step, receive a respective decoder subnetwork input for each of the plurality of output positions preceding the corresponding output position and to generate a respective decoder subnetwork output for each of the plurality of output positions preceding the corresponding output position (or equivalently, when the output sequence has been shifted right, each network output at a position up to and including the current output position). In particular, each decoder subnetwork 222 includes two different attention sub-layers: a decoder self-attention sub-layer 228 and an encoder-decoder attention sub-layer 230. Each decoder self-attention sub-layer 228 is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position and, for each of the particular output positions, apply an attention mechanism over the inputs at the output positions preceding the corresponding position using one or more queries derived from the input at the particular output position to generate an updated representation for the particular output position. That is, the decoder self-attention sub-layer 228 applies an attention mechanism that is masked so that it does not attend over or otherwise process any data that is not at a position preceding the current output position in the output sequence.
Each encoder-decoder attention sub-layer 230, on the other hand, is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position and, for each of the output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input for the output position to generate an updated representation for the output position. Thus, the encoder-decoder attention sub-layer 230 applies attention over encoded representations while the decoder self-attention sub-layer 228 applies attention over inputs at output positions.
In the example of
Some or all of the decoder subnetworks 222 also include a position-wise feed-forward layer 232 that is configured to operate in a similar manner as the position-wise feed-forward layer 218 from the encoder 208. In particular, the layer 232 is configured to, at each generation time step and for each output position preceding the corresponding output position, receive an input at the output position and apply a sequence of transformations to the input at the output position to generate an output for the output position. The inputs received by the position-wise feed-forward layer 232 can be the outputs of the layer normalization layer (following the last attention sub-layer in the subnetwork 222) when the residual and layer normalization layers are included or the outputs of the last attention sub-layer in the subnetwork 222 when the residual and layer normalization layers are not included. In cases where a decoder subnetwork 222 includes a position-wise feed-forward layer 232, the decoder subnetwork can also include a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate a decoder position-wise residual output and a layer normalization layer that applies layer normalization to the decoder position-wise residual output. These two layers are also collectively referred to as an "Add & Norm" operation. The outputs of this layer normalization layer can then be used as the outputs of the decoder subnetwork 222.
At each generation time step, the linear layer 224 applies a learned linear transformation to the output of the last decoder subnetwork 222 in order to project the output of the last decoder subnetwork 222 into the appropriate space for processing by the softmax layer 226. The softmax layer 226 then applies a softmax function over the outputs of the linear layer 224 to generate the probability distribution (output probabilities) 234 over the possible network outputs at the generation time step. The decoder 210 can then select a network output from the possible network outputs using the probability distribution.
UGC video quality is highly diverse since UGC videos are captured under very different conditions, such as unstable or shaky cameras, imperfect camera lenses, varying resolutions and frame rates, and different algorithms and parameters for processing and compression. As a result, UGC videos usually contain a mixture of spatial and temporal distortions. Moreover, the way viewers perceive content and distortions may also impact the final perceptual quality of the video. Sometimes transient distortions such as sudden glitches and defocusing can significantly affect the overall perceived quality, which makes the problem even more complicated. Due to the diversity of spatial-temporal distortions and the complexity of human perception, quality assessment for UGC videos should take into account both global video composition and local details. The MRET architecture captures video quality at different granularities. According to one aspect for VQA, the architecture embeds video clips as multi-resolution patch tokens.
MRET comprises two main parts, namely a multi-resolution video embedding module and a space-time factorized Transformer encoding module. According to one aspect, MRET need not employ a decoder such as the decoder neural network 210 discussed above. The multi-resolution video embedding module is configured to encode the multi-scale quality information in the video, capturing both global video composition from the lower-resolution frames and local details from the higher-resolution frames. The space-time factorized Transformer encoding module aggregates the spatial and temporal quality information from the multi-scale embedding input.
UGC videos are produced from highly diverse conditions and therefore may contain complex distortions and diverse resolutions. In order to capture both global and local quality information, the input video is transformed into groups of multi-resolution frames. Multiple neighboring input frames are grouped together and rescaled into a pyramid of low-resolution and high-resolution frames (e.g., 304 in
To obtain the multi-resolution video representation, the system first divides the input frames into groups of N. The N frames in each group are proportionally resized to N different resolutions preserving the same aspect ratio. The i-th frame is resized to shorter-side length i×l accordingly, where l is the smallest length. This results in a pyramid of rescaled frames with shorter-side lengths 1×l, 2×l, …, N×l, as shown in view 400 of
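As an illustration of this grouping and rescaling, the following is a minimal sketch assuming frames are provided as H×W×3 uint8 NumPy arrays; the helper name and the use of PIL for bilinear resizing are implementation choices, not taken from the description.

```python
# Minimal sketch: group consecutive frames by N and rescale the i-th frame of each
# group (i = 1..N) to shorter-side length i*l, preserving the aspect ratio.
import numpy as np
from PIL import Image

def build_multires_pyramid(frames, N=4, l=224):
    groups = []
    for g in range(0, len(frames) - len(frames) % N, N):
        pyramid = []
        for i in range(1, N + 1):
            frame = frames[g + i - 1]
            h, w = frame.shape[:2]
            scale = (i * l) / min(h, w)                              # shorter side becomes i*l
            new_wh = (int(round(w * scale)), int(round(h * scale)))  # PIL expects (W, H)
            resized = Image.fromarray(frame).resize(new_wh, Image.BILINEAR)
            pyramid.append(np.asarray(resized))
        groups.append(pyramid)     # shorter-side lengths: 1*l, 2*l, ..., N*l
    return groups
```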
Intuitively, the lower-resolution frames in the multi-resolution pyramid provide global views of the video composition, and the higher-resolution ones contain local details. These views are complementary to each other. Low-resolution frames can be processed efficiently, while processing the high-resolution frames in their entirety can be computationally expensive. Therefore, the system relies on low-resolution frames for a global view and performs patch sampling on high-resolution frames to provide the local details. Moreover, the Transformer should be provided with spatially aligned global and local views to allow it to better aggregate multi-scale information across locations. To achieve this, spatially aligned grids of patches may be sampled from the grouped multi-resolution frames.
As a result, a “tube” of multi-resolution patches is formed, which is shown in the multi-resolution video frames embedding view 500 of
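The aligned sampling described above may be sketched as follows, assuming the pyramid produced by the previous sketch. The patch size P stays fixed while the i-th frame is sampled with stride i×P, so adjacent patches are spaced (i−1)×P apart and the grids remain spatially aligned around the frame center, forming tubes of patches that share the same center; the grid size and default center handling here are illustrative assumptions.

```python
# Minimal sketch of aligned multi-resolution patch sampling for one frame group.
import numpy as np

def sample_aligned_patches(pyramid, grid=14, P=16):
    """pyramid: list of N frames, the i-th resized to shorter side i*l (i = 1..N).
    Returns an array of shape (N, grid*grid, P, P, 3)."""
    tubes = []
    for i, frame in enumerate(pyramid, start=1):
        h, w = frame.shape[:2]
        stride = i * P                       # adjacent patches are (i-1)*P apart
        span = (grid - 1) * stride + P       # extent covered by the sampled grid
        # Inference uses the frame center; training may jitter the center along
        # the longer side, as described in the text above.
        y0 = max(0, h // 2 - span // 2)
        x0 = max(0, w // 2 - span // 2)
        patches = [frame[y0 + r * stride: y0 + r * stride + P,
                         x0 + c * stride: x0 + c * stride + P]
                   for r in range(grid) for c in range(grid)]
        tubes.append(np.stack(patches))
    return np.stack(tubes)                   # (N, grid*grid, P, P, 3)
```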
In particular, the system linearly projects each tube of multi-resolution patches xi to a 1D token zi ∈ ℝ^d using a learned matrix E. Here, d is the dimension of the Transformer input tokens. This can be implemented using a 3D convolution with kernel size N×P×P. Each embedded token contains multi-resolution patches at the same location, allowing the model to utilize both global and local spatial quality information. Moreover, the multi-scale patches also fuse local spatial-temporal information together during tokenization. Therefore, it provides a comprehensive representation for the input video.
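A minimal sketch of this tokenization step follows, assuming the patch layout produced above and d=768. Implementing the projection as a 3D convolution with kernel size N×P×P follows the description, while the reassembly of each pyramid level's patches into a contiguous canvas is one possible realization, not necessarily the published one.

```python
# Minimal sketch: project each tube of N multi-resolution P x P patches to a
# d-dimensional token via a 3D convolution with kernel (and stride) N x P x P.
import torch
import torch.nn as nn

N, P, grid, d = 4, 16, 14, 768
tokenizer = nn.Conv3d(in_channels=3, out_channels=d,
                      kernel_size=(N, P, P), stride=(N, P, P))

def tokenize(patches: torch.Tensor) -> torch.Tensor:
    """patches: float tensor of shape (batch, N, grid*grid, P, P, 3)."""
    b = patches.shape[0]
    # Reassemble each level's sampled patches into a (grid*P, grid*P) canvas so
    # that one convolution step covers exactly one tube of aligned patches.
    canvas = patches.reshape(b, N, grid, grid, P, P, 3)
    canvas = canvas.permute(0, 6, 1, 2, 4, 3, 5).reshape(b, 3, N, grid * P, grid * P)
    tokens = tokenizer(canvas)                 # (b, d, 1, grid, grid)
    return tokens.flatten(2).transpose(1, 2)   # (b, grid*grid, d): M tokens of dim d
```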
As shown in
The spatial Transformer encoder 324 aggregates the multi-resolution patches extracted from the entire frame group into a representation hk ∈ ℝ^d at its time step, where k=1, …, T is the temporal index of the frame group and T is the number of frame groups. As mentioned above, the multi-resolution patches xi from each frame group are projected to a sequence of multi-resolution tokens zi ∈ ℝ^d, i=1, …, M, using the learnable matrix E, where M is the total number of patches. The system may prepend an extra learnable "classification token" zcls ∈ ℝ^d and use its representation at the final encoder layer as the final spatial representation for the frame group. A learnable spatial positional embedding p ∈ ℝ^(M×d) may be added element-wise to the input tokens zi to encode spatial positional information. The tokens are then passed through a Transformer encoder with L layers. Each layer q consists of multi-head self-attention (MSA), layer normalization (LN), and MLP blocks. The spatial Transformer encoder can therefore be formulated as follows.
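Assuming the standard pre-norm ordering of the MSA, LN, and MLP blocks named above (an assumption of this sketch; the exact published form may differ), the per-layer update for q = 0, …, L−1 can be written as:

```latex
\begin{aligned}
\mathbf{y}^{q}   &= \mathrm{MSA}\!\left(\mathrm{LN}(\mathbf{z}^{q})\right) + \mathbf{z}^{q},\\
\mathbf{z}^{q+1} &= \mathrm{MLP}\!\left(\mathrm{LN}(\mathbf{y}^{q})\right) + \mathbf{y}^{q},
\qquad q = 0, \dots, L-1,
\end{aligned}
```

where z^0 = [zcls; z1 + p1; …; zM + pM] is the positionally embedded input sequence and the frame-group representation hk is taken from the zcls position of z^L.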
The temporal Transformer encoder models the interactions between tokens from different time steps. After obtaining the frame group level representations hk, k=1, …, T, at each temporal index, the system prepends an hcls ∈ ℝ^d token. A separate learnable temporal positional embedding ∈ ℝ^(T×d) is also added. The output tokens are then fed to the temporal Transformer encoder 326. The output representation at the hcls token is used as the final representation for the whole video.
To predict the final quality score 320, MLP layer 328 is added on top of the final representation output at the hcls token position of the temporal encoder 326. The output of the MLP layer 328 can be regressed to the video mean opinion score (MOS) label associated with each video in VQA datasets. In one scenario, the model is trained end-to-end with L2 loss.
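The factorized encoding and regression head described above may be sketched as follows. The layer counts, hidden sizes, and the use of standard PyTorch Transformer encoder layers are assumptions for illustration rather than the published configuration.

```python
# Minimal sketch: spatial encoder -> one h_k per frame group, temporal encoder over
# the T group representations, and an MLP regressing the h_cls output to a MOS score.
import torch
import torch.nn as nn

class MRETHead(nn.Module):
    def __init__(self, d=768, T=32, M=196, spatial_layers=12, temporal_layers=4, heads=12):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d, heads, dim_feedforward=4 * d,
                                                   batch_first=True, norm_first=True)
        self.spatial = nn.TransformerEncoder(layer(), spatial_layers)
        self.temporal = nn.TransformerEncoder(layer(), temporal_layers)
        self.z_cls = nn.Parameter(torch.zeros(1, 1, d))        # spatial classification token
        self.h_cls = nn.Parameter(torch.zeros(1, 1, d))        # temporal classification token
        self.pos_spatial = nn.Parameter(torch.zeros(1, M, d))  # spatial positional embedding
        self.pos_temporal = nn.Parameter(torch.zeros(1, T, d)) # temporal positional embedding
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, 1))

    def forward(self, tokens):            # tokens: (batch, T, M, d) multi-resolution tokens
        b, t, m, d = tokens.shape
        x = tokens.reshape(b * t, m, d) + self.pos_spatial
        x = torch.cat([self.z_cls.expand(b * t, -1, -1), x], dim=1)
        h = self.spatial(x)[:, 0].reshape(b, t, d)             # h_k per frame group
        h = torch.cat([self.h_cls.expand(b, -1, -1), h + self.pos_temporal], dim=1)
        out = self.temporal(h)[:, 0]                           # representation at h_cls
        return self.mlp(out).squeeze(-1)                       # predicted quality score

# End-to-end training with L2 loss against MOS labels:
#   loss = nn.MSELoss()(model(tokens), mos_labels)
```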
Initialization from Pretrained Models
Available video quality datasets may be several orders of magnitude smaller than large-scale image classification datasets, such as ILSVRC-2012 ImageNet and ImageNet-21k. Given this, training Transformer models from scratch using VQA datasets can be extremely challenging and impractical. Therefore, according to one aspect, the Transformer backbone may be initialized from pretrained image models.
Unlike 3D video input, image Transformer models only need 2D projection for the input data. To initialize the 3D convolutional filter E from the 2D filters Eimage in pretrained image models, a "central frame initialization strategy" can be adopted. In short, E is initialized with zeros along all temporal positions, except at the center └N/2┘. The initialization of E from the pretrained image model can therefore be formulated as follows.
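A reconstruction of this central-frame initialization, assuming Et denotes the slice of the 3D kernel E at temporal position t and Eimage the pretrained 2D patch-embedding filters (a sketch; the notation may differ from the published form):

```latex
E_{t} =
\begin{cases}
E_{\mathrm{image}}, & t = \lfloor N/2 \rfloor,\\
\mathbf{0},         & \text{otherwise},
\end{cases}
\qquad t = 0, \dots, N-1,
```

so that at initialization each token depends only on the central frame of its tube, matching the behavior of the pretrained 2D model.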
As shown in the test results below, experiments were run on four UGC VQA datasets, including LSVQ, LSVQ-1080p, KoNVID-1k, and LIVE-VQC. LSVQ (excluding LSVQ-1080p) included 38,811 UGC videos and 116,433 space-time localized video patches. The original and patch videos were all annotated with MOS scores in [0.0, 100.0], and it contained videos of diverse resolutions. The LSVQ-1080p set contained 3,573 videos with 1080p resolution or higher. Since in one scenario the model does not make a distinction between original videos and video patches, all the 28.1k videos and 84.3k video patches from the LSVQ training split were used to train the model. The model was evaluated on full-size videos from the testing splits of LSVQ and LSVQ-1080p. KoNVID-1k contained 1,200 videos with MOS scores in [0.0, 5.0] and 960p fixed resolution. LIVE-VQC contained 585 videos with MOS scores in [0.0, 100.0] and video resolution from 240p to 1080p. KoNVID-1k and LIVE-VQC were used for evaluating the generalization ability of the model without fine-tuning. Since no training was involved in that testing, the entire dataset was used for evaluation.
For testing, the number of multi-resolution frames in each group was set to N=4. The shorter-side length l was set to 224 for the first (smallest) frame in the frame group. Correspondingly, the following three frames were rescaled with shorter-side (pixel) lengths 448, 672, and 896. Patch size P=16 was used when generating the multi-resolution frame patches. For each frame, a 14×14 grid of patches was sampled. Unless otherwise specified herein, the input to the network was a video clip of 128 frames uniformly sampled from the video.
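For concreteness, under these settings the token counts implied by the description work out as follows (each tube fuses the N=4 frames at one grid location into a single token):

```latex
M = 14 \times 14 = 196 \ \text{tube tokens per frame group},
\qquad
T = 128 / 4 = 32 \ \text{frame groups per clip},
```

so the spatial Transformer encoder processes 196 tokens (plus the classification token) per frame group, and the temporal Transformer encoder processes 32 tokens (plus its classification token) per clip.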
The hidden dimension for Transformer input tokens for testing was set to d=768. For the spatial Transformer encoder (e.g., 324 in
The models were trained with a synchronous SGD momentum optimizer, a cosine-decay learning rate schedule starting from 0.3, and a batch size of 256 for 10 epochs in total. For testing, all models were trained on tensor processing unit version 3 (TPUv3) hardware. Spearman rank-order correlation (SRCC) and Pearson linear correlation (PLCC) were reported as performance metrics.
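A minimal sketch of the training configuration and evaluation metrics named above; the momentum value, the placeholder model, and the use of scipy for the correlation metrics are assumptions rather than details from the description.

```python
# Minimal sketch of the optimizer/schedule and the SRCC/PLCC metrics.
import torch
from scipy.stats import spearmanr, pearsonr

model = torch.nn.Linear(768, 1)   # placeholder; substitute the MRET model here
optimizer = torch.optim.SGD(model.parameters(), lr=0.3, momentum=0.9)  # momentum value assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)  # cosine decay over 10 epochs

def vqa_metrics(predicted, mos):
    """Spearman rank-order (SRCC) and Pearson linear (PLCC) correlations."""
    srcc, _ = spearmanr(predicted, mos)
    plcc, _ = pearsonr(predicted, mos)
    return srcc, plcc
```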
Table 1 in
To verify the generalization capability of the MRET model, a cross-dataset evaluation was conducted in which the model was trained using LSVQ training set and separately evaluated on LIVE-VQC and KoNVID-1k without fine-tuning. As shown in Table 2 of
To understand how MRET aggregates spatial-temporal information to predict the final video quality, one can visualize the attention weights on spatial and temporal tokens using Attention Rollout, as explained by Abnar and Zuidema in “Quantifying attention flow in transformers”, 2020, the disclosure of which is incorporated by reference herein. In short, the attention weights of the Transformer are averaged across all heads and then the weight matrices of all layers are recursively multiplied.
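A minimal sketch of the Attention Rollout computation summarized above; the identity term accounting for residual connections follows Abnar and Zuidema's formulation.

```python
# Minimal sketch of Attention Rollout: average attention weights over heads per layer,
# add an identity term for the residual connections, re-normalize, and recursively
# multiply the per-layer matrices.
import numpy as np

def attention_rollout(attentions):
    """attentions: list of L arrays, each of shape (heads, tokens, tokens)."""
    rollout = np.eye(attentions[0].shape[-1])
    for attn in attentions:
        a = attn.mean(axis=0)                     # average across heads
        a = a + np.eye(a.shape[-1])               # identity term for residual connections
        a = a / a.sum(axis=-1, keepdims=True)     # re-normalize rows
        rollout = a @ rollout                     # recursive multiplication across layers
    return rollout                                # (tokens, tokens) aggregated attention
```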
As shown by temporal attention for the duck video in
To verify the effectiveness of the multi-resolution input representation, an ablation was run without the multi-resolution input. The comparison result is shown in Table 3 of
In this test, the MRET frames were resized to shorter-side lengths of [224, 448, 672, 896]. For the method "w/o Multiresolution", all frames in the frame group were resized to the same shorter-side length of 224. The GFLOPs are the same for both models because the patch size and number of patches are the same. As shown in the table, the multi-resolution frame input brings a 1-2% boost in SRCC on LSVQ and a 2-3% boost in SRCC on LSVQ-1080p. The gain is larger on LSVQ-1080p because that dataset contains more high-resolution videos, and therefore more quality information is lost when frames are statically resized to a small resolution.
Armed with the multi-resolution input representation, MRET is able to utilize both global information from lower-resolution frames and detailed information from higher-resolution frames. The results demonstrate that the proposed multi-resolution representation is indeed effective for capturing the complex multi-scale quality information that can be lost when using statically resized frames. Table 3 also shows that both models' performance improves as the number of input frames increases, since more temporal information is preserved.
In Table 4 of
This further verifies the validity of the multi-resolution input structure of the MRET arrangement. For multi-resolution input, the performance improved when increasing N from 2 to 5, but the gain became smaller as N grew larger. There is also a trade-off between obtaining higher-resolution views and the loss of spatial and temporal information as N increases, since the area ratio of sampled patches becomes smaller as resolution increases. Overall, N=4 was found to be a good balance between performance and complexity.
Compared to CNNs, Transformers impose less restrictive inductive biases, which broadens their representation ability. However, since the basic Transformer architecture lacks the inductive biases of the 2D image structure, it generally needs large datasets for pretraining to learn the inductive priors. In Table 5 of
Table 6 of
The models may be trained on one or more tensor processing units (TPUs), CPUs or other computing architectures in order to implement a multi-resolution Transformer for video imagery in accordance with the features disclosed herein. One example computing architecture is shown in
As shown in
The processors may be any conventional processors, such as commercially available CPUs, TPUs, graphical processing units (GPUs), etc. Alternatively, each processor may be a dedicated device such as an ASIC or other hardware-based processor. Although
The input data, such as one or more videos or sets of videos, may be operated on by a multi-resolution Transformer module to generate one or more multi-resolution video frame representations, video quality assessment data, etc. The client devices may utilize such information in various apps or other programs to perform video quality assessment or other metric analysis, video recommendations, video classification, video search, etc. This could include assigning quality scores to different videos based upon the results of MRET processing, for instance to distinguish videos that have a high level of artifacts from other videos with a lower level of artifacts. Therefore, quality scores can be used to rank videos and prioritize good videos in serving, such as for a video streaming service. Video processing or editing operations may also be selected that lead to a better quality score.
The computing devices may include all of the components normally used in connection with a computing device, such as the processor and memory described above, as well as a user interface subsystem for receiving input from a user and presenting information to the user (e.g., text, imagery and/or other graphical elements). The user interface subsystem may include one or more user inputs (e.g., at least one front (user) facing camera, a mouse, keyboard, touch screen and/or microphone) and one or more display devices (e.g., a monitor having a screen or any other electrical device that is operable to display information such as text, imagery and/or other graphical elements). Other output devices, such as speaker(s), may also provide information to users.
The user-related computing devices (e.g., 1312-1314) may communicate with a back-end computing system (e.g., server 1302) via one or more networks, such as network 1310. The network 1310, and intervening nodes, may include various configurations and protocols including short range communication protocols such as Bluetooth™, Bluetooth LE™, the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, WiFi and HTTP, and various combinations of the foregoing. Such communication may be facilitated by any device capable of transmitting data to and from other computing devices, such as modems and wireless interfaces.
In one example, computing device 1302 may include one or more server computing devices having a plurality of computing devices, e.g., a load balanced server farm or cloud computing system, that exchange information with different nodes of a network for the purpose of receiving, processing and transmitting the data to and from other computing devices. For instance, computing device 1302 may include one or more server computing devices that are capable of communicating with any of the computing devices 1312-1314 via the network 1310.
Video quality assessment information or other data derived from the multi-resolution Transformer module(s), the module(s) itself, multi-resolution video frames or other representations, or the like may be shared by the server with one or more of the client computing devices. Alternatively or additionally, the client device(s) may maintain their own databases, MRET modules, etc.
As explained above, a multi-resolution Transformer (MRET) is provided for video quality assessment and other video-related applications and services. The MRET integrates multi-resolution views to capture both global and local quality information. By transforming the input frames to a multi-resolution representation with both low and high resolution frames, the model is able to capture video quality information at different granularities. A multi-resolution patch sampling mechanism is provided to effectively handle the variety of resolutions in the multi-resolution input sequence. A factorization of spatial and temporal Transformers is employed to efficiently model spatial and temporal information and capture complex space-time distortions in UGC videos. Experiments on several large-scale UGC VQA datasets have shown that MRET can achieve state-of-the-art performance and has strong generalization capability, demonstrating the effectiveness of the proposed method. While MRET is particularly beneficial for VQA-related applications, it can be applied to other scenarios such as where the task labels are affected by both video global composition and local details. Finally, the Transformer-based architecture can be modified to handle larger numbers of input tokens.
Although the technology herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present technology. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present technology as defined by the appended claims.