Due to its nearly universal popularity as a content medium, ever more visual media content is being produced and made available to consumers. As a result, the efficiency with which visual images can be analyzed, classified, and processed has become increasingly important to the producers, owners, and distributors of that visual media content.
One significant challenge to the efficient classification and processing of visual media content is that entertainment and media studios produce many different types of content having differing features, such as different visual textures and movement. In the case of audio-video (AV) film and television content, for example, the content produced may include live action content with realistic computer-generated imagery (CGI) elements, high complexity three-dimensional (3D) animation, and even two-dimensional (2D) hand-drawn animation. Moreover, each different type of content produced may require different treatment in pre-production, post-production, or both.
Consider, for example, the post-production treatment of AV or video content. Different types of AV or video content may benefit from different encoding schemes for streaming, or different workflows for localization. In the conventional art, the classification of content as being of a particular type is typically done manually, through human inspection, and in the example use case of video encoding, the most appropriate workflow may not be identifiable even after manual inspection, but may require trial and error to determine how to classify the content for encoding purposes. This classification process can be particularly challenging for mixed content types, such as animation embedded in otherwise live action content, or for visually complex 3D animation which may be better suited to post-processing using live action content workflows than to traditional animation workflows.
The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
As noted above, entertainment and media studios produce many different types of content having differing features, such as different visual textures and movement. In the case of audio-video (AV) or video content, for example, the content produced may include live action content with realistic computer-generated imagery (CGI) elements, high complexity three-dimensional (3D) animation, and even two-dimensional (2D) hand-drawn animation. Each different type of content produced may require different treatment in pre-production, post-production, or both.
As further noted above, in the post-production treatment of AV or video content, different types of video content may benefit from different encoding schemes for streaming, or different workflows for localization. In the conventional art, the classification of content as being of a particular type is done manually, through human inspection, and in the example use case of video encoding, the most appropriate workflow may not be identifiable even after manual inspection, but may require trial and error to determine how to categorize the content for encoding purposes. This classification process can be particularly challenging for mixed content types, such as animation embedded in otherwise live action content, or for visually complex 3D animation which may be better suited for post-processing using live action content workflows than for traditional animation workflows.
The present application discloses systems and methods for performing machine learning (ML) model based embedding for adaptable content evaluation. It is noted that the disclosure provided in the present application focuses on optimizations within the encoding pipeline for video streaming. Examples of tasks under consideration include 1) selection of pre- and post-processing algorithms or algorithm parameters, 2) automatic encoding parameter selection per title or per segment, and 3) automatic bitrate ladder selection for adaptive streaming per title or per segment. However, the present ML model based adaptable evaluation solution is task-independent and can be used in contexts that are different from those specifically described herein.
Thus, although the present adaptable content evaluation solution is described below in detail by reference to the exemplary use case of video encoding in the interests of conceptual clarity, the present novel and inventive principles may more generally be utilized in a variety of other content post-production processes, such as colorization, color correction, content restoration, mastering, and audio cleanup or synching, to name a few examples, as well as in pre-production processing. Moreover, the adaptable content evaluation solution disclosed in the present application may advantageously be implemented as an automated process.
As defined in the present application, the terms “automation,” “automated,” and “automating” refer to systems and processes that do not require human intervention. Although in some implementations a human editor may review the content evaluations performed by the systems and using the methods described herein, that human involvement is optional. Thus, the methods described in the present application may be performed under the control of hardware processing components of the disclosed automated systems.
Moreover, as defined in the present application, the expression “ML model” may refer to a mathematical model for making future predictions based on patterns learned from samples of data or “training data.” Various learning algorithms can be used to map correlations between input data and output data. These correlations form the mathematical model that can be used to make future predictions on new input data. Such a predictive model may include one or more logistic regression models, Bayesian models, or neural networks (NNs).
A “deep neural network,” in the context of deep learning, may refer to an NN that utilizes a plurality of hidden layers between input and output layers, which may allow for learning based on features not explicitly defined in raw data. As used in the present application, a feature identified as an NN refers to a deep neural network.
As further shown in
Although the present application refers to one or more of software code 110, ML model(s) 120, and content and classification database 112 as being stored in system memory 106 for conceptual clarity, more generally, system memory 106 may take the form of any computer-readable non-transitory storage medium. The expression “computer-readable non-transitory storage medium,” as defined in the present application, refers to any medium, excluding a carrier wave or other transitory signal that provides instructions to processing hardware 104 of computing platform 102. Thus, a computer-readable non-transitory storage medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM) while non-volatile memory may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory storage media include, for example, optical discs, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.
Moreover, although
Furthermore, although
Processing hardware 104 may include a plurality of hardware processing units, such as one or more central processing units, one or more graphics processing units, one or more tensor processing units, one or more field-programmable gate arrays (FPGAs), and an application programming interface (API) server, for example. By way of definition, as used in the present application, the terms “central processing unit” (CPU), “graphics processing unit” (GPU), and “tensor processing unit” (TPU) have their customary meaning in the art. That is to say, a CPU includes an Arithmetic Logic Unit (ALU) for carrying out the arithmetic and logical operations of computing platform 102, as well as a Control Unit (CU) for retrieving programs, such as software code 110, from system memory 106, while a GPU may be implemented to reduce the processing overhead of the CPU by performing computationally intensive graphics or other processing tasks. A TPU is an application-specific integrated circuit (ASIC) configured specifically for artificial intelligence (AI) processes such as machine learning.
In some implementations, computing platform 102 may correspond to one or more web servers, accessible over communication network 114 in the form of a packet-switched network such as the Internet, for example. Moreover, in some implementations, communication network 114 may be a high-speed network suitable for high performance computing (HPC), for example a 10 GigE network or an InfiniBand network. In some implementations, computing platform 102 may correspond to one or more computer servers supporting a private wide area network (WAN), local area network (LAN), or included in another type of limited distribution or private network. As yet another alternative, in some implementations, system 100 may be implemented virtually, such as in a data center. For example, in some implementations, system 100 may be implemented in software, or as virtual machines.
Although user system 130 is shown as a desktop computer in
With respect to display 132 of user system 130, display 132 may be implemented as a liquid crystal display (LCD), light-emitting diode (LED) display, organic light-emitting diode (OLED) display, quantum dot (QD) display, or any other suitable display screen that performs a physical transformation of signals to light. Furthermore, display 132 may be physically integrated with user system 130 or may be communicatively coupled to but physically separate from user system 130. For example, where user system 130 is implemented as a smartphone, laptop computer, or tablet computer, display 132 will typically be integrated with user system 130. By contrast, where user system 130 is implemented as a desktop computer, display 132 may take the form of a monitor separate from user system 130 in the form of a computer tower.
Input data 128, as well as training data 124, may include segmented content in the form of video snippets (e.g., samplings of frames), including raw frames, encoded frames, or both. In addition, in some implementations, input data 128, training data 124, or both, may be augmented with additional data, such as one or more of encoding statistics, distortion maps or metrics, pre-computed features such as per-pixel noise or texture information, for example, or any combination thereof. Thus, in various implementations, input data 128 and training data 124 may be 3D (e.g., in the case of video), 2D (e.g., in the case of individual frames), 1D (e.g., in the case of per-frame values), or even single variable for a segment.
In the case of AV or video content, input data 128 and training data 124 may include content segmented by shot, scene, timecode interval, or as individual video frames. Regarding the term “shot,” as defined for the purposes of the present application, a “shot” refers to a continuous series of video frames that are captured from a single camera perspective without cuts or other cinematic transitions, while a scene typically includes a plurality of shots. Alternatively, input data 128 and training data 124 may include content segmented using the techniques described in U.S. Patent Application Publication Number 2021/0076045, published on Mar. 11, 2021, and titled “Content Adaptive Boundary Placement for Distributed Encodes,” which is hereby incorporated fully by reference into the present application. It is noted that, in various implementations, input data 128 and training data 124 may include video content without audio, audio content without video, AV content, text, or content having any other format.
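By way of illustration only, and not as part of the disclosed implementations, the following sketch shows one way segmented video snippets of the kind described above might be assembled, assuming OpenCV and shot boundaries supplied by an upstream detector; the function name, frame count, and boundary format are hypothetical.

```python
# Hypothetical sketch: sampling a fixed number of frames per shot to build
# video snippets. Shot boundaries are assumed to be given as
# (start_frame, end_frame) pairs from an upstream detector.
import cv2
import numpy as np

def sample_shot_frames(video_path, shot_boundaries, frames_per_shot=8):
    """Return one stacked array of sampled frames per shot."""
    capture = cv2.VideoCapture(video_path)
    snippets = []
    for start_frame, end_frame in shot_boundaries:
        # Evenly spaced frame indices within the shot.
        indices = np.linspace(start_frame, end_frame, frames_per_shot, dtype=int)
        frames = []
        for index in indices:
            capture.set(cv2.CAP_PROP_POS_FRAMES, int(index))
            ok, frame = capture.read()
            if ok:
                frames.append(frame)
        if frames:
            snippets.append(np.stack(frames))
    capture.release()
    return snippets
```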
ML model(s) 120 include an ML model based embedder trained using training data 124 that is selected based on one or more similarity metrics. Such similarity metrics may be metrics that are quantitatively similar, i.e., objectively similar, or may be metrics that are perceptually similar, i.e., subjectively similar under human inspection. Examples of perceptual similarity metrics for AV and video content may include texture, motion, and perceived encoding quality, to name a few. Examples of quantitative similarity metrics for AV and video content may include rate distortion curves, pixel density, computed optical flow, and computed encoding quality, also to name merely a few.
Referring to
As shown in
As shown in
As shown in
ML model based embedder 226 is responsible for mapping content segments to embeddings. Example implementations of ML model based embedder 226 include, but are not restricted to, one or more of a 1D, 2D, or 3D convolutional neural network (CNN) with early or late fusion, trained from scratch or pre-trained to leverage transfer learning. Depending on the target task, features extracted from different layers of a pre-trained CNN, such as the last layer of a Visual Geometry Group (VGG) CNN or Residual Network (ResNet), might be used to shape the embedding.
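A minimal sketch of one such embedder follows, assuming PyTorch, a pre-trained torchvision ResNet-50 backbone, and late fusion by averaging per-frame features over a snippet; the backbone, embedding dimension, and class name are illustrative assumptions rather than the disclosed implementation of ML model based embedder 226.

```python
# Hypothetical sketch: a late-fusion embedder built on a pre-trained 2D CNN.
# Per-frame features are extracted with a ResNet-50 backbone and averaged
# over the frames of a snippet to yield a single embedding vector.
import torch
import torch.nn as nn
from torchvision import models

class LateFusionEmbedder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.projection = nn.Linear(2048, embedding_dim)

    def forward(self, snippet):
        # snippet: (num_frames, 3, H, W) tensor of preprocessed frames.
        features = self.backbone(snippet).flatten(1)   # (num_frames, 2048)
        embedding = self.projection(features).mean(0)  # late fusion by averaging
        return nn.functional.normalize(embedding, dim=0)
```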
Classification or regression block 260 is responsible for performing classification or regression tasks, such as selection of pre-processing or post-processing algorithms or parameters, rate-distortion prediction where distortion can be measured with various quality metrics, per title or per segment automatic encoding parameter selection, and per title or per segment automatic bitrate ladder selection for adaptive streaming, i.e., prediction of the highest bitrate at which a given title or segment needs to be encoded to reach a certain perceptual quality, to name a few examples.
Classification or regression block 260 can be implemented as, for instance, a similarity metric plus threshold to determine what cluster of embeddings a particular embedding belongs to among different cluster groups available in a continuous vector space into which the embeddings are mapped. Alternatively, classification or regression block 260 can be implemented as a neural network (NN) or other ML model architecture included among ML model(s) 120 that is trained to classify or regress the embedding to the ground truth result. Moreover, in some implementations, classification or regression block 260 may be integrated with ML model based embedder 226, and may serve as one or more layers of ML model based embedder 226, for example.
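The first of these variants, a similarity metric plus threshold, might be sketched as follows, assuming cosine similarity against pre-computed cluster centroids; the threshold value and the source of the centroids are assumptions for illustration only.

```python
# Hypothetical sketch: assigning an embedding to a cluster using a similarity
# metric plus threshold, as in the first variant described above.
import numpy as np

def assign_cluster(embedding, centroids, threshold=0.7):
    """Return the index of the most similar centroid, or None if no centroid
    exceeds the similarity threshold (e.g., a previously unseen content type)."""
    embedding = embedding / np.linalg.norm(embedding)
    similarities = [
        float(np.dot(embedding, c / np.linalg.norm(c))) for c in centroids
    ]
    best = int(np.argmax(similarities))
    return best if similarities[best] >= threshold else None
```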
The contrastive learning processes depicted in
The training of ML model based embedder 226, classification or regression block 260, or both ML model based embedder 226 and classification or regression block 260, may be performed by software code 110, executed by processing hardware 104 of computing platform 102. In some use cases, ML model based embedder 226 and classification or regression block 260 may be trained independently of one another.
Alternatively, in some implementations, ML model based embedder 226 may be trained first, and then the embeddings provided by ML model based embedder 226 may be used for different downstream classification or regression tasks. In such a case, ML model based embedder 226 may be trained by a) identifying two content segments that are deemed similar and feeding them in as training data while minimizing a distance function between them, b) identifying two content segments that are deemed dissimilar and feeding them in as training data while maximizing a distance function between them, and c) repeating steps a) and b) for the length of the training.
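The pairwise procedure of steps a) through c) might be sketched as follows, assuming PyTorch, a margin-based contrastive loss, and a hypothetical pair_loader that yields two preprocessed content segments together with a 0/1 similarity flag; none of these names is part of the present disclosure.

```python
# Hypothetical sketch: pairwise contrastive training of the embedder.
# Similar pairs are pulled together (distance minimized) and dissimilar
# pairs pushed apart up to a margin, repeated for the length of training.
import torch

def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    distance = torch.norm(emb_a - emb_b, dim=-1)
    similar_term = is_similar * distance.pow(2)
    dissimilar_term = (1 - is_similar) * torch.clamp(margin - distance, min=0).pow(2)
    return (similar_term + dissimilar_term).mean()

def train_embedder(embedder, pair_loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(embedder.parameters(), lr=lr)
    for _ in range(epochs):
        for segment_a, segment_b, is_similar in pair_loader:
            loss = contrastive_loss(embedder(segment_a), embedder(segment_b), is_similar)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```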
As yet another alternative, in some implementations, ML model based embedder 226 and classification or regression block 260 may be trained together. Moreover, in some implementations in which ML model based embedder 226 and classification or regression block 260 are trained together, such as where classification or regression block 260 is integrated with ML model based embedder 226 as one or more layers of ML model based embedder 226, for example, ML model based embedder 226 including classification or regression block 260 may be trained using end-to-end learning.
After training of ML model based embedder 226, classification or regression block 260, or both ML model based embedder 226 and classification or regression block 260, is complete, processing hardware 104 may execute software code 110 to receive input data 128 from user system 130, and to use ML model based embedder 226 to transform input data 128, or segments thereof, into a vector representation (hereinafter “embedding”) of the content mapped into a continuous one-dimensional or multi-dimensional vector space, resulting in an embedded vector representation of the content in that vector space.
In addition to using ML model based embedder 226 to map embeddings 352a-352j onto continuous multi-dimensional vector space 350, software code 110, when executed by processing hardware 104, may further perform an unsupervised clustering process to identify clusters, each corresponding to a different content category with respect to the similarity metric being used to compare content.
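A minimal sketch of such an unsupervised clustering step, assuming scikit-learn, is shown below; the choice of algorithm (k-means when the number of categories is specified, DBSCAN otherwise) and its parameters are illustrative assumptions.

```python
# Hypothetical sketch: clustering embeddings in the continuous vector space
# to discover content categories without predefined labels.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def cluster_embeddings(embeddings, n_clusters=None):
    """embeddings: (num_segments, embedding_dim) array.
    If n_clusters is given, use k-means; otherwise let DBSCAN discover the
    number of clusters (label -1 marks outliers such as mixed content)."""
    embeddings = np.asarray(embeddings)
    if n_clusters is not None:
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    return DBSCAN(eps=0.5, min_samples=5).fit_predict(embeddings)
```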
In the case of AV and video content, for example, embeddings 352a and 352b of live action content are mapped to a region of multi-dimensional vector space 350 identified by cluster 354c, while embeddings 352c, 352d, 352f, and 352h of low complexity animation content, such as hand-drawn and other two-dimensional (2D) animation, are mapped to different regions of continuous multi-dimensional vector space 350 identified by clusters 354a and 354d. Embeddings 352e, 352i, and 352j of high complexity animation content, such as 3D animation, are shown to be mapped to yet a different region of continuous multi-dimensional vector space 350 identified by cluster 354b. It is noted that embedding 352g of a mixed content type, such as animation mixed with live action, for example, may be mapped to the border of an animation cluster or a live action cluster, or to a region between such clusters.
In use cases in which the content corresponding to embeddings 352a-352j is AV or video content and the process for which that content is being categorized is video encoding, for example, each of clusters 354a-354d may correspond to a different codec. For instance, cluster 354c may identify content for which a high bit-rate codec is required, while cluster 354a may identify content for which a low bit-rate codec is sufficient. Clusters 354b and 354d may identify content with other specific codecs. In one such implementation in which a new codec is introduced or an existing codec is retired, system 100 may be configured to automatically re-evaluate embeddings 352a-352j relative to the changed group of available codecs. An analogous re-evaluation may be performed for any other process to which the present concepts are applied.
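One way such an automatic re-evaluation might look in practice is sketched below: the embeddings are simply re-clustered against the changed set of codec profiles and each cluster is mapped to a profile. The profile names, the one-to-one mapping rule, and the reuse of the clustering sketch above are assumptions for illustration only.

```python
# Hypothetical sketch: re-evaluating embeddings when the set of available
# codecs changes. The embeddings are re-clustered with as many clusters as
# there are codec profiles; the cluster-to-profile mapping is a placeholder.
def reevaluate_codec_assignments(embeddings, codec_profiles, cluster_fn):
    """Return a per-segment codec assignment after re-clustering."""
    labels = cluster_fn(embeddings, n_clusters=len(codec_profiles))
    return {segment_idx: codec_profiles[label]
            for segment_idx, label in enumerate(labels)}

# Example usage with the clustering sketch above:
# assignments = reevaluate_codec_assignments(
#     embeddings,
#     ["high_bitrate_codec", "low_bitrate_codec", "codec_c", "codec_d"],
#     cluster_embeddings,
# )
```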
It is noted that the continuity of multi-dimensional vector space 350 advantageously enables adjustment of the way in which embeddings corresponding to content are clustered into categories per individual use case, through utilization of different clustering algorithms and thresholds. In contrast to conventional classification methods, which depend on a priori knowledge of the number of classification labels to be trained for, the present novel and inventive embedding approach can be adaptably used for a plurality of classification schemes. Moreover, due to the unsupervised nature of the clustering performed as part of the present adaptable content evaluation solution, the approach disclosed in the present application can yield unanticipated insights into similarities among items of content that may appear superficially to be different.
The functionality of system 100 will be further described by reference to
Referring now to
Referring to
As discussed above, ML model based embedder 226 may be trained using contrastive learning based on one or more similarity metrics. As also discussed above, the one or more similarity metrics on which contrastive learning by ML model based embedder 226 is based may include a quantitative similarity metric, a perceptual similarity metric, or both. In some implementations, as noted above, ML model based embedder 226 may include one or more of a 1D CNN, a 2D CNN, or a 3D CNN, for example, with early or late fusion. Furthermore, with respect to continuous vector space 350, it is noted that in some implementations continuous vector space 350 may be multi-dimensional, as shown in
Flowchart 470 further includes performing one of a classification or a regression of the content segments using mapped embeddings (e.g., embeddings 352a-352j) (action 473). In some implementations, as described above by reference to
As also discussed above, in some implementations, the classification or regression performed in action 473 may be performed using classification or regression block 260, which may take the form of a trained neural network (NN) or other ML model architecture included among ML model(s) 120. Moreover, and as further discussed above, in some implementations, classification or regression block 260 may be integrated with ML model based embedder 226, and may serve as one or more layers of ML model based embedder 226, for example. Action 473 may be performed by software code 110, executed by processing hardware 104 of computing platform 102, and in some implementations, using classification or regression block 260.
Flowchart 470 further includes discovering, based on the one of the classification or the regression performed in action 473, at least one new label for characterizing the plurality of content segments received in action 471 (action 474). For example, and as described above by reference to
For example, the at least one new label discovered in action 474 may advantageously result in implementation of new, and more effective, AV or video encoding parameters. In addition, or alternatively, information discovered as part of action 474 may be used to selectively enable one or more presently unused encoding parameters, as well as to selectively disable one or more encoding parameters presently in use. As yet another example, the information discovered as part of action 474 may be used to selectively turn on or turn off certain preprocessing elements in the transcoding pipeline, such as denoising or debanding, for example, based on content characteristics. Action 474 may be performed by software code 110, executed by processing hardware 104 of computing platform 102.
In some implementations, the method outlined by flowchart 470 may conclude with action 474 described above. However, in other implementations, the method outlined by flowchart 470 may also include further training ML model based embedder 226 using contrastive learning and the at least one new label discovered in action 474 (action 475). That is to say, action 475 is optional. When included in the method outlined by flowchart 470, action 475 may be performed by software code 110, executed by processing hardware 104 of computing platform 102, and may advantageously result in refining and improving the future classification or regression performance of system 100. With respect to the actions included in flowchart 470, it is noted that actions 471, 472, 473, and 474 (hereinafter “actions 471-474”), or actions 471-474 and 475, may be performed as an automated process from which human involvement may be omitted.
Thus, the present application discloses systems and methods for performing ML model based embedding for adaptable content evaluation. The solution disclosed in the present application advances the state-of-the-art by addressing and providing solutions for situations in which a downstream task lacks clear annotated labeling, but it is desirable to discover how and why different items of content respond differently to that task. The novel and inventive concepts disclosed in the present application can be used to automate the process of discovering appropriate labels for those differences (auto-discovery), and to train a model so that a prediction can be generated as to which of those groupings a given item of content will fall into, given a particular set of parameters for that downstream task.
As described above, the present application discloses two approaches to addressing the problem of auto-discovery of labels that can build on each other. In the first approach, as described by reference to
In the second approach, the first approach described above may be further supplemented by adding a classification or regression block for different types of downstream tasks, as described above by reference to
From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.
The present application claims the benefit of and priority to a pending Provisional Patent Application Ser. No. 63/165,924, filed Mar. 25, 2021, and titled “Video Embedding for Classification,” which is hereby incorporated fully by reference into the present application.