Videos are a rich source of information about the world, conveyed through visual and auditory channels. Although spoken language is influential in communicating high-level semantic information in videos, the sound landscapes of different scenes may contain much more. For instance, the words spoken by actors may vary arbitrarily even as other auditory and visual aspects of the scene remain tightly coupled.
For example, two entirely different conversations may occur in the same café or other setting. In audiovisual representation learning, these simultaneous conversations may distract from more important scene-level correlations between the visual and auditory channels, such as object sounds, spatial correspondences, the acoustics of environments, non-linguistic attributes of speech, or other types of correlations. Despite this, machine learning models that encode scenes into audiovisual representations often over-emphasize speech within the audio track of an encoded scene. As a result, typical representation learning models often generate inaccurate representations of similar scenes whose audio tracks include different speech patterns. Instead of generating representations that reflect how similar the scenes are, these encoding models generate representations that are overly focused on the differing speech patterns in the audio tracks of the scenes.
As will be described in greater detail below, the present disclosure describes implementations that train machine-learning models to de-emphasize speech audio when generating audiovisual representations in a shared dimensional space.
In one example, a computer-implemented method for training a machine-learning model to accurately generate representations of similar scenes from long-form videos with different speech audio includes generating a training set including, from a long-form video, a video track clip, a primary language audio track corresponding to the video track clip, and a dubbed language audio track corresponding to the video track clip, applying the machine-learning model to the training set to generate a first video representation of the video track clip paired with the primary language audio track and a second video representation of the video track clip paired with the dubbed language audio track, and continually applying the machine-learning model to the training set until the first video representation and the second video representation are positioned within a threshold distance from each other within a representational space.
In some examples, the training set further includes, from the long-form video, additional video track clips, primary language audio tracks corresponding to the additional video track clips, and dubbed language audio tracks corresponding to the additional video track clips. Moreover, in some examples, the method further includes applying the machine-learning model to the training set to generate video representations of the additional video track clips paired with the primary language audio tracks corresponding to the additional video track clips, and video representations of the additional video track clips paired with the dubbed language audio tracks corresponding to the additional video track clips. Additionally, in some examples, the method includes continually applying the machine-learning model to the training set until the video representations of the additional video track clips paired with the primary language audio tracks corresponding to the additional video track clips and the video representations of the additional video track clips paired with the dubbed language audio tracks corresponding to the additional video track clips are positioned within the threshold distance from each other within the representational space.
In some examples, the dubbed language audio track corresponding to the video track clip is in one of Spanish, French, or Japanese. Additionally, in some examples, the machine-learning model includes convolutional neural network encoders and transformer models that are specialized for processing videos and audio spectrograms. Moreover, in some examples, the first video representation and the second video representation include 1024-dimensional vectors output by the convolutional neural network encoders. Furthermore, in some examples, the machine-learning model further includes multi-layer perceptron heads that project the first video representation and the second video representation into the representational space. Additionally, in some examples, the representational space includes a 512-dimensional space. In most examples, the method further includes applying the machine-learning model to a new long-form video for one or more of audiovisual scene classification, emotion recognition, action recognition, or speech keyword recognition.
Some examples described herein include a system with at least one physical processor and physical memory including computer-executable instructions that, when executed by the at least one physical processor, cause the at least one physical processor to perform acts. In at least one example, the computer-executable instructions, when executed by the at least one physical processor, cause the at least one physical processor to perform acts including generating a training set including, from a long-form video, a video track clip, a primary language audio track corresponding to the video track clip, and a dubbed language audio track corresponding to the video track clip, applying a machine-learning model to the training set to generate a first video representation of the video track clip paired with the primary language audio track and a second video representation of the video track clip paired with the dubbed language audio track, and continually applying the machine-learning model to the training set until the first video representation and the second video representation are positioned within a threshold distance from each other within a representational space.
In some examples, the above-described method is encoded as computer-readable instructions on a computer-readable medium. In one example, the computer-readable instructions, when executed by at least one processor of a computing device, cause the computing device to generate a training set including, from a long-form video, a video track clip, a primary language audio track corresponding to the video track clip, a first dubbed language audio track corresponding to the video track clip, and a second dubbed language audio track corresponding to the video track clip, apply a machine-learning model to the training set to generate a first video representation of the video track clip paired with the primary language audio track, a second video representation of the video track clip paired with the first dubbed language audio track, and a third video representation of the video track clip paired with the second dubbed language audio track, and continually apply the machine-learning model to the training set until the first video representation, the second video representation, and the third video representation are positioned within a threshold distance from each other within a representational space.
In one or more examples, features from any of the embodiments described herein are used in combination with one another in accordance with the general principles described herein. These and other embodiments, features, and advantages will be more fully understood upon reading the following detailed description in conjunction with the accompanying drawings and claims.
The accompanying drawings illustrate a number of exemplary embodiments and are a part of the specification. Together with the following description, these drawings demonstrate and explain various principles of the present disclosure.
Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the exemplary embodiments described herein are susceptible to various modifications and alternative forms, specific implementations have been shown by way of example in the drawings and will be described in detail herein. However, the exemplary implementations described herein are not intended to be limited to the particular forms disclosed. Rather, the present disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.
As mentioned above, typical encoding models over-emphasize speech sounds when generating scene representations. For example, in long-form videos such as movies, multiple scenes may take place in the same setting—often with the same actors and types of ambient noise. To illustrate, a movie may include multiple scenes that take place in the same café where the same actors have different conversations. As such, the scenes include similar visuals (e.g., the café and the actors) and similar ambient sounds (e.g., glasses clinking, background conversation, café music), but different speech audio (e.g., due to different conversations that take place). Typical machine-learning models that generate representational encodings of audiovisual scenes, however, often over-emphasize speech. As such, a typical encoding model will often generate very different representations of scenes that are very similar—save for the speech audio.
By generating these inaccurate representations, typical encoding models introduce additional inaccuracies in other machine learning tasks. To illustrate, machine-learning models can be used in combination with long-form videos to perform tasks such as time-based semantic and cautionary tagging, explainable recommendations, creation of personalized promotional assets, and so forth. Teaching these machine-learning models to perform these tasks often relies on encoded audiovisual representations; i.e., compressed numerical descriptions of audiovisual clips. In self-supervised learning, the structure of the data within these representations is crucial in avoiding the need for human-generated content annotations. For example, in self-supervised learning, a model can effectively learn to recognize objects by predicting if a slightly processed image belongs to the same category as the unprocessed original.
In connection with video content, models are often taught to learn representations by having the models predict correspondences between video modalities—such as the correspondence between a visual clip of a scene and an audio clip of the same scene. For example, a model is trained to recognize which auditory and visual clips match as part of the same original scene. This encourages the model to encode the clips into useful representations that capture multimodal characteristics, thereby allowing discrimination between different clips—while keeping similar clips closer together in a representational space. Despite this, when such a model over-emphasizes speech audio as mentioned above, the inaccurate representations it generates of similar scenes with different speech audio will not be close together in that representational space. These inaccuracies then bleed into other machine-learning based video and/or audio processing tasks.
Training such models to de-emphasize speech audio, however, is challenging. For example, as mentioned above, the “looking similar while sounding different” problem most often arises in connection with long-form videos, where the same backdrops and actors are seen many times over the course of a movie or episode. Despite this, many models are trained with random short-form videos such as those freely available on the Internet. As such, this short-form video training data generally fails to reflect multiple scenes that look similar but have different speech audio.
To solve these problems, the present disclosure describes a system that leverages long-form video along with audio dubs to train a machine-learning model to accurately generate audiovisual representations of scenes that are visually similar but have different speech audio. For example, as discussed in greater detail below, the described system generates training data including visual clips from long-form videos along with audio tracks corresponding to the visual clips that include multiple variations on the speech from those visual clips. To illustrate, the described system generates training data including a visual clip from a long-form video, an audio clip corresponding to the visual clip that includes speech in English, and at least one additional audio clip corresponding to the visual clip that includes speech with the same semantic meaning but in a different language (e.g., Spanish, French, and/or Japanese). The described system then applies an embedding model to this training data to generate an embedding of the visual clip and the English speech audio and an embedding of the visual clip and the additional speech audio. Over multiple training epochs, the described system helps the embedding model learn to generate these embeddings such that they are positioned within a threshold distance in a representational space—indicating that the embedding model has learned that the scenes are similar, even though the speech audio does not include the same semantic meaning.
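For illustration only, the following is a minimal sketch of how such a training triplet (a short video clip plus its primary and dubbed audio tracks) might be cut from a long-form video file. The use of ffmpeg, the assumed stream layout (English as audio stream 0, a Spanish dub as audio stream 1), and the file naming are assumptions for the sketch rather than details taken from this disclosure.

```python
import subprocess

def extract_training_triplet(video_path, start_s, dur_s=3.0, out_prefix="clip"):
    """Cut a short video clip plus its primary and dubbed audio tracks.

    Assumes a hypothetical stream layout: the English mix is audio stream 0
    and a Spanish dub is audio stream 1.
    """
    common = ["ffmpeg", "-y", "-ss", str(start_s), "-t", str(dur_s), "-i", video_path]
    # Video-only clip (audio dropped with -an).
    subprocess.run(common + ["-an", "-map", "0:v:0", f"{out_prefix}_video.mp4"], check=True)
    # Primary (English) audio for the same time span (video dropped with -vn).
    subprocess.run(common + ["-vn", "-map", "0:a:0", f"{out_prefix}_audio_en.wav"], check=True)
    # Dubbed (Spanish) audio for the same time span.
    subprocess.run(common + ["-vn", "-map", "0:a:1", f"{out_prefix}_audio_es.wav"], check=True)
    return (f"{out_prefix}_video.mp4",
            f"{out_prefix}_audio_en.wav",
            f"{out_prefix}_audio_es.wav")
```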
Features from any of the implementations described herein may be used in combination with one another in accordance with the general principles described herein. These and any other implementations, features, and advantages will be more fully understood upon reading the following detailed description in conjunction with the accompanying drawings and claims.
The following will provide, with reference to
As just mentioned,
Because of the semantic differences between the speech 112a and the speech 112b, typical embedding models generate representations of the first scene 102a and the second scene 102b that are positioned very differently within a representational space. As such, and as mentioned above, a language-invariant audiovisual system is described herein that teaches an embedding model to de-emphasize speech audio when generating scene embeddings. With speech audio de-emphasized, the trained embedding model generates rich embeddings of similar scenes with semantically different speech that are positioned close to each other within a representational space—indicating that the embedding model has learned that the scenes are similar even when the speech audio is semantically different. Thus trained, the embedding model can be applied to other machine learning tasks in connection with long-form videos.
To train a machine-learning model to de-emphasize speech audio while generating long-form video scene embeddings, a language-invariant audiovisual system relies on dubbed audio. To illustrate, as shown in
In one or more examples, the language-invariant audiovisual system 200 further generates the training data including dubbed audio tracks. For example, as further shown in
In one or more implementations, as mentioned above, the language-invariant audiovisual system 200 utilizes the training data including dubbed language audio tracks to train a machine-learning model to generate video representations (e.g., embeddings or encodings) that de-emphasize speech audio. For example, as shown in greater detail in
As discussed above, the training data corresponding to the long-form video 302 also includes at least one dubbed language audio track—such as a dubbed language audio track in Spanish (ES). In most examples, this dubbed language audio track is similar to the primary language audio track except that the dialog or speech is replaced with its translation into another language. Notably, the translation into the additional language is often not a direct, word-for-word translation, but rather conveys the same semantic meaning as the original speech. Thus, the dubbed language audio track includes the same music, background noise, sound effects, etc. as the primary language audio track. The only difference is that any spoken language in English is replaced by spoken language in a secondary language such as Spanish.
Accordingly, during training, the language-invariant audiovisual system 200 applies the machine-learning model 300 to the training data such that the machine-learning model 300 passes a video track clip 304 from the long-form video 302 through a video encoder 310. The machine-learning model 300 further passes the primary language audio track 306 and the dubbed language audio track 308 through audio encoders 312, 314. In some implementations, the audio encoders 312, 314 are a single audio encoder rather than separate audio encoders. In one or more implementations, each of the video encoder 310, the audio encoder 312, and the audio encoder 314 outputs 1024-dimensional features, vectors, or representations. In most examples, the encoders 310, 312, and 314 are convolutional neural network encoders specialized for processing videos (the video encoder 310) and audio spectrograms (the audio encoders 312, 314). In some examples, the encoders accept 3-second video and audio clips, respectively. In additional implementations, the machine-learning model 300 further includes one or more transformer models.
In most examples, the machine-learning model 300 further passes these 1024-dimensional representations into a projection head model 316. In one or more implementations, the projection head model 316 is a multi-layer perceptron (MLP) head model. As such, the machine-learning model 300 passes the 1024-dimensional representations through the projection head model 316 to project the 1024-dimensional representations into a shared 512-dimensional space.
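For illustration, the following is a minimal sketch of this forward pass, assuming generic backbone encoders that emit 1024-dimensional clip features. The module names, the hidden layer size of the projection head, and the L2 normalization step are illustrative assumptions rather than details fixed by this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """MLP head that maps 1024-dim encoder features into a shared 512-dim space."""
    def __init__(self, in_dim=1024, hidden_dim=1024, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        # L2-normalize so cosine similarity reduces to a dot product downstream.
        return F.normalize(self.net(x), dim=-1)

class LanguageInvariantModel(nn.Module):
    """Video encoder, audio encoder, and projection heads for the two modalities.

    `video_backbone` and `audio_backbone` stand in for the CNN encoders
    described above; any modules emitting 1024-dim features per clip will do.
    """
    def __init__(self, video_backbone, audio_backbone):
        super().__init__()
        self.video_encoder = video_backbone   # e.g., a 3D CNN over 16 frames
        self.audio_encoder = audio_backbone   # e.g., a 2D CNN over mel spectrograms
        self.video_head = ProjectionHead()
        self.audio_head = ProjectionHead()

    def forward(self, video, audio_primary, audio_dub):
        z_v = self.video_head(self.video_encoder(video))            # (B, 512)
        z_ap = self.audio_head(self.audio_encoder(audio_primary))   # (B, 512)
        z_as = self.audio_head(self.audio_encoder(audio_dub))       # (B, 512)
        return z_v, z_ap, z_as
```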
With the 1024-dimensional representations projected into this shared representational space, the machine-learning model 300 computes the contrastive objective function for the first cross-modal pair 318 (e.g., including the video track clip 304 paired with the primary language audio track 306) and the second cross-modal pair 320 (e.g., including the video track clip 304 paired with the dubbed language audio track 308). In one or more implementations, the machine-learning model 300 utilizes normalized temperature-scaled cross-entropy as the contrastive objective function. For example, for a positive pair (zi, zj), the machine-learning model 300 determines:
Where k is the number of clips that are sampled from the same film (e.g., where k is a hyperparameter, and
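For context, a standard normalized temperature-scaled cross-entropy (NT-Xent) objective for a positive pair (zi, zj) takes the form ℓ(zi, zj) = −log(exp(sim(zi, zj)/τ) / Σk exp(sim(zi, zk)/τ)), where sim(·,·) is cosine similarity and τ is a temperature. The sketch below implements a symmetric, batched version of this loss; the temperature value and the use of in-batch negatives (rather than a specific per-film sampling of k negatives) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """Symmetric NT-Xent loss between two sets of L2-normalized embeddings.

    z_a, z_b: (B, D) tensors where row i of z_a and row i of z_b form a
    positive pair; all other rows in the batch serve as negatives.
    The temperature of 0.1 is an assumed value.
    """
    logits = z_a @ z_b.t() / temperature                      # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)    # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```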
In certain implementations, the language-invariant audiovisual system 200 represents one or more software applications, modules, or programs that, when executed by a computing device, may cause the computing device to perform one or more tasks. For example, and as will be described in greater detail below, one or more of the training set manager 406, the machine-learning model manager 408, or the evaluation manager 410 represent software stored and configured to run on one or more computing devices such as the server(s) 402. One or more of the training set manager 406, the machine-learning model manager 408, and the evaluation manager 410 of the language-invariant audiovisual system 200 shown in
As mentioned above, and as shown in
In one or more implementations, the training set manager 406 further generates the training set by dividing each long-form video in the long-form video repository 414 into 3-second non-overlapping clips. In some examples, the training set manager 406 maintains movie-level identifiers for each clip—for later use in a negative sampling strategy—while noting the hierarchical nature of the dataset. Alternatively, in some implementations, the training set manager 406 divides the long-form videos based on camera shot boundaries.
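As a simple illustration, assuming hypothetical per-movie metadata (a movie identifier and a duration), the non-overlapping 3-second split with retained movie-level identifiers might look like the following sketch:

```python
def split_into_clips(movie_id, duration_s, clip_len_s=3.0):
    """Yield non-overlapping clip records, keeping the movie-level identifier.

    The movie_id travels with every clip so a sampler can later draw
    negatives from the same film; the trailing partial clip is dropped.
    """
    n_clips = int(duration_s // clip_len_s)
    for i in range(n_clips):
        start = i * clip_len_s
        yield {"movie_id": movie_id, "start": start, "end": start + clip_len_s}
```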
In at least one implementation, the training set manager 406 takes various modality-specific considerations into account. To illustrate, in most examples, the training set manager 406 applies, to video clips, a pipeline consisting of 1) uniformly subsampling 16 frames from a 3-second video clip (e.g., regardless of frame rate), 2) randomly rescaling the height of the subsampled frames to between 256 and 320 pixels while maintaining the aspect ratio, 3) randomly cropping to 224×224 pixels, and 4) applying a random augmentation model with 1 layer and a distortion magnitude set to 9.
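For illustration, the following sketch approximates this video pipeline using torchvision-style transforms; treating the rescale as a short-side resize, the tensor layout, and the uint8 dtype are simplifying assumptions rather than requirements of this disclosure.

```python
import random
import torch
from torchvision import transforms

def subsample_frames(video, num_frames=16):
    """Uniformly pick num_frames frames from a (T, C, H, W) uint8 video tensor."""
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    return video[idx]

def augment_clip(frames):
    """Resize, crop, and augment a (num_frames, C, H, W) uint8 clip."""
    short_side = random.randint(256, 320)                # random rescale target
    pipeline = transforms.Compose([
        transforms.Resize(short_side, antialias=True),   # keeps aspect ratio
        transforms.RandomCrop(224),                      # 224x224 crop
        transforms.RandAugment(num_ops=1, magnitude=9),  # 1 layer, magnitude 9
    ])
    return pipeline(frames)
```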
In one or more examples, the training set manager 406 utilizes audio tracks that are initially in 5.1 surround format (i.e., left channel, right channel, center channel, low-frequency effects channel, left-surround channel, right-surround channel) with a sample rate of 48 kHz and a bit depth of 16 bits. As such, in most examples, the training set manager 406 performs a standard AC-3 down-mix to stereo to preserve typical channel relationships before further pre-processing. In some examples, due to the scale of this multichannel audio, the training set manager 406 stores clips with compression in a high-quality codec at 256 kbps. The training set manager 406 further extracts audio spectrograms from the stereo audio tracks, parameterized with a 1024-point Hanning window and a hop size of 512 with 96 mel bins. This results in an input representation of dimension C×96×280, where C=2 (for stereo), for each audio track. The training set manager 406 then applies both time and frequency masking as augmentation, each with a rate of 0.5.
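For illustration, the following sketch approximates the spectrogram stage using torchaudio; the mask widths and the log-amplitude scaling are assumed choices that are not specified above.

```python
import torch
import torchaudio

def audio_to_masked_spectrogram(stereo_wave, sample_rate=48_000, mask_prob=0.5):
    """Turn a (2, num_samples) stereo waveform into an augmented mel spectrogram.

    Uses a 1024-point Hann window, hop size 512, and 96 mel bins, yielding a
    (2, 96, ~280) representation for a 3-second clip at 48 kHz. The mask
    widths and the log-amplitude scaling are assumed, not specified above.
    """
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=96,
        window_fn=torch.hann_window,
    )(stereo_wave)
    mel = torchaudio.transforms.AmplitudeToDB()(mel)       # log scaling (assumed)
    if torch.rand(1).item() < mask_prob:                   # frequency masking at rate 0.5
        mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=16)(mel)
    if torch.rand(1).item() < mask_prob:                   # time masking at rate 0.5
        mel = torchaudio.transforms.TimeMasking(time_mask_param=32)(mel)
    return mel
```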
As a result of this pre-processing, the training set manager 406 generates a training set of 3-second video clips, corresponding primary language audio tracks, and corresponding dubbed language audio tracks. As discussed above, the training set manager 406 extracts these elements from long-form videos (e.g., those in the long-form video repository 414) to ensure that the training set includes scenes that look similar, namely scenes that include the same settings, the same actors, the same landmarks, the same buildings, the same backgrounds, etc.
As mentioned above, and as shown in
Moreover, in most examples, the machine-learning model manager 408 trains the machine-learning model utilizing the training set generated by the training set manager 406. For example, the machine-learning model manager 408 trains the machine-learning model cross-modally, computing the contrastive cost between modalities. This results in two optimized costs, summed together: one between a video track and the corresponding primary language audio track (e.g., l(V, Ap) from the equation above), and one between the video track and the corresponding dubbed language audio track (e.g., l(V, As) from the equation above).
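Put together, a single cross-modal training step might look like the following sketch, which reuses the illustrative model forward pass and NT-Xent loss sketched earlier:

```python
def training_step(model, batch, loss_fn, temperature=0.1):
    """One cross-modal contrastive step: l(V, Ap) + l(V, As).

    `model` follows the forward() sketch above and `loss_fn` is an NT-Xent
    style function such as nt_xent_loss(); both names are illustrative.
    """
    video, audio_primary, audio_dub = batch
    z_v, z_ap, z_as = model(video, audio_primary, audio_dub)
    loss_primary = loss_fn(z_v, z_ap, temperature)   # video <-> primary language audio
    loss_dub = loss_fn(z_v, z_as, temperature)       # video <-> dubbed language audio
    return loss_primary + loss_dub                   # two optimized costs, summed
```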
In one or more implementations, the machine-learning model manager 408 continually applies the machine-learning model to the training set until the machine-learning model is fully trained. In some examples, the machine-learning model manager 408 trains the machine-learning model on four A100 GPUs for ten epochs with a batch size of 26 per GPU (total=104). In at least one example, the machine-learning model manager 408 uses the negative sampling parameter k (samples per long-form video), which is set to 12 per GPU (total=48). The machine-learning model manager 408 further uses an optimizer with β=(0.9, 0.999), a learning rate of 0.001, and weight decay set to 0.05. In some examples, the machine-learning model manager 408 uses a cosine learning rate schedule with a half-epoch warmup.
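For illustration, the following sketch reflects these hyperparameters; the choice of an AdamW-style optimizer is an assumption, since only the betas, learning rate, and weight decay are stated above.

```python
import math
import torch

def build_optimizer_and_schedule(model, steps_per_epoch, epochs=10):
    """Optimizer and learning-rate schedule matching the stated hyperparameters.

    AdamW is an assumption; only the betas, learning rate, and weight decay
    are stated. Warmup lasts half an epoch, followed by cosine decay.
    """
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.05)

    warmup_steps = steps_per_epoch // 2
    total_steps = steps_per_epoch * epochs

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```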
In one or more implementations, the machine-learning model manager 408 trains the machine-learning model following any of multiple variants. For example, in at least one implementation, the machine-learning model manager 408 performs baseline training according to a “pseudo-dub” variant. In the “pseudo-dub” variant, the machine-learning model manager 408 trains machine-learning models with two differently augmented primary (English) audio treated as “primary” and “secondary” audio, respectively. In most examples, this accounts for any possible effect of two augmentations per seen sample, as occurs for the with-dub cases.
In additional implementations, the machine-learning model manager 408 trains the machine-learning model according to a “bilingual” variant. In the “bilingual” variant, the machine-learning model manager 408 trains the machine-learning model alternately across all of the secondary languages during the individual steps of the same training run. For example, the machine-learning model manager 408 alternately trains one or more machine-learning models for English/Spanish, English/French, and English/Japanese. Over the course of training, the machine-learning model manager 408 continually applies the machine-learning model to the training set until the machine-learning model outputs video representations for each of the language pairs that are positioned close to each other within a representational space. In most examples, the machine-learning model manager 408 trains the machine-learning model cross-modally. As such, the machine-learning model manager 408 continually applies the machine-learning model to the training set until each video representation is positioned close to both the primary and secondary audio representations within the representational space. Thus, for example, the machine-learning model manager 408 continually applies the machine-learning model to the training set until the machine-learning model outputs video representations that are positioned within the 512-dimensional representational space and within a threshold distance from each other.
In additional implementations, the machine-learning model manager 408 trains the machine-learning model according to a “multilingual” variant. In the “multilingual” variant, the machine-learning model manager 408 selects a dubbed or secondary language from the given list (Spanish, French, Japanese) for each batch. To do so, the machine-learning model manager 408 randomizes the order of the samples and then cycles through the language list round-robin.
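As a small illustration, such a per-batch secondary-language schedule might be implemented as follows; the language codes and batch count are placeholders.

```python
import random

def language_schedule(languages=("es", "fr", "ja"), num_batches=1000):
    """Yield a secondary (dub) language per batch: shuffle once, then cycle round-robin."""
    order = list(languages)
    random.shuffle(order)
    for batch_idx in range(num_batches):
        yield order[batch_idx % len(order)]
```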
In additional implementations, the machine-learning model manager 408 trains the machine-learning model according to a “no-speech” variant. In the “no-speech” variant, the machine-learning model manager 408 trains the machine-learning model on English audio with the speech removed. For example, the training set manager 406 generates a training set with vocal sounds separated out of the primary language audio tracks corresponding to the training video tracks.
In additional implementations, the machine-learning model manager 408 trains the machine-learning model according to an “audio-only” variant. In the “audio-only” variant, the machine-learning model manager 408 trains two audio-only (e.g., no video) models. The machine-learning model manager 408 utilizes training data similar to that used in connection with the “pseudo-dub” and “multilingual” variants, except without any video tracks. As such, the machine-learning model manager 408 applies the objective function within a single modality (i.e., between the two audio clips). The “audio-only” variant represents standard audio-based contrastive training with two augmented copies.
In one or more implementations, the machine-learning model manager 408 trains one or more machine-learning models according to any of the variants discussed above. In some implementations, the machine-learning model manager 408 compares the outputs of machine-learning models trained according to any of these variants to determine optimal training strategies.
As mentioned above, and as shown in
Moreover, as shown in
As shown in
Additionally, in most examples, the server(s) 402 includes a memory. In one or more implementations, the memory generally represents any type or form of volatile or non-volatile storage device or medium capable of storing data and/or computer-readable instructions. In one example, the memory stores, loads, and/or maintains one or more of the components of the language-invariant audiovisual system 200. Examples of the memory can include, without limitation, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Hard Disk Drives (HDDs), Solid-State Drives (SSDs), optical disk drives, caches, variations or combinations of one or more of the same, and/or any other suitable storage memory.
As mentioned above,
As illustrated in
Additionally, as illustrated in
Furthermore, as illustrated in
In summary, the language-invariant audiovisual system 200 effectively augments contrastive training of a machine-learning model using dubbed language audio tracks to help the machine-learning model learn to de-emphasize speech audio when generating representations of audiovisual clips. As discussed above, the language-invariant audiovisual system 200 generates and utilizes training data taken from long-form video to teach the machine-learning model to recognize that representations of similar looking scenes with semantically dissimilar speech audio should still be positioned near each other in a representational space. Utilizing this dubbed audio strategy, the language-invariant audiovisual system 200 encourages the machine-learning model to discover deeper audiovisual alignments beyond spoken words. As such, the trained machine-learning model is suitable for use in a diverse range of audio, visual, and multimodal recognition tasks—particularly in connection with long-form video such as movies and longer TV episodes.
Example 1: A computer-implemented method for training a machine-learning model to accurately generate representations of similar scenes from long-form videos that have semantically dissimilar speech audio. For example, the method may include generating a training set including, from a long-form video, a video track clip, a primary language audio track corresponding to the video track clip, and a dubbed language audio track corresponding to the video track clip, applying the machine-learning model to the training set to generate a first video representation of the video track clip paired with the primary language audio track and a second video representation of the video track clip paired with the dubbed language audio track, and continually applying the machine-learning model to the training set until the first video representation and the second video representation are positioned within a threshold distance from each other within a representational space.
Example 2: The computer-implemented method of Example 1, wherein the training set further includes, from the long-form video, additional video track clips, primary language audio tracks corresponding to the additional video track clips, and dubbed language audio tracks corresponding to the additional video track clips.
Example 3: The computer-implemented method of any of Examples 1 and 2, further including applying the machine-learning model to the training set to generate video representations of the additional video track clips paired with the primary language audio tracks corresponding to the additional video track clips, and video representations of the additional video track clips paired with the dubbed language audio tracks corresponding to the additional video track clips.
Example 4: The computer-implemented method of any of Examples 1-3, further including continually applying the machine-learning model to the training set until the video representations of the additional video track clips paired with the primary language audio tracks corresponding to the additional video track clips and the video representations of the additional video track clips paired with the dubbed language audio tracks corresponding to the additional video track clips are positioned within the threshold distance from each other within the representational space.
Example 5: The computer-implemented method of any of Examples 1-4, wherein the dubbed language audio track corresponding to the video track clip is in one of Spanish, French, or Japanese.
Example 6: The computer-implemented method of any of Examples 1-5, wherein the machine-learning model includes convolutional neural network encoders and transformer models that are specialized for processing videos and audio spectrograms.
Example 7: The computer-implemented method of any of Examples 1-6, wherein the first video representation and the second video representation include 1024-dimensional vectors output by the convolutional neural network encoders.
Example 8: The computer-implemented method of any of Examples 1-7, wherein the machine-learning model further includes multi-layer perceptron heads that project the first video representation and the second video representation into the representational space.
Example 9: The computer-implemented method of any of Examples 1-8, wherein the representational space includes a 512-dimensional space.
Example 10: The computer-implemented method of any of Examples 1-9, further including applying the machine-learning model to a new long-form video for one or more of audiovisual scene classification, emotion recognition, action recognition, or speech keyword recognition.
In some examples, a system may include at least one processor and a physical memory including computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to perform various acts. For example, the computer-executable instructions may cause the at least one processor to perform acts including generating a training set including, from a long-form video, a video track clip, a primary language audio track corresponding to the video track clip, and a dubbed language audio track corresponding to the video track clip, applying a machine-learning model to the training set to generate a first video representation of the video track clip paired with the primary language audio track and a second video representation of the video track clip paired with the dubbed language audio track, and continually applying the machine-learning model to the training set until the first video representation and the second video representation are positioned within a threshold distance from each other within a representational space.
Additionally in some examples, a non-transitory computer-readable medium can include one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to perform various acts. For example, the one or more computer-executable instructions may cause the computing device to generate a training set including, from a long-form video, a video track clip, a primary language audio track corresponding to the video track clip, a first dubbed language audio track corresponding to the video track clip, and a second dubbed language audio track corresponding to the video track clip, apply a machine-learning model to the training set to generate a first video representation of the video track clip paired with the primary language audio track, a second video representation of the video track clip paired with the first dubbed language audio track, and a third video representation of the video track clip paired with the second dubbed language audio track, and continually apply the machine-learning model to the training set until the first video representation, the second video representation, and the third video representation are positioned within a threshold distance from each other within a representational space.
Unless otherwise noted, the terms “connected to” and “coupled to” (and their derivatives), as used in the specification and claims, are to be construed as permitting both direct and indirect (i.e., via other elements or components) connection. In addition, the terms “a” or “an,” as used in the specification and claims, are to be construed as meaning “at least one of.” Finally, for ease of use, the terms “including” and “having” (and their derivatives), as used in the specification and claims, are interchangeable with and have the same meaning as the word “comprising.”
This application claims the benefit of U.S. Provisional Application No. 63/424,377, filed Nov. 10, 2022, the disclosure of which is incorporated in its entirety, by this reference.