Machine learning (ML) is increasingly applied to a wide variety of computational tasks (e.g., machine vision-related tasks). Many applications of ML involve training a model (e.g., often implemented by a neural network (NN) architecture) to generate a relatively-low dimensional vector representation of relatively-high dimensional input data (e.g., image data). For example, NNs (e.g., 2D convolutional neural networks (CNNs)) have been trained to encode static images (e.g., image data comprising around one megapixel that encodes an image or a single frame of video content) as vectors with dimensionality ranging from hundreds to thousands of components. Once embedded within a vector space, the vector representations of images may be employed to perform various machine vision tasks, such as but not limited to image classification, by comparing the vector with other vectors representing other content in the vector space. In many instances, a CNN may be trained via supervised-learning (SL) techniques that require hundreds or even thousands of examples of manually-labeled images (e.g., hand-labeled training datasets).
To reduce the burden of generating labeled training datasets, in recent years, self-supervised learning (SSL) techniques, such as “contrastive” learning techniques, have been devised to train CNNs to generate vector representations of images. In contrastive-learning scenarios, rather than hand-labeled training data, a CNN may be presented with multiple “versions” of the same “seed” image. That is, a set of input transformations is employed to generate variance in multiple instances of the same “seed” image. Thus, a varied training dataset may be automatically generated from multiple “seed” images, where the training dataset includes multiple (but varied via different input transformations) instances of the same “seed” image. Thus, each “seed” image may serve as a labeled “class” of training images. The CNN is trained (via a loss function that is targeted to decrease one or more “errors”) to generate vector representations of the multiple versions of the same “seed” image, such that the corresponding vector representations are confined to a relatively small (and simply connected) region (e.g., a sub-space) of “similarity” in the vector space (e.g., the model's manifold).
For example, to generate a varied training dataset, various transforms (e.g., image croppings, shifts, boosts, rotations, reflections (e.g., point inversion), color jittering, and the like) may be applied to each image in a “seed” image dataset. Training images generated by the same “seed” image (but with different croppings, reflections, color jittering, and the like) may be automatically labeled with the same label. Training images generated by different “seed” images may be labeled with different labels. A CNN may be trained to generate “similar” vectors for training images generated from the same “seed” image (e.g., different transforms applied to the same “seed” image), and “dissimilar” vectors for training images generated from different “seed” images.
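Purely for illustration, the following is a minimal sketch of how such a varied training dataset may be generated from a single “seed” image, assuming the torchvision library is available; the specific transforms, parameters, and the helper name make_views are illustrative assumptions rather than a required configuration.

    from torchvision import transforms

    # Assumed (illustrative) set of non-temporal input transformations.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),            # random cropping
        transforms.RandomHorizontalFlip(p=0.5),       # reflection
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),   # color jittering
        transforms.ToTensor(),
    ])

    def make_views(seed_image, num_views=2):
        # Each call applies a different random transform, so the views vary while
        # sharing the same automatic label: the index of the "seed" image.
        return [augment(seed_image) for _ in range(num_views)]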
It may be said that such contrastive learning generates models that are “invariant” to the class of transforms that are employed to generate the variance in the training datasets. That is, the trained models generate similar (e.g., approximately invariant) vectors for two input static images that were generated from different transformations (e.g., of the set of input transformations that were employed to generate variance in the training dataset) applied to the same “seed” image. However, such contrastive learning methods, which generate invariant models, may not be robust enough to generate meaningful vector representations for content that varies temporally. For example, forcing a model to be invariant to temporal transformations (e.g., time shifts) in video content (e.g., an ordered sequence of static images) may not generate vector representations that adequately capture the dynamics (e.g., action-oriented sequences) inherent to video content.
The technology described herein is directed towards methods, systems, and architecture for training and employing models for generating vector representations of content. The content may vary in one or more domains. The models may be equivariant to transformations in one or more of the domains. As discussed throughout, equivariance is a generalization of the concept of invariance. However, in contrast to invariance, the vector representations generated by the equivariant models discussed herein may be robust enough to encode a greater dynamical-range of variance within content. The embodiments may be directed towards self-supervised learning (SSL) methods that train representational models to be equivariant to one or more domains, such as but not limited to a temporal domain. For example, the content may be temporally-varying content, such as but not limited to video content. The models may be equivariant to temporal transformations of the content and invariant to non-temporal transformations of the content. The models may be trained via enhanced SSL methods, such as but not limited to contrastive learning methods. The training may generate a model that is equivariant to a temporal domain, and invariant to one or more non-temporal domains, such as one or more spatial domains. The equivariance in the models is not limited to a temporal domain, and the training methods disclosed herein may be generalized to generate equivariance to non-temporal domains. For example, the methods may be employed to generate equivariance to spatial transformations applied to the content.
In one embodiment, a training method includes receiving a set of temporally-varying training content. The method may be an iterative method, where a single iteration of the method includes generating a set of training content pairs. Each pair of the set of training content pairs may include two separate training contents from the set of training content. Each training content of the set of training content may be included in a single training content pair. A separate pair of temporal transformations (from a set of temporal transformations) may be associated with each training content pair of the set of training content pairs. A separate pair of non-temporal transformations (from a set of non-temporal transformations) may be associated with each training content of the set of training content. For each iteration of the method, the pairings of the training content, the associating of the pairs of temporal transformations with the content pairs, and the associating of the non-temporal transformations with the training contents may be subject to a random and/or stochastic process and may vary during each iteration. That is, the pairing of the training content, as well as the associating of temporal and non-temporal transformations, may be (pseudo-)randomly re-shuffled for each iteration of the method, until the model converges to a stable model.
During an iteration of the method, and for each training content pair of the set of training content pairs, two versions of each training content are generated based on an application of the associated pair of temporal transformations. For each training content pair, each of the two versions of each of the training contents is updated based on an application of the associated non-temporal transformation. For each training content pair, a vector representation is generated for each updated version of each training content of the content pair based on a representational model. The representational model may be implemented by a 3D convolutional neural network (e.g., a CNN). For each training content pair, a concatenated vector is generated for each training content of the content pair based on a combination (e.g., a concatenation) of the vector representations of the two versions of the training content. For each training content pair, a relative transformation vector may be generated for each training content of the pair based on a multilayer perceptron (MLP) model and the concatenated vector for the training content. The weights of both the representational and MLP models may be updated based on a contrastive learning loss function. The contrastive learning loss function may be employed to “attract” pairs of relative transformation vectors associated with training content included in the same content pair and “repel” pairs of relative transformation vectors associated with training content that are not included in the same content pair. The method may be iterated until the weights of the models converge to stable values.
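To make this iteration concrete, the following is a minimal, non-limiting sketch in Python/PyTorch. The names (training_iteration, contrastive_loss, backbone, mlp_head) and the transform callables are hypothetical placeholders rather than elements of the disclosed architecture, the loss is an assumed InfoNCE-style formulation, and tensor shapes are simplified (each callable is assumed to map a single content to a 1-D feature tensor).

    import random
    import torch
    import torch.nn.functional as F_nn

    def contrastive_loss(rel_vectors, pair_ids, temperature=0.1):
        # Assumed InfoNCE-style loss: relative-transformation vectors that share a
        # pair_id are attracted; all other pairings are repelled.
        z = F_nn.normalize(torch.stack(rel_vectors), dim=-1)            # (N, D)
        sim = (z @ z.t()) / temperature                                 # pairwise similarities
        ids = torch.tensor(pair_ids)
        eye = torch.eye(len(ids), dtype=torch.bool)
        positives = (ids[:, None] == ids[None, :]) & ~eye               # same content pair
        sim = sim.masked_fill(eye, float('-inf'))                       # ignore self-similarity
        log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
        return -log_prob[positives].mean()

    def training_iteration(contents, backbone, mlp_head, optimizer,
                           temporal_transforms, non_temporal_transforms):
        # One iteration of the iterative method described above.
        random.shuffle(contents)                                        # re-shuffle the pairings
        pairs = [(contents[i], contents[i + 1]) for i in range(0, len(contents) - 1, 2)]

        rel_vectors, pair_ids = [], []
        for pair_id, (x_a, x_b) in enumerate(pairs):
            # A pair of temporal transformations is associated with the content pair...
            t1, t2 = random.sample(temporal_transforms, 2)
            for x in (x_a, x_b):
                # ...and a separate pair of non-temporal transformations with each content.
                s1, s2 = random.sample(non_temporal_transforms, 2)
                v1, v2 = s1(t1(x)), s2(t2(x))                           # two transformed versions
                z = torch.cat([backbone(v1), backbone(v2)], dim=-1)     # concatenated vector
                rel_vectors.append(mlp_head(z))                         # relative-transformation vector
                pair_ids.append(pair_id)

        loss = contrastive_loss(rel_vectors, pair_ids)                  # attract same pair, repel others
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss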
The embodiments are directed towards systems and methods that enable equivariant models for generating representations (e.g., vector representations) of temporally-varying content, such as but not limited to video content. Systems and methods are presented for training representational models that are equivariant to at least some class of transformations applied to content. The content may be temporally-varying content (e.g., video content). In such embodiments, the models may be equivariant to temporal transformations (e.g., time shifts). However, the embodiments may not be so limited, and the trained models may be equivariant to non-temporal transformations, such as but not limited to spatial transformations, color transformations, and the like. The trained models may additionally be invariant to other classes of transformations, such as non-temporal transformations (e.g., spatial and/or color-space transformations) applied to the input content. Such representations may be employed in various machine learning tasks, such as but not limited to video retrieval (e.g., video search engine applications) and identification of actions depicted in video content.
Invariance generally refers to a symmetrical property of an operation (e.g., a function that generates a vector representation of an input object), with respect to one or more classes of transformations. A function (e.g., as implemented by a model) is generally a mapping from a domain to a codomain. For an invariant function, when a symmetry group (corresponding to the symmetry class of the invariance) operates on an element of the function's domain, the codomain's corresponding element remains unchanged (e.g., the operation of the symmetry group on the function's domain does not change the element of the codomain that the domain element is mapped to). As used throughout, equivariance may refer to a generalization of the concept of invariance. An equivariant function (with respect to a class of transformations) may be a mapping, where the same symmetry group (corresponding to the relevant transformation class) acts on both the function's domain and codomain. Furthermore, the symmetry group may commute with the equivariant function. That is, applying the transformation to the domain element and then computing the function (determining the corresponding codomain element) is (at least approximately) equivalent to computing the function on the domain element and then applying the transformation to determine the corresponding codomain element.
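To make the distinction concrete, and using notation introduced here only for illustration (F for the mapping, x for a domain element, g for an element of the symmetry group, and g' for the corresponding action on the codomain), the two properties may be summarized as:

    Invariance:    F(g(x)) ≈ F(x)          (applying g to the input leaves the output unchanged)
    Equivariance:  F(g(x)) ≈ g'(F(x))      (applying g to the input corresponds to applying g' to the output; g' commutes with F)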
In contrast to conventional systems that generate vector representations for static images, the various embodiments generate vector representations for content that varies across a temporal domain, e.g., a temporally ordered sequence of images or frames. In the embodiments, a single vector may be generated that represents the video content across multiple frames, whereas in conventional systems, a separate vector may be required to represent each static image (e.g., a frame) of the temporally-varying content. The single vector representing multiple frames of content may encode the dynamic evolution of actions depicted across the multiple frames. Conventional systems that encode a single frame with a vector may fail to encode such temporal dynamics. In addition to encoding dynamical properties of temporally varying content, the embodiments employ contrastive learning (e.g., a form of self-supervised learning) to train the equivariant models. Accordingly, in further contrast to many conventional systems, the various embodiments do not require the (often manual) effort to create labeled training datasets, as many conventional (e.g., supervised learning) training methods require.
Although contrastive learning has been previously employed to generate representational models for static images, these models lack the temporal aspect included in the various embodiments. Furthermore, even if conventional contrastive learning is applied to temporally-varying content, such conventional contrastive learning may fail to encode the dynamical aspects encoded in the input content. Conventional contrastive learning may generate invariant models. That is, the models generated by conventional contrastive learning may be invariant to the non-temporal transformations that are employed to introduce variance into the training dataset. An invariant model would not encode the dynamic progression of action across multiple frames of video content.
The various embodiments employ contrastive learning to generate equivariant models, where the vector representations encode the dynamics of video content. Due to the equivariant nature of the models, the embodiments are enabled to distinguish between different temporal transformations applied to the same content. In contrast, conventional models are invariant to different transformations applied to the same content. The set of temporal transformations applied to the training dataset (e.g., of “seed” video content) may include temporal shifts and/or temporal croppings, as well as changes in playback speed and playback direction. In some embodiments, the equivariant models are enabled to classify a temporal ordering of multiple temporal transformations of the same video content. For example, various embodiments may temporally order two temporally-transformed versions of video content (e.g., separate clips of the same video: clip A and clip B). Such embodiments may determine whether clip A occurs before clip B (or vice-versa), and whether there is a temporal overlap in the two clips.
The various embodiments employ a contrastive approach to training representational models that exhibit equivariance to a set or class of input transformations, such as but not limited to a set of temporal transformations. A function (as implemented by the model) may be said to be equivariant to the set of transformations when each transformation of the set corresponds to an element of a symmetry group associated with both the function's domain and codomain, and the operation (or effect) of the elements of the symmetry group commutes with the (operation of the) function. Thus, an equivariant function may be said to be implemented by an equivariant model. More specifically, a relative temporal transformation between two temporally-transformed versions of a sample of training content is encoded in a feature vector. During training, when the same relative temporal transformation is applied to two separate samples of training content, the resulting feature vectors may be contrasted with each other (e.g., the pair of vectors is associated as a positive-contrastive vector pair). When separate relative temporal transformations are applied to samples of training content, the resulting feature vectors are associated with one another as negative-contrastive vector pairs. During training, the representational model (generating the feature vectors) is updated (e.g., the weights of the model are adjusted), such that the individual vectors of a positive-contrastive pair of vectors are attracted to one another and the individual vectors of a negative-contrastive pair are repelled from one another.
Training the models and the equivariance of the models will now be discussed. In general, let 𝒳 = {x1, x2, x3, . . . , xN} be a finite set of content. In non-limiting embodiments, each element of 𝒳 may be temporally-varying content (e.g., a video clip). The set 𝒳 may be referred to as the training dataset. A representational model (e.g., F) may be trained (via contrastive learning) to map each element of 𝒳 onto a D-dimensional flat (e.g., Euclidean) manifold (e.g., ℝ^D). That is, F(xi) ∈ ℝ^D, for ∀xi ∈ 𝒳. The D-dimensional manifold may be referred to as the feature-vector space and/or the feature space. Accordingly, the mapping (generated by the model F) between the set 𝒳 and the feature space may be referred to as the feature mapping. The representational model may be implemented by one or more neural networks (NNs), such as but not limited to a convolutional neural network (CNN). Because the content includes a temporal dimension (and each frame includes two spatial dimensions), the CNN may be a 3D CNN. Let 𝒯 represent a set of temporal transformations applicable to the elements of 𝒳. If the elements of the set 𝒯 are discrete, the elements of 𝒯 may be ordered and indexed as: τi. If the elements of the set 𝒯 are not discrete, the elements of 𝒯 may be referenced as: τΘ, where Θ ∈ ℝ^D is a set of parameters parameterizing the elements. The notation employed is not meant to imply that the dimensionality of the parameter space is equivalent to the dimensionality of the feature space. Let 𝒮 represent a set of non-temporal transformations applicable to the elements of 𝒳. If the elements of the set 𝒮 are discrete, the elements of 𝒮 may be ordered and indexed as: σi. If the elements of the set 𝒮 are not discrete, the elements of 𝒮 may be referenced as: σΘ, where Θ ∈ ℝ^D is a set of parameters parameterizing the elements.
The various embodiments are employed to train the representational model to be equivariant to the set of temporal transformations applied to the input content. In some embodiments, the model is trained to be invariant to the set of non-temporal transformations applied to the input content. The invariance of F to the set of non-temporal transformations 𝒮 may be expressed as: ∀xi ∈ 𝒳 and ∀σj ∈ 𝒮: F(σj(xi)) ≈ F(xi). The equivariance of F to the set of temporal transformations 𝒯 may be expressed as: ∀xi ∈ 𝒳 and ∀τj ∈ 𝒯: F(τj(xi)) ≈ τ'j(F(xi)), where τ'j is the transformation of the feature space that corresponds to the temporal transformation τj.
Turning now to the figures, in the non-limiting embodiments shown therein, a first input content 110 (e.g., x1) and a second input content 120 (e.g., x2) are illustrated. A first temporal transformation (e.g., τ1) and a second temporal transformation (e.g., τ2) are applied to the first input content 110 to generate a first sub-clip 112 (e.g., τ1(x1)) and a second sub-clip 114 (e.g., τ2(x1)), respectively. The same temporal transformations are applied to the second input content 120 to generate a third sub-clip 122 (e.g., τ1(x2)) and a fourth sub-clip 124 (e.g., τ2(x2)), respectively.
The embodiments are enabled to generate vector representations of each of first input content 110, first sub-clip 112, second sub-clip 114, second input content 120, third sub-clip 122, and fourth sub-clip 124, via the application of the representational model F. The vector representation of the first sub-clip 112 is shown as first vector 116 and may be referenced as F(τ1(x1)). The vector representation of the second sub-clip 114 is shown as second vector 118 and may be referenced as F(τ2(x1)). The vector representation of the third sub-clip 122 is shown as third vector 126 and may be referenced as F(τ1(x2)). The vector representation of the fourth sub-clip 124 is shown as fourth vector 128 and may be referenced as F(τ2(x2)).
Note that due to the equivariance of F, there is a set of feature space transformations (e.g., 𝒯') of the feature space of F, including τ'1, τ'2 ∈ 𝒯', such that F(τ1(x1)) ≈ τ'1(F(x1)) and F(τ2(x1)) ≈ τ'2(F(x1)), and likewise F(τ1(x2)) ≈ τ'1(F(x2)) and F(τ2(x2)) ≈ τ'2(F(x2)).
In various embodiments, the model is trained to recognize and/or detect the mapping (or correspondence) between the temporal transformations of the set of temporal transformations (e.g., 𝒯) and the feature space transformations of the set of feature space transformations (e.g., 𝒯'). More specifically, the training of the model F enables the recognition (or identification) of at least the relative transformations, in each of the temporal space and the feature space. For example, given two temporal transformations of the same video content (e.g., τ1(x1), τ2(x1)), some embodiments may be enabled to recognize and/or identify (as well as classify) the corresponding relative transformations τ1→2 (in the temporal domain) and/or τ'1→2 (in the feature space).
In addition to generating a temporal “crop” of x3 130 and x4 140, τ3 ∈ 𝒯 transforms the playback speed of each of x3 130 and x4 140 from “normal” (e.g., 1×) to twice the playback speed (as indicated by the 2× markings in the figures).
As shown in the figures, the equivariant model is primarily implemented by a set of convolutional neural networks (CNNs) 224 and/or a set of multilayer perceptrons (MLPs 226). Because each frame of each input content may be a 2D array of pixel values, and each frame of each training content represents a separate temporal slice of the training content, each CNN of the set of CNNs 224 may be a 3D CNN. In various embodiments, the set of CNNs 224 may include only a single copy of a CNN, and the “fan-out” display illustrated in the figures may represent multiple applications of that single CNN (e.g., with shared weights) to the multiple transformed versions of the training content.
As noted throughout, once trained, the model is equivariant to a set of temporal transformations (e.g., 𝒯) applicable to the elements of the set of input content and invariant to a set of non-temporal transformations (e.g., 𝒮) also applicable to each element of the set of input content. The application of four temporal transformations (e.g., τ1, τ2, τ3, τ4 ∈ 𝒯) of the set of temporal transformations is shown in the figures.
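For illustration only, the following is a minimal sketch of the kinds of temporal transformations discussed (temporal cropping, a playback-speed change, and a playback reversal), assuming video content represented as a PyTorch tensor of shape (T, C, H, W); the function names and parameter values are hypothetical.

    import torch

    def temporal_crop(video, start, length):
        # video: tensor of shape (T, C, H, W); select a contiguous clip of `length` frames.
        return video[start:start + length]

    def change_playback_speed(video, factor):
        # Keep every `factor`-th frame, e.g., factor=2 approximates a 2x playback speed.
        return video[::factor]

    def reverse_playback(video):
        # Reverse the temporal (frame) ordering, i.e., reverse the playback direction.
        return torch.flip(video, dims=[0])

    # Example: two temporally-transformed versions of the same clip x.
    x = torch.randn(64, 3, 112, 112)                  # 64 frames of 112x112 RGB (illustrative shape)
    v1 = temporal_crop(x, start=0, length=16)         # an early 16-frame temporal "crop"
    v2 = change_playback_speed(temporal_crop(x, start=24, length=32), factor=2)  # a later crop at 2x speed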
The models are trained iteratively via contrastive learning methods. As such, training the models includes iteratively updating the weights of the various neural networks (e.g., the set of CNNs 224 and the set of MLPs 226) via backpropagation techniques. The weights are adjusted to iteratively decrease a “loss” or “error” function that is defined on pairs of vectors. The loss functions employed are contrastive loss functions that increase (or decrease) a similarity metric for pairs of vectors generated by the models. Pairs of vectors for which the model updates increase the similarity metric may be referred to as a “positive-contrastive vector pair.” Pairs of vectors for which the model's iterative updates decrease the similarity metric may be referred to as a “negative-contrastive vector pair.” The models are trained via relative temporal transformations (e.g., τ1→2, τ3→4) applied to pairs of input content. As noted throughout, the models are trained to recognize and/or identify the corresponding relative transformations in feature space (e.g., τ'1→2, τ'3→4).
The models are trained on pairs of training contents (e.g., positive-contrastive pairs or negative-contrastive pairs). For some pairs of training contents (e.g., positive-contrastive pairs of content), the same relative temporal transformation is applied to each content of the pair. For other pairs of training contents (e.g., negative-contrastive pairs of content), different relative temporal transformations are applied to each content of the pair. Content pairs with the same applied relative temporal transformation (e.g., τ1→2) are employed to generate positive-contrastive pairs of vectors. Content pairs with different applied relative temporal transformations (e.g., τ1→2 and τ3→4) are employed to generate negative-contrastive pairs of vectors.
More specifically, two separate temporal transformations (e.g., τ1 and τ2) are applied to a first training content (e.g., x1 202) to generate two separate versions (e.g., two temporally-transformed versions) of the first training content. A vector for each version of the first training content is generated (via the representational model F implemented by the set of CNNs 224). The two vectors are combined (via concatenation) to generate a single combined and/or concatenated vector associated with the two versions of the first training content. The single concatenated vector (e.g., concatenated vector 272) represents a transformation in feature space (e.g., τ'1→2) that corresponds to the relative temporal transformation (e.g., τ1→2) between the two versions. The same two temporal transformations are applied to a second training content (e.g., x2 204), and a single combined and/or concatenated vector (e.g., concatenated vector 274) representing the two versions of the second training content is generated. The concatenated vector associated with the two versions of the first content is paired with the concatenated vector associated with the two versions of the second content to generate a positive-contrastive vector pair.
Two other temporal transformations (e.g., τ3 and τ4) are applied to a third training content (e.g., x3 206) and a single combined and/or concatenated vector representation (e.g., concatenated vector 276) of the two versions of the third training content is generated. The concatenated vector associated with the two versions of the first content is paired with the concatenated vector associated with the two versions of the third content to generate a negative-contrastive vector pair. Likewise, the concatenated vector associated with the two versions of the second content is paired with the concatenated vector associated with the two versions of the third content to generate another negative-contrastive vector pair. This pattern of generating multiple versions of content (with additional pairings of additional temporal transformations from the set of temporal transformations) is continued to generate additional positive-contrastive and negative-contrastive pairs of vectors. The set of weights associated with the model is updated to “attract” the positive-contrastive pairs (e.g., increase the similarity metric between the pair) and to “repel” the negative-contrastive pairs (e.g., decrease the similarity metric between the pair) via the contrastive-learning loss function. The details of the training, via architecture 200, will now be discussed.
From the set of training content, including but not limited to x1 202, x2 204, and x3 206, a set of temporally transformed content 210 is generated, via the application of temporal transformations from the set of temporal transformations. In the non-limiting example shown in the figures, the set of temporally transformed content 210 includes two temporally-transformed versions of each training content, e.g., τ1(x1) 212, τ2(x1) 214, τ1(x2) 216, and τ2(x2) 218, as well as two temporally-transformed versions of x3 206 generated via τ3 and τ4.
Each of the temporally transformed content samples may also be non-temporally transformed, to generate a set of temporally- and non-temporally transformed content 230. The set of non-temporally transformed content includes: σ1(τ1(x1)) 232, σ2(τ2(x1)) 234, σ3(τ1(x2)) 236, σ4(τ2(x2)) 238, σ5(τ3(x3)) 240, and σ6(τ4(x3)) 242. Thus, the individual elements of the set of non-temporally transformed content 230 may be referenced by σk(τj(xi)). In some embodiments, the content may not be transformed via the non-temporal transformations. The set of CNNs 224 is employed to implement the representational model (e.g., F), to generate a set of representational vectors 250. The notation F(σk(τj(xi))) may be adopted to represent the vector representations. In some embodiments, the non-temporal transformations may be omitted (in the notation) to simplify the notation. In such embodiments, the non-temporal transformations may be (but need not be) implemented, although their notation is omitted here to simplify the discussion. Thus, the two vector representations of the two versions of x1 202 may be referred to as F(τ1(x1)) 252 and F(τ2(x1)) 254, respectively. The two vector representations of the two versions of x2 204 may be referred to as F(τ1(x2)) 256 and F(τ2(x2)) 258, respectively. The two vector representations of the two versions of x3 206 may be referred to as F(τ3(x3)) 260 and F(τ4(x3)) 262, respectively.
A set of concatenated vectors 270 may be generated by concatenating the vector representations of the two versions of each of the training contents. Thus, a concatenated vector may be generated for each of the input training contents. When generating the concatenated vectors, the column vectors may be transposed into row vectors and referenced as: [F(τp(xi))T, F(τq(xi))T], where the indexes p and q refer to the corresponding temporal transformations. In some embodiments, two additional indexes may be introduced to notate the corresponding non-temporal transformations. In embodiments that omit the non-temporal indexes, the concatenated vector for x1 202 may be referenced as [F(τ1(x1))T, F(τ2(x1))T] 272, the concatenated vector for x2 204 may be referenced as [F(τ1(x2))T, F(τ2(x2))T] 274, and the concatenated vector for x3 206 may be referenced as [F(τ3(x3))T, F(τ4(x3))T] 276. Note that these vectors correspond to a transformation in feature space, which corresponds to a relative temporal transformation. For example, each of the vectors [F(τ1(x1))T, F(τ2(x1))T] 272 and [F(τ1(x2))T, F(τ2(x2))T] 274 corresponds to the feature space transformation τ'1→2 (which corresponds to the relative temporal transformation τ1→2), while the vector [F(τ3(x3))T, F(τ4(x3))T] 276 corresponds to the feature space transformation τ'3→4 (which corresponds to the relative temporal transformation τ3→4).
The set of MLPs 226 is trained (via this training method) to classify the concatenated vectors, based on the transformations in feature space that are encoded in the concatenated vectors. The set of MLPs 226 implements a representational model (e.g., ψ) that is trained to correlate the transformations in feature space and the (relative) temporal transformations, which generates the equivariance of the CNN model (e.g., F). Processing the set of concatenated vectors 270 via the set of MLPs 226 generates a set of relative (temporal) transformation vectors 290. Each vector of the set of relative transformation vectors 290 may be referenced as: ψip,q ≡ ψ([F(τp(xi))T, F(τq(xi))T]) ∈ ℝ^D, where the subscript i indexes the input content and each unique combination of the superscript indexes p and q indicates a unique relative transformation in feature space, which correlates to a unique relative temporal transformation of the input content indexed via i. Thus, the relative transformation vector for x1 202 may be referenced as ψ11,2 292, the relative transformation vector for x2 204 may be referenced as ψ21,2 294, and the relative transformation vector for x3 206 may be referenced as ψ33,4 296.
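As a non-limiting sketch of the concatenation and MLP steps described above (in PyTorch), a head such as the following could map the concatenated pair of representations to a relative-transformation vector; the class name RelativeTransformHead and the layer dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class RelativeTransformHead(nn.Module):
        # Maps a concatenated pair of clip representations, [F(tau_p(x))^T, F(tau_q(x))^T],
        # to a relative-transformation vector (e.g., an element of the set 290).
        def __init__(self, feature_dim=512, hidden_dim=512, out_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * feature_dim, hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, out_dim),
            )

        def forward(self, z_p, z_q):
            # z_p, z_q: (batch, feature_dim) representations of the two transformed versions.
            return self.mlp(torch.cat([z_p, z_q], dim=1))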
Because the desired equivariance of the model indicates that the relative transformations are independent of the input content, the model should be trained to increase a similarity metric between relative transformation vectors where the relative (temporal) transformation is the same, e.g., ψip,q ≈ ψjp,q, where i ≠ j. Additionally, the model should be trained to decrease a similarity metric between relative (temporal) transformation vectors where the relative transformation is different, e.g., ψip,q ≠ ψjr,s for (p, q) ≠ (r, s). Thus, a contrastive learning loss function may be defined, where pairs of relative transformation vectors that correspond to the same relative transformation are “attracted” (e.g., as indicated by the similarity metric) and pairs of relative transformation vectors that correspond to different relative transformations are “repelled” (e.g., as indicated by the similarity metric). That is, the contrastive learning loss function may implement “gravity” for positive-contrastive pairs of vectors (e.g., the vector-pair comprising ψ11,2 292 and ψ21,2 294) and “anti-gravity” for negative-contrastive pairs of vectors (e.g., the vector-pair comprising ψ11,2 292 and ψ33,4 296 and the vector-pair comprising ψ21,2 294 and ψ33,4 296). Accordingly, as shown by the labeled solid-arrows in the figures, the positive-contrastive pairs of vectors are attracted to one another, while the negative-contrastive pairs of vectors are repelled from one another.
In various embodiments, the similarity metric between a pair of relative transformation vectors (e.g., general vectors x and y) may be based on a cosine similarity between the two vectors, e.g., cos(x, y) = (x · y)/(‖x‖ ‖y‖), where the “·” indicates the conventional dot product (e.g., the inner product, as defined for vectors with real components) between two vectors. In at least some embodiments, the similarity metric may be based on an exponentiation of the cosine similarity between the two vectors. In various embodiments, the contrastive-learning loss function (e.g., ℒequi, where the subscript equi indicates the equivariance of the model) may be defined in terms of a similarity metric d(x, y) for a vector pair (x, y). In one non-limiting embodiment, the loss function may be defined as a negated expectation value, where the expectation value is summed over the positive-contrastive pairs of vectors and the negative-contrastive pairs of vectors. The negation of the expectation value turns the function into a loss function that is to be decreased, because, in at least one embodiment, the similarity metric between pairs of vectors is negative. In this embodiment, the similarity metric may be defined in terms of the (exponentiated) cosine similarity, a temperature hyper-parameter λ, and a stop-gradient (stopgrad) operation, where the stopgrad notation indicates that gradients are not backpropagated for the second vector (y). The temperature hyper-parameter may be bounded as: 0 < λ ≤ 1.0. In at least one embodiment, λ = 0.1. In each iteration, the gradient of the loss function is backpropagated (for one of the vectors in the positive-contrastive vector pairs) to decrease the loss function.
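Because the closed-form expressions are not reproduced here, the following is a minimal sketch of one realization that is consistent with the description above (an exponentiated cosine similarity with temperature λ and a stop-gradient on the second vector, combined in an InfoNCE-style loss); it is an assumption rather than the precise disclosed loss.

    import torch
    import torch.nn.functional as F_nn

    def similarity(x, y, lam=0.1):
        # Exponentiated cosine similarity with a stop-gradient (detach) applied to the
        # second vector and a temperature hyper-parameter lam in (0, 1]; assumed form.
        return torch.exp(F_nn.cosine_similarity(x, y.detach(), dim=-1) / lam)

    def equivariance_loss(anchor, positive, negatives, lam=0.1):
        # Assumed InfoNCE-style combination: attract the positive-contrastive pair and
        # repel the negative-contrastive pairs; `negatives` is a list of vectors.
        pos = similarity(anchor, positive, lam)
        neg = torch.stack([similarity(anchor, n, lam) for n in negatives]).sum(dim=0)
        return -torch.log(pos / (pos + neg))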
The architecture 200 shown in the figures may be employed to implement the iterative training method discussed above, where a set of training content pairs is generated from the set of training content (e.g., x1 202 and x2 204 may be included in a first training content pair), and pairs of temporal and non-temporal transformations are associated with the training content pairs.
For each training content pair, two versions of each training content are generated based on applying each of the two associated temporal transformations to each of the training contents included in the pair of training content. For example, each of (τ1, τ2) is applied to each of x1 202 and x2 204 to generate τ1(x1) 212, τ2(x1) 214, τ1(x2) 216, and τ2(x2) 218. For each training content pair, the two versions of each training content may be updated based on an application of the non-temporal transformation associated with each of the training contents included in the training content pair. For instance, the associated non-temporal transformations are applied to each of the temporally-transformed versions to generate σ1(τ1(x1)) 232, σ2(τ2(x1)) 234, σ3(τ1(x2)) 236, σ4(τ2(x2)) 238, σ5(τ3(x3)) 240, and σ6(τ4(x3)) 242.
For each training content pair, a vector representation is generated for each updated version of each training content included in the pair of training content based on a representational model (e.g., F as implemented by the set of CNNs 224). For instance, the vector representations F(σ1(τ1(x1))) 252, F(σ2(τ2(x1))) 254, F(σ3(τ1(x2))) 256, F(σ4(τ2(x2))) 258, F(σ5(τ3(x3))) 260, and F(σ6(τ4(x3))) 262 may be generated. For each training content pair, a concatenated vector is generated for each training content based on a combination (e.g., a concatenation) of the vector representations. For instance, for x1 202, the concatenated vector [F(τ1(x1))T, F(τ2(x1))T] 272 may be generated. For x2 204, the concatenated vector [F(τ1(x2))T, F(τ2(x2))T] 274 may be generated. For x3 206, the concatenated vector [F(τ3(x3))T, F(τ4(x3))T] 276 may be generated.
For each training content pair, a relative transformation vector may be generated (via the set of MLPs 226 that implements the model ψ) for each training content. For instance, for x1 202, the relative transformation vector ψ11,2 292 may be generated. For x2 204, the relative transformation vector ψ21,2 294 may be generated. For x3 206, the relative transformation vector ψ33,4 296 may be generated. The weights of the models (e.g., F and ψ) may be updated based on the above contrastive-learning loss function and the similarity metric. When updating the weights, the contrastive-learning loss function attracts pairs of relative transformation vectors included in a common (or same) training content pair and repels pairs of relative transformation vectors that are included in separate training content pairs. The training process may be iterated until the models converge. That is, the training may continue until the values of the weights of F and ψ converge to stable values. During each iteration, the pairing of the training content, as well as the associations of the temporal and non-temporal transformations to the training content, may be subject to a random and/or stochastic process.
In some embodiments, auxiliary training may be performed on the model F. A classifier model (implemented by a NN not shown in the figures) may be trained to learn (e.g., classify) the playback speed associated with a temporal transformation. The playback speed classifier may be trained jointly with F, via self-supervised methods. For example, via the representational model F, the playback speed classifier NN may classify τ3(x3) 132 and τ3(x4) 142 (and thus τ3) as being associated with a 2× playback speed.
Another classifier NN model (e.g., a “playback direction classifier”) may be trained to learn the playback direction associated with a temporal transformation. Again, the playback direction NN classifier may be trained jointly with F, via self-supervised methods. For example, via the representational model F, the playback direction classifier NN may classify τ3(x3) 132 and τ3(x4) 142 (and thus τ3) as being associated with a “forward” playback direction. The playback direction classifier may associate τ5(x3) 136 of the figures (and thus τ5) with a “reverse” playback direction.
In some embodiments, still another classifier model may be jointly trained with the representational model to determine a temporal ordering of pairs of “temporally shifted” clips, e.g., (τ1(x1), τ2(x1)) of the figures. For example, such a classifier may determine whether the first clip occurs before the second clip, whether the second clip occurs before the first clip, or whether the two clips temporally overlap.
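For illustration, the auxiliary classifiers described above (playback speed, playback direction, and temporal ordering of a clip pair) might be implemented as lightweight heads on top of the representation F(x), as in the following PyTorch sketch; the class name, layer sizes, and class counts are illustrative assumptions.

    import torch
    import torch.nn as nn

    class AuxiliaryHeads(nn.Module):
        # Classifier heads applied to representations produced by F and trained jointly
        # with F via self-supervised (automatically derived) labels.
        def __init__(self, feature_dim=512, num_speeds=4):
            super().__init__()
            self.speed = nn.Linear(feature_dim, num_speeds)    # e.g., 1x, 2x, 4x, 8x (assumed classes)
            self.direction = nn.Linear(feature_dim, 2)         # forward vs. reverse playback
            # Temporal ordering operates on a pair of clip representations, hence 2 * feature_dim:
            # clip A before clip B, clip B before clip A, or temporally overlapping.
            self.ordering = nn.Linear(2 * feature_dim, 3)

        def forward(self, z, z_pair=None):
            out = {"speed": self.speed(z), "direction": self.direction(z)}
            if z_pair is not None:
                out["ordering"] = self.ordering(torch.cat([z, z_pair], dim=1))
            return out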
More specifically, for each sample of training content, a pair of temporal transformations (e.g., from 𝒯) and a pair of separate non-temporal transformations (e.g., from 𝒮) may be selected and/or sampled. That is, a temporal transformation (e.g., τp ∈ 𝒯) and a non-temporal transformation (e.g., σq ∈ 𝒮) are applied to the associated content to generate each instance of transformed content 310, e.g., x̃ip,q = σq∘τp(xi) = σq(τp(xi)). Due to the invariance of the model F to non-temporal transformations, the notation may be simplified by omitting the non-temporal transformation index (e.g., q): x̃ip = σq∘τp(xi) = σq(τp(xi)). Accordingly, two (temporally- and non-temporally-transformed) versions of each sample of training content may be generated. As shown in the figures, the set of transformed content 310 includes two such transformed versions of each sample of training content.
The representational vectors generated by the set of CNNs are provided to a set of multilayer perceptrons (MLPs 326), which implement another representational model (e.g., ϕ). The MLP model ϕ may be a separate MLP model from the MLP model (e.g., ψ) implemented by the set of MLPs 226 of architecture 200.
Because the goal is to learn instance discrimination, the training should attract pairs of vectors of the same instances (e.g., those vectors representing the same content that have been similarly or dissimilarly transformed in the temporal domain) and repel pairs of vectors that are of different content and/or different temporal transformations. Thus, as shown in the figures, the pair of vectors representing the two versions of the same training content (e.g., the vector pair comprising ϕ11,1 332 and ϕ11,2 334) may be treated as a positive-contrastive vector pair, while pairs of vectors representing versions of different training contents (e.g., the vector pair comprising ϕ11,1 332 and ϕ33,5 340) may be treated as negative-contrastive vector pairs.
One non-limiting example of a contrastive-learning loss function that accomplishes such a goal is a loss function (e.g., ℒinst) defined in terms of the similarity metric d(x, y) (e.g., a cosine similarity) between a vector pair (x, y), as used above, where the inst subscript indicates that this contrastive-learning loss function is used for instance contrastive learning, as shown in architecture 300.
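The closed form of the instance-discrimination loss is likewise not reproduced here; the following is a minimal sketch of a standard NT-Xent/InfoNCE-style formulation consistent with the description (attract the two views of the same content and repel views of different contents), offered as an assumption rather than the precise disclosed loss.

    import torch
    import torch.nn.functional as F_nn

    def instance_loss(phi_a, phi_b, temperature=0.1):
        # phi_a, phi_b: (N, D) projected representations of two transformed views of N
        # training contents, where row i of phi_a and row i of phi_b come from x_i.
        z = F_nn.normalize(torch.cat([phi_a, phi_b], dim=0), dim=1)     # (2N, D)
        sim = (z @ z.t()) / temperature                                 # pairwise similarities
        n = phi_a.shape[0]
        sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self-pairs
        # The positive for row i is the other view of the same content.
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F_nn.cross_entropy(sim, targets)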
The architecture 300 shown in the figures may be employed to implement a related, iterative instance-discrimination training method, as follows.
Two versions of each training content may be generated based on an application of the associated pair of temporal transformations and an application of the associated pair of non-temporal transformations to the training content. For example, a first version (x1,1 312) of x1 302 is generated via σ1∘τ1(x1) and a second version (x1,2 314) of x1 302 is generated via σ2∘τ2(x1). A first version (x2,1 316) of x2 304 is generated via σ3∘τ1(x2) and a second version (x2,2 318) of x2 304 is generated via σ4∘τ2(x2). A first version (x3,1 320) of x3 306 is generated via σ5∘τ3(x3) and a second version (x3,2 322) of x3 306 is generated via σ6∘τ4(x3).
A vector representation may be generated for each version of each training content based on a representational model (e.g., F). A set of 3D CNNs (e.g., CNNs 224 and/or CNNs 334) may implement the representational model. For instance, each of F(x1,1), F(x1,2), F(x2,1), F(x2,2), F(x3,1), and F(x3,2) may be generated. An updated vector representation of each version of each training content may be generated based on an MLP model. The MLP model (ϕ) may be implemented by a set of MLPs, including but not limited to the set of MLPs 326. For instance, ϕ11,1 332, ϕ11,2 334, ϕ22,3 336, ϕ22,4 338, ϕ33,5 340, and ϕ33,6 342 may be generated. Note this MLP model (ϕ) may be separate from the MLP model (ψ) of architecture 200 described above.
The weights of the models (F and ϕ) may be updated based on a contrastive learning loss function that attracts pairs of vector representations for the same training content and repels pairs of vector representations for separate training contents. Such a contrastive learning loss function is discussed above. The training process may be iterated until the models converge. That is, the training may continue until the values of the weights of F and ϕ converge to stable values. During each iteration, the pairing of the training content, as well as the associations of the temporal and non-temporal transformations with the training content, may be subject to a random and/or stochastic process.
Processes 400-600 of the figures will now be discussed. Process 500 may begin by receiving a set of training content, generating a set of training content pairs (e.g., at block 504), and associating temporal and non-temporal transformations with the training content pairs, as discussed above.
At block 510, for each training content pair, two versions of each training content may be generated based on an application of the associated pair of temporal transformations to the training content in the training content pair. At block 512, for each training content pair, the two versions of each training content are updated based on an application of the associated non-temporal transformation. At block 514, for each training content pair, a vector representation of each updated version of each training content may be generated. The vector representation for each updated version of the training content may be generated based on a representational model (e.g., F). At block 516, for each training content pair, a concatenated vector is generated for each training content of the pair. The concatenated vector may be based on a combination (e.g., a concatenation) of the vector representations of each training content. At block 518, for each training content pair, a relative transformation vector may be generated for each training content based on an MLP model and the concatenated vector for the training content.
At block 520, the weights of the representational model and the MLP model may be updated based on a contrastive learning loss function. The contrastive learning loss function may attract pairs of relative transformation vectors that are included in a common training content pair and repel pairs of relative transformation vectors that are in separate training content pairs. At decision block 522, it is determined whether the models (e.g., the weights of the models) have converged. If the models have converged, the training of the models may be terminated. If the models have not converged, then method 500 may return to block 504 to begin another iteration of method 500. In at least one embodiment, process 500 may flow to process 600 such that the representational model is jointly trained by method 500 and method 600.
At block 604, a set of training content pairs (from the set of training content) may be generated. Each training content of the set of training content may be included in a single training content pair. The pairing of the training content may be randomized for each iteration of process 600. At block 606, a separate pair of temporal transformations (from a set of temporal transformations) is associated with each training content pair. The associations of the pair of temporal transformations with each training content pair may be randomized for each iteration of process 600. At block 608, a separate pair of non-temporal transformations (from a set of non-temporal transformations) is associated with each training content of each training content pair. The associations of the pair of non-temporal transformations with each training content may be randomized for each iteration of process 600.
At block 610, for each training content pair, two versions of each training content may be generated based on an application of the associated pair of temporal transformations to the training content in the training content pair. At block 612, for each training content pair, the two versions of each training content are updated based on an application of the associated non-temporal transformation. At block 614, for each training content pair, a vector representation of each updated version of each training content may be generated. The vector representation for each updated version of the training content may be generated based on a representational model (e.g., F, the same model applied at block 514 of method 500). Note that methods 500 and 600 are similar up to blocks 514 and 614, respectively. Thus, in some embodiments, a single method may be implemented that executes these blocks and forks at block 514/614. At the fork, method 500 may continue with block 516 and method 600 may continue with block 616. Note that methods 500 and 600 employ separate MLP models and separate loss functions.
At block 616, the vector representation of each version of each training content is updated based on an MLP model. Note that the representational model implemented by methods 500 and 600 may be a common representational model (at blocks 514 and 614, respectively), while the MLP models (blocks 516 and 616, respectively) implemented in methods 500 and 600 may be separate MLP models. The weights of the representational and MLP models may then be updated based on a contrastive learning loss function. The contrastive learning loss function may attract pairs of vector representations for the same training content and repel pairs of vector representations for separate training content. At decision block 618, it is determined whether the models (e.g., the weights of the models) have converged. If the models have converged, the training of the models may be terminated. If the models have not converged, then method 600 may return to block 604 to begin another iteration of method 600. In at least one embodiment, process 600 may flow to process 500 such that the representational model is jointly trained by method 500 and method 600.
Other embodiments include a method that includes receiving source content that varies across a temporal domain. A vector representation of at least a portion of the source content may be generated. Generating the vector representation of the source content may be based on employing a model that is equivariant with respect to a set of temporal transformations. The set of temporal transformations may be applicable to the source content across the temporal domain. Based on the representation of the source content, at least one of target content associated with the source content or an action associated with the source content may be identified.
In some embodiments, the model may be a representational model that is trained according to another method that includes applying a first temporal transformation of the set of temporal transformations to first training content (e.g., of a set of training content) that varies across the temporal domain. A first vector may be generated based on the model. The first vector may be a vector representation of the temporally-transformed first training content. The first temporal transformation may be applied to second training content (of the set of training content) that varies across the temporal domain. A second vector may be generated based on the model. The second vector may be a vector representation of the temporally-transformed second training content. A set of weights of the model may be adjusted and/or updated based on the first vector and the second vector. Adjusting the set of weights of the model may increase a first similarity metric (e.g., a cosine similarity metric) that is associated with the first vector and the second vector, when the first and second vectors are re-generated based on the model with the adjusted set of weights.
In some embodiments, the training method further includes applying a third temporal transformation of the set of temporal transformations to third training content that varies across the temporal domain. A third vector may be generated based on the model. The third vector may be a vector representation of the temporally-transformed third training content. The set of weights of the model may be adjusted based on the first vector and the third vector. Adjusting the set of weights decreases a second similarity metric that is associated with the first vector and the third vector, when the first and third vectors are re-generated based on the model with the adjusted set of weights.
The training method may further include applying a first non-temporal transformation of a set of non-temporal transformations to the temporally-transformed first training content to generate temporally- and non-temporally-transformed first training content. The first vector may be generated based on the temporally- and non-temporally-transformed first training content. The first vector may be a vector representation of the temporally- and non-temporally-transformed first training content. A second non-temporal transformation of the set of non-temporal transformations may be applied to the temporally-transformed second training content to generate temporally- and non-temporally-transformed second training content. The second vector may be generated based on the temporally- and non-temporally-transformed second training content. The second vector may be a vector representation of the temporally- and non-temporally-transformed second training content.
The first training content may be or may include first video content. The first temporal transformation may be associated with a first playback speed (of the first video content). The second training content may be or may include second video content. The second temporal transformation may be associated with a second playback speed (of the second video content). The training method may include training a classifier model to classify the first vector as being associated with the first playback speed and to classify the second vector as being associated with the second playback speed.
In another embodiment, the first temporal transformation may be associated with a forward playback direction (of the first video content). The second temporal transformation may be associated with a reverse playback direction (of the second video content). A classifier model may be trained to classify the first vector as being associated with the forward playback direction and to classify the second vector as being associated with the reverse playback direction.
In still another embodiment, the first temporal transformation may be associated with a first temporal clipping (of the first video content). The second temporal transformation may be associated with a second temporal clipping (of the second video content). A classifier model may be trained to classify a concatenation of the first vector and the second vector as being one of a first non-overlapping temporal ordering of the first and second temporal clippings, a second non-overlapping temporal ordering of the first and second temporal clippings, or an overlapping temporal ordering of the first and second temporal clippings.
Another training method of a model may include accessing first content and second content that both vary across a temporal domain and a non-temporal domain of the contents. A first transformed version of the first content may be generated by applying a first temporal transformation, of a set of temporal transformations, to the first content. A first transformed version of the second content may be generated by applying the first temporal transformation to the second content. A first vector may be generated based on the model (under training) and the first transformed version of the first content. A second vector may be generated based on the model and the first transformed version of the second content. The first and second vectors may be associated with each other as a positive-contrastive vector pair. The (weights of the) model may be updated and/or adjusted. The model may be updated based on a contrastive-learning loss function such that a first similarity metric associated with the positive-contrastive vector pair is increased and the updated model is equivariant to the first temporal transformation.
The method may further include accessing third content that varies across the temporal domain and the non-temporal domain of the contents. A first transformed version of the third content may be generated by applying a second temporal transformation, of the set of temporal transformations, to the third content. A third vector may be generated based on the model and the first transformed version of the third content. The first and third vectors may be associated with each other as a negative-contrastive vector pair. The model may be updated based on the contrastive learning loss function such that a second similarity metric associated with the negative-contrastive vector pair is decreased and the updated model is equivariant to the second temporal transformation.
The training method may further include generating a second transformed version of the first content by applying a second temporal transformation, of the set of temporal transformations, to the first content. A second transformed version of the second content may be generated by applying the second temporal transformation to the second content. A third vector may be generated based on the model and the first transformed version of the first content. A fourth vector may be generated based on the model and the second transformed version of the first content. A fifth vector may be generated based on the model and the first transformed version of the second content. A sixth vector may be generated based on the model and the second transformed version of the second content. The first vector may be based on a concatenation of the third vector and the fourth vector. The second vector may be generated based on a concatenation of the fifth vector and the sixth vector. Updating the model may be such that the updated model is equivariant to the second temporal transformation.
In some embodiments, the training method includes generating a third transformed version of the first content by applying a first non-temporal transformation, of a set of non-temporal transformations, to the first transformed version of the first content. A third transformed version of the second content may be generated by applying a second non-temporal transformation, of the set of non-temporal transformations, to the first transformed version of the second content. The third vector may be generated based on the model and the third transformed version of the first content. The fifth vector may be generated based on the model and the third transformed version of the second content. Updating the model may include updating the model such that the updated model is invariant to the first and second non-temporal transformations.
In some embodiments, the training method further includes generating a fourth transformed version of the first content by applying a third non-temporal transformation, of the set of non-temporal transformations, to the second transformed version of the first content. A fourth transformed version of the second content may be generated by applying a fourth non-temporal transformation, of the set of non-temporal transformations, to the second transformed version of the second content. The fourth vector may be generated based on the model and the fourth transformed version of the first content. The sixth vector may be generated based on the model and the fourth transformed version of the second content. The model may be updated such that the updated model is invariant to the third and fourth non-temporal transformations.
In at least one embodiment, the training method may further include generating a seventh vector based on a concatenation of the third and fourth vectors. An eighth vector may be generated based on a concatenation of the fifth and sixth vectors. The first vector may be generated based on a multilayer perceptron model applied to the seventh vector. The first vector may encode a first relative temporal transformation, based on a combination of the first and second temporal transformations, applied to the first content. The second vector may be generated based on the multilayer perceptron model applied to the eighth vector. The second vector may encode the first relative temporal transformation applied to the second content.
Still another training method includes generating first transformed content by applying a first temporal transformation (e.g., of a set of temporal transformations) to first content (e.g., of a set of training content). Second transformed content may be generated by applying the first temporal transformation to second content (e.g., of the set of training content). Third transformed content may be generated by applying a second temporal transformation (e.g., of the set of temporal transformations) to third content (e.g., of the set of training content). A representation (e.g., a vector representation) of the first content may be generated by applying a model to the first transformed content. A representation of the second content may be generated by applying the model to the second transformed content. A representation of the third content may be generated by applying the model to the third transformed content. The model may be updated and/or adjusted to increase a similarity (e.g., a cosine similarity metric) between the representation of the first content and the representation of the second content and to decrease a similarity between the representation of the first content and the representation of the third content.
The trained model (e.g., the updated model) may be employed by receiving source content. A representation (e.g., a vector representation) of the source content may be generated by applying the updated model to at least a portion of the source content. Based on the representation of the source content, other content that corresponds to the source content may be identified. In other embodiments, the updated and/or trained model may be employed by identifying, based on the representation of the source content, an action depicted in the source content.
The trained (or updated) model may be employed by receiving a temporally transformed version of first source content. A temporally transformed version of second source content may be received. Based on the updated model, the temporally transformed version of the first source content, and the temporally transformed version of second source content, a relative temporal transformation that was applied to each of the first source content and the second source content may be determined. The relative temporal transformation may be associated with at least one of a playback speed or a playback direction.
In some embodiments, the trained (or updated) model may be employed by receiving a first portion of video content. A second portion of the video content may be received. Based on the updated model, a temporal ordering of the first portion of the video content and the second portion of the video content may be determined.
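By way of example, and not limitation, a temporal-ordering decision could be sketched as a binary classification over the two clip embeddings; the `order_head` below and its two-class output are illustrative assumptions.

```python
# Illustrative sketch only: predict whether the first portion of the video
# precedes the second portion.
import torch
import torch.nn as nn

order_head = nn.Linear(2 * 128, 2)  # logits for ("first-then-second", "second-then-first")


@torch.no_grad()
def temporal_order(model, portion_a, portion_b):
    z = torch.cat([model(portion_a), model(portion_b)], dim=-1)
    return "a_before_b" if order_head(z).argmax(dim=-1).item() == 0 else "b_before_a"
```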
Having described embodiments of the present invention, an example operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring to FIG. 7, an example computing device 700 suitable for implementing embodiments of the present invention is described.
Embodiments of the invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a smartphone or other handheld device. Generally, program modules, or engines, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. Embodiments of the invention may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialized computing devices, etc. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With reference to FIG. 7, computing device 700 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 700 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 700. Computer storage media excludes signals per se.
Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 712 includes computer storage media in the form of volatile and/or nonvolatile memory. Memory 712 may be non-transitory memory. As depicted, memory 712 includes instructions 724. Instructions 724, when executed by processor(s) 714, are configured to cause computing device 700 to perform any of the operations described herein with reference to the figures discussed above, or to implement any of the program modules described herein. The memory may be removable, non-removable, or a combination thereof. Illustrative hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 700 includes one or more processors that read data from various entities such as memory 712 or I/O components 720. Presentation component(s) 716 present data indications to a user or other device. Illustrative presentation components include a display device, speaker, printing component, vibrating component, etc.
I/O ports 718 allow computing device 700 to be logically coupled to other devices including I/O components 720, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
The embodiments presented herein are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present disclosure pertains without departing from its scope.
From the foregoing, it will be seen that this disclosure is one well adapted to attain all the ends and objects hereinabove set forth, together with other advantages which are obvious and which are inherent to the structure.
It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features or sub-combinations. This is contemplated by and is within the scope of the claims.
In the preceding detailed description, reference is made to the accompanying drawings which form a part hereof wherein like numerals designate like parts throughout, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the preceding detailed description is not to be taken in a limiting sense, and the scope of embodiments is defined by the appended claims and their equivalents.
Various aspects of the illustrative embodiments have been described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. However, it will be apparent to those skilled in the art that alternate embodiments may be practiced with only some of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative embodiments. However, it will be apparent to one skilled in the art that alternate embodiments may be practiced without the specific details. In other instances, well-known features have been omitted or simplified in order not to obscure the illustrative embodiments.
Various operations have been described as multiple discrete operations, in turn, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation. Further, descriptions of operations as separate operations should not be construed as requiring that the operations be necessarily performed independently and/or by separate entities. Descriptions of entities and/or modules as separate modules should likewise not be construed as requiring that the modules be separate and/or perform separate operations. In various embodiments, illustrated and/or described operations, entities, data, and/or modules may be merged, broken into further sub-parts, and/or omitted.
The phrase “in one embodiment” or “in an embodiment” is used repeatedly. The phrase generally does not refer to the same embodiment; however, it may. The terms “comprising,” “having,” and “including” are synonymous, unless the context dictates otherwise. The phrase “A/B” means “A or B.” The phrase “A and/or B” means “(A), (B), or (A and B).” The phrase “at least one of A, B and C” means “(A), (B), (C), (A and B), (A and C), (B and C) or (A, B and C).”