This application claims priority from Korean Patent Application No. 10-2023-0025792, filed on Feb. 27, 2023 in the Korean Intellectual Property Office, and all the benefits accruing therefrom under 35 U.S.C. § 119, the contents of which in their entirety are herein incorporated by reference.
The present disclosure relates to a method for multimodal embedding and a system therefor, and more particularly, to a method and system for embedding multimodal data using a deep learning model.
Currently, there is a rapid surge of interest in the field of deep learning concerning multimodal tasks, which involve the concurrent handling of multimodal (or multi-modality) data. Moreover, there is an active pursuit of effective methodologies to embed multimodal data.
A method has recently been proposed for training deep learning models using multimodal paired datasets. Specifically, the proposed method employs datasets comprising pairs of images and text to learn embeddings (i.e., embedding representations) of multimodal data through contrastive learning tasks.
However, the proposed method exhibits a noticeable limitation by not preventing deep learning models from generating embeddings while focusing exclusively on distinctive regions within an input image (e.g., regions containing main objects). Consequently, the resultant embeddings lack faithfulness to contextual information within the input image (e.g., embeddings lack information regarding peripheral objects or backgrounds due to their exclusive focus on the main object regions).
Aspects of the present disclosure provide a method and system for accurately embedding multimodal data.
Aspects of the present disclosure also provide a method and system for accurately learning embeddings (i.e., embedding representations) of multimodal data when fine-grained annotation information is not provided.
Aspects of the present disclosure also provide a method and system capable of improving the performance of multimodal tasks.
However, aspects of the present disclosure are not restricted to those set forth herein. The above and other aspects of the present disclosure will become more apparent to one of ordinary skill in the art to which the present disclosure pertains by referencing the detailed description of the present disclosure given below.
According to some embodiments of the present disclosure, there is provided a method for multimodal embedding performed by at least one computing device. The method may include: generating a plurality of patch features for an image sample through an image encoder, wherein the image sample and a text sample form a positive pair; generating a plurality of token features for the text sample through a text encoder; softly masking patch features associated with a specific token of the text sample; generating a joint embedding by inputting the masked patch features and the token features into a multimodal encoder; and updating the multimodal encoder by performing an image-text matching (ITM) task based on the joint embedding.
In some embodiments, the specific token may be randomly selected from among tokens of the text sample.
In some embodiments, the multimodal encoder may include at least one attention layer, which analyzes relationships between the token features and the patch features, and the softly masking the patch features associated with the specific token may include: extracting attention values for a feature of the specific token and the patch features from an attention map generated by the at least one attention layer; generating a soft mask for masking the patch features based on the attention values; and applying the soft mask to the patch features.
In some embodiments, the joint embedding is a first joint embedding, and the extracting the attention values may include: generating a second joint embedding by inputting the token features and the patch features into the multimodal encoder; calculating a matching score between the text sample and the image sample by performing the ITM task based on the second joint embedding; reflecting a gradient, which indicates the influence of the at least one attention map on the matching score, in the at least one attention map; and extracting the attention values from the at least one attention map with the gradient reflected therein.
In some embodiments, the extracting the attention values may include: aggregating a plurality of attention maps, generated in a plurality of attention layers; and extracting the attention values from the aggregated attention map.
In some embodiments, the image encoder and the text encoder may be updated through the ITM task.
In some embodiments, the method may further include: updating the image encoder and the text encoder by performing a contrastive learning task based on at least some of the patch features and at least some of the token features.
In some embodiments, the patch features may include a special patch feature corresponding to a special token, the token features may include a special token feature corresponding to the special token, and a loss of the contrastive learning task may be calculated based on a similarity between the special patch feature and the special token feature.
In some embodiments, the loss of the contrastive learning task may be calculated based on a feature similarity and a focal weight, and the greater the feature similarity, the smaller the focal weight may be determined to be.
In some embodiments, the token features may include a feature corresponding to a mask token, the joint embedding may be a first joint embedding, and the method may further include: generating a second joint embedding, which includes a plurality of embeddings corresponding to tokens of the text sample, by inputting the patch features and the token features into the multimodal encoder; and additionally updating the multimodal encoder by performing a masked-language modeling (MLM) task based on an embedding corresponding to the mask token, among the plurality of embeddings.
In some embodiments, the token features may include a feature corresponding to a mask token and be obtained by substituting the specific token with the mask token.
In some embodiments, the updating the multimodal encoder may include: predicting a matching status between the image sample and the text sample by inputting at least some of the joint embedding into a prediction layer; and updating the multimodal encoder based on a loss from a result of the predicting.
In some embodiments, the joint embedding may include a plurality of embeddings, and among the plurality of embeddings, an embedding corresponding to a special token may be input into the prediction layer.
In some embodiments, the joint embedding may be a first joint embedding, and the method may further include: generating a second joint embedding by inputting the patch features and the token features into the multimodal encoder; and additionally updating the multimodal encoder by performing the ITM task based on the second joint embedding.
According to other embodiments of the present disclosure, there is provided a system for multimodal embedding. The system may include: at least one processor; and a memory configured to store at least one instruction, wherein the at least one processor, by executing the at least one instruction, performs operations including: generating a plurality of patch features for an image sample through an image encoder, wherein the image sample and a text sample form a positive pair; generating a plurality of token features for the text sample through a text encoder; softly masking patch features associated with a specific token of the text sample; generating a joint embedding by inputting the masked patch features and the token features into a multimodal encoder; and updating the multimodal encoder by performing an image-text matching (ITM) task based on the joint embedding.
According to still other embodiments of the present disclosure, there is provided a computer program stored on a computer-readable recording medium that, by being coupled to a computing device, executes steps including: generating a plurality of patch features for an image sample through an image encoder, wherein the image sample and a text sample form a positive pair; generating a plurality of token features for the text sample through a text encoder; softly masking patch features associated with a specific token of the text sample; generating a joint embedding by inputting the masked patch features and the token features into a multimodal encoder; and updating the multimodal encoder by performing an image-text matching (ITM) task based on the joint embedding.
The above and other aspects and features of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
Hereinafter, example embodiments of the present disclosure will be described with reference to the attached drawings. Advantages and features of the present disclosure and methods of accomplishing the same may be understood more readily by reference to the following detailed description of example embodiments and the accompanying drawings. The present disclosure may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the disclosure to those skilled in the art, and the present disclosure will only be defined by the appended claims.
In adding reference numerals to the components of each drawing, it should be noted that the same reference numerals are assigned to the same components as much as possible even though they are shown in different drawings. In addition, in describing the present disclosure, when it is determined that the detailed description of the related well-known configuration or function may obscure the gist of the present disclosure, the detailed description thereof will be omitted.
Unless otherwise defined, all terms used in the present specification (including technical and scientific terms) may be used in a sense that is commonly understood by those skilled in the art. In addition, terms defined in commonly used dictionaries are not to be ideally or excessively interpreted unless they are clearly and specifically defined. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. In this specification, the singular also includes the plural unless specifically stated otherwise.
In addition, in describing the components of this disclosure, terms such as first, second, A, B, (a), and (b) may be used. These terms are only for distinguishing one component from another, and the nature or order of the components is not limited by the terms. If a component is described as being “connected,” “coupled,” or “contacted” to another component, that component may be directly connected to or contacted with that other component, but it should be understood that yet another component may also be “connected,” “coupled,” or “contacted” between the two components.
Embodiments of the present disclosure will hereinafter be described with reference to the accompanying drawings.
As illustrated in
The term “multimodal” denotes an environment that collectively handles multimodal (or multi-modality) data. Different modal data may refer to data with diverse types, forms, characteristics (e.g., statistical attributes), and/or domains. For example, text, images, and audio may be treated as different modal data. In another example, first and second data sets with different statistical attributes may also be treated as different modal data.
Specifically, the embedding system 10 may train the deep learning model 11 using datasets 12 comprising pairs of different modal samples (or data samples). For example, as illustrated in
The term “sample” or “data sample” refers to an individual unit of data that may be input to the deep learning model 11. In the context of the present disclosure, samples or data samples may also be referred to as “examples,” “instances,” “observations,” or “individual data.”
Meanwhile, the embedding system 10 may perform multimodal tasks (i.e., objective tasks) using the trained deep learning model 11. For example, as illustrated in
Alternatively, the embedding system 10 may provide the trained deep learning model 11 to a separate device (not illustrated) for performing multimodal tasks or may receive data such as text or images from the separate device, generate embeddings for the received data (e.g., text embeddings, image embeddings, joint embeddings), and provide the generated embeddings to the separate device.
The term “data embedding” refers to the representation of data within its designated embedding space. Therefore, embeddings may also be referred to as embedding representations. Moreover, since embeddings are typically in vector format, embeddings may also be referred to as embedding vectors. In the context of the present disclosure, embedding vectors may also be referred to as “embedding codes,” “latent representations,” “latent vectors,” or “latent codes.”
The embedding system 10 may be implemented using at least one computing device. For example, all functionalities of the embedding system 10 may be implemented within a single computing device. Alternatively, first and second functionalities of the embedding system 10 may be implemented in first and second computing devices, respectively. Yet alternatively, a particular functionality of the embedding system 10 may be implemented across multiple computing devices.
The term “computing device” may encompass any device equipped with computing capabilities, and an exemplary computing device will be described later with reference to
General descriptions of the operations of the embedding system 10 have been presented, referring to
For the convenience of understanding, it is assumed that all steps/operations of the methods are conducted within the embedding system 10. Therefore, if the subject of a particular step/operation is not explicitly mentioned, it may be inferred that the particular step/operation is performed within the embedding system 10. Nevertheless, in real-world scenarios, some steps/operations of the methods that will hereinafter be described may also be executed in other computing devices.
Referring to
The paired datasets may include positive pairs and/or negative pairs. The positive pairs may refer to pairs where text samples and image samples are matched, while the negative pairs may refer to pairs where text samples and image samples are not matched. In some cases, the negative pairs may be generated during a training process. For example, the embedding system 10 may (randomly) select text samples and image samples that are not paired from among the paired datasets and use the selected text samples and image samples as the negative pairs.
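For illustration, such in-batch negative pair construction may be sketched as follows in Python; the function name and sample identifiers are hypothetical, and the sketch simply reassigns texts so that no image keeps its original (positive) partner:

```python
import random

def make_negative_pairs(image_ids, text_ids, seed=0):
    """Pair each image with a text from a *different* pair in the batch.

    A derangement-style reshuffle guarantees that no text keeps its
    original (positive) partner; assumes a batch size of at least 2.
    """
    rng = random.Random(seed)
    n = len(text_ids)
    perm = list(range(n))
    # Reshuffle until the permutation has no fixed points.
    while any(i == p for i, p in enumerate(perm)):
        rng.shuffle(perm)
    return [(image_ids[i], text_ids[perm[i]]) for i in range(n)]
```

In practice, such mismatched pairs would be labeled “non-matching” for the ITM task described later.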
In S32, an image-text sample pair may be selected from among the paired datasets. The selected image-text sample pair may correspond to either a positive or negative pair. If training is conducted in units of batches, a collection of image-text sample pairs equivalent to the batch size may be selected to form a batch, and subsequent steps may also be performed in units of batches.
In S33-1, a plurality of patch features for an image sample may be generated through an image encoder. For example, the embedding system 10 may partition the image sample into a plurality of patches (i.e., patch sequences) and generate a plurality of patch features by encoding the patches through the image encoder.
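The partitioning of an image sample into a patch sequence may be sketched as follows; the array layout and function name are hypothetical, and the learned projection that maps each flattened patch to the encoder's feature dimension is omitted:

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches.

    H and W are assumed divisible by patch_size; a learned linear
    projection (omitted here) would then map each flattened patch to
    the image encoder's feature dimension.
    """
    h, w, c = image.shape
    p = patch_size
    patches = (
        image.reshape(h // p, p, w // p, p, c)
             .transpose(0, 2, 1, 3, 4)      # group patch rows/cols together
             .reshape(-1, p * p * c)        # one row per patch
    )
    return patches  # shape: (num_patches, patch_size**2 * C)
```

For example, an 8x8x3 image with a patch size of 4 yields a sequence of 4 patches of 48 values each.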
Patch features may also be referred to as “visual features” or “image features.” Additionally, since patch features may operate as embeddings of a patch (or an image sample), patch features may also be referred to as “patch embeddings” or “image embeddings.” Moreover, the term “patch” may be used interchangeably with “token.” In other words, an image token may represent an image patch.
The image encoder may also be referred to as an “image/visual embedding model,” “image/visual embedder,” or “image/visual encoding module.”
The image encoder may be understood as being a component of the deep learning model 11 of
Referring to
The image encoder 40 may further receive a predefined special token, i.e., a classification (CLS) token 42, and output a corresponding patch feature 44, which may also be referred to as a “special patch feature,” but the present disclosure is not limited thereto. The technical significance of the CLS token 42 and the fact that the feature 44 implies general information on the image sample 41 are already well known in the art to which the present disclosure pertains, and thus, detailed descriptions thereof will be omitted. It is assumed that the image encoder 40 further receives the CLS token 42. In
Referring back to
Token features may also be referred to as “text features.” Additionally, since token features may operate as embeddings of a token (or a text sample), token features may also be referred to as “token embeddings” or “text embeddings.”
The text encoder may also be referred to as a “text embedding model,” “text embedder,” “text embedding module,” or “language model.”
The text encoder may be understood as being a component of the deep learning model 11 of
Referring to
The text encoder 60 may further receive a predefined special token, i.e., a CLS token 62, and output a corresponding token feature 64, which may also be referred to as a “special token feature,” but the present disclosure is not limited thereto. The technical significance of the CLS token 62 and the fact that the feature 64 implies general information on the text sample 61 are already well known in the art to which the present disclosure pertains, and thus, detailed descriptions thereof will be omitted. It is assumed that the text encoder 60 further receives the CLS token 62.
The structure of the text encoder 60 is as illustrated in
The embedding layer 71 may refer to a layer that receives each of multiple tokens 63 (e.g., receives the one-hot vectors of the respective tokens 63) and outputs token-level embeddings 75. The embedding layer 71 may be implemented as a neural network layer such as a fully-connected layer or a multi-layer perceptron (MLP), but the present disclosure is not limited thereto.
Each of the encoding layers 72 may be configured to include at least one self-attention layer 73 and at least one feedforward layer 74. The self-attention layer 73 may analyze the relationships between input embeddings 75, and the feedforward layer 74 may aggregate information on the input embeddings 75 based on the results of the analysis. The self-attention layer 73 and the feedforward layer 74 are already well known in the art to which the present disclosure pertains, and thus, detailed descriptions thereof will be omitted.
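A minimal numerical sketch of a single-head self-attention layer followed by a position-wise feedforward layer is given below; the weight matrices are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of embeddings X (n, d).

    Returns the attended features and the (n, n) attention map whose
    row i holds the attention weights of token i over all tokens.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product scores
    A = softmax(scores, axis=-1)              # attention map
    return A @ V, A

def feedforward(X, W1, b1, W2, b2):
    """Position-wise feedforward: linear -> ReLU -> linear."""
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2
```

Each row of the returned attention map sums to 1, which is why it can later serve as a relevance measure between tokens.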
Referring back to
The multimodal encoder may be implemented based on attention mechanisms (e.g., cross-attention) or transformers, but the present disclosure is not limited thereto. The structure of the multimodal encoder will be described later in detail.
The multimodal encoder may also be referred to as a “multimodal/joint embedding model,” “multimodal/joint embedder,” or “multimodal/joint embedding module.”
In S35, a first image-text matching (ITM) loss may be calculated by performing an ITM task based on the joint embeddings. For a clearer understanding, S35 will hereinafter be described with reference to
Referring to
For example, the embedding system 10 may input an embedding/feature 95 corresponding to a CLS token, among the joint embeddings 94, into a prediction layer 91 to generate a matching score for the two samples (i.e., to predict the matching status of the two samples). The prediction layer 91 may be implemented as a binary classification layer (e.g., a fully-connected layer) performing binary classification for matching status (i.e., a matching or non-matching class), and a confidence score for the matching class or a processed value thereof may serve as the matching score. However, the present disclosure is not limited to this. Moreover, the embedding system 10 may calculate a first ITM loss 96 (“LITM”) based on the difference between the class-specific confidence score (or matching score) and the ground truth (e.g., the ground truth for a positive pair may be “matching”). The first ITM loss 96 may be calculated using a classification loss function such as cross-entropy loss, but the present disclosure is not limited thereto.
Alternatively, the embedding system 10 may perform an ITM task by inputting (or feeding) multiple embeddings included in the joint embeddings 94 or a representative embedding (e.g., an average embedding) of the multiple embeddings to the prediction layer 91. For example, if no CLS token is used, the embedding system 10 may calculate the first ITM loss 96 by calculating the representative embedding and inputting (or feeding) the calculated representative embedding to the prediction layer 91.
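The CLS-based variant of the ITM task described above may be sketched as follows, with a single logistic unit standing in for the binary prediction layer 91 (the function and parameter names are hypothetical):

```python
import math

def itm_loss(cls_embedding, weights, bias, is_match):
    """Binary ITM loss computed from the joint CLS embedding.

    A single logistic prediction layer stands in for the binary
    classifier; `is_match` is 1 for a positive pair, 0 for a negative.
    """
    logit = sum(e * w for e, w in zip(cls_embedding, weights)) + bias
    score = 1.0 / (1.0 + math.exp(-logit))   # matching score in (0, 1)
    eps = 1e-12                              # numerical safety margin
    # Binary cross-entropy between the score and the ground truth.
    return -(is_match * math.log(score + eps)
             + (1 - is_match) * math.log(1 - score + eps))
```

A confidently “matching” score thus yields a small loss for a positive pair and a large loss for a negative pair.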
Referring back to
In S37, a second ITM loss may be calculated by performing an ITM task again using the masked patch features and token features. For example, the embedding system 10 may generate joint embeddings using a multimodal encoder in a similar manner to that described above in connection with S34 and S35 and perform an ITM task again based on the generated joint embeddings. For a clearer understanding, S37 will hereinafter be described in further detail with reference to
Referring to
S36 and S37 may be performed only for positive pairs. For negative image-text sample pairs, there are minimal or no image sample features associated with specific text sample tokens, so conducting an ITM task again offers almost no practical benefit.
A soft masking technique and S36 will hereinafter be described with reference to
The concept of soft masking and the rationale for employing soft masking in some embodiments of the present disclosure will be briefly described first with reference to
Referring to
Meanwhile, a method to enhance the performance of a multimodal embedding model by preventing each deep learning model from focusing solely on discriminative regions of an image through patch-level hard masking and increasing the difficulty of an ITM task may be explored. However, patch-level masking for masking a region of an image sample associated with a specific token (e.g., a token referring to a particular object) in units of patches requires fine-grained annotation information at an object level, incurring substantial costs in preparing paired datasets tagged with such annotation information. Moreover, even if such annotation information exists, it is not easy in general to mask a specific object region in units of patches. Furthermore, even if patch-level masking of the specific object region is possible, hard masking may degrade the performance of the multimodal embedding model because it causes significant damage to the features (or context) of the image sample by removing not only the features/information of the specific object region, but also the features/information of its surroundings.
For these and other reasons, a soft masking technique may be employed in some embodiments of the present disclosure. The soft masking technique performs masking to the extent that does not damage the features (or context) of the image sample, thereby increasing the difficulty of an ITM task and guiding each deep learning model to comprehensively understand various features (or information) of the image sample.
S36 will hereinafter be described with reference to
Referring to
In S122, attention values for a specific token and patch features may be extracted from the attention map. The extracted attention values may be understood as representing the relationships between features of the specific token and the patch features. For a clearer understanding, S122 will hereinafter be described with reference to
Referring to
The embedding system 10 may extract attention values for a region 137 of the attention map 134 that corresponds to the specific token. The embedding system 10 may normalize the attention map 134, for example, to values between 0 and 1, and extract attention values from the normalized attention map 134.
In some embodiments, the multimodal encoder 90 may be configured to include multiple attention layers. In this case, the embedding system 10 may aggregate multiple attention maps, generated in the multiple attention layers (e.g., through a method such as arithmetic mean, weighted average, etc.), and extract attention values from the attention map obtained by the aggregation. In this manner, a more sophisticated (or accurate) soft mask may be generated.
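The aggregation of per-layer attention maps may be sketched as follows (the function name is hypothetical; either an arithmetic mean or a weighted average is supported):

```python
import numpy as np

def aggregate_attention_maps(maps, weights=None):
    """Aggregate per-layer attention maps into a single map.

    `maps` is a list of K same-shaped arrays; with no weights an
    arithmetic mean is taken, otherwise a weighted average.
    """
    stack = np.stack(maps)                        # (K, n, n)
    if weights is None:
        return stack.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (stack * w[:, None, None]).sum(axis=0) / w.sum()
```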
Moreover, in some embodiments, as illustrated in
Specifically, the embedding system 10 may perform an ITM task for a positive sample pair consisting of an image sample and a text sample to produce the matching score 136. As mentioned earlier, the embedding system 10 may generate the matching score 136 by inputting the joint embeddings 135 into the prediction layer 91.
Thereafter, the embedding system 10 may calculate the gradient of the matching score 136 with respect to the attention map 134 (i.e., with respect to each attention value). That the gradient indicates the influence of the attention map 134 on the matching score 136 is obvious to those skilled in the art to which the present disclosure pertains, and thus, a detailed description thereof will be omitted.
Thereafter, the embedding system 10 may reflect the gradient into the attention map 134. For example, the embedding system 10 may reflect the gradient into the attention map 134 through an operation such as element-wise multiplication. Specifically, the embedding system 10 may reflect the gradient into the attention map 134, as indicated by Equation (1):
where i denotes the i-th sample pair within a batch, A_GCAM denotes an attention map with the gradient reflected therein, K denotes the number of attention layers (or attention maps), ReLU denotes the rectified linear unit (ReLU), which is a type of activation function, q+_ITM denotes the matching score for a positive pair, A_k denotes a k-th attention map, and the symbol ⊙ denotes an element-wise multiplication operation.
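Assuming the gradient of the positive-pair matching score with respect to each attention map has already been obtained (e.g., via automatic differentiation), the gradient reflection of Equation (1) may be sketched numerically as follows; the function name is hypothetical:

```python
import numpy as np

def gradient_reflected_attention(attention_maps, gradients):
    """Reflect matching-score gradients into attention maps.

    For each of the K layers, the gradient of the positive-pair matching
    score w.r.t. the attention map is element-wise multiplied into the
    map, ReLU keeps only positively contributing values, and the K
    results are averaged, in the spirit of Equation (1).
    """
    acc = None
    for A, G in zip(attention_maps, gradients):
        term = np.maximum(G * A, 0.0)       # ReLU(grad ⊙ attention map)
        acc = term if acc is None else acc + term
    return acc / len(attention_maps)
```

Attention values that contribute negatively to the matching score are thereby zeroed out before the soft mask is built.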
Referring back to
where M_soft denotes the soft mask, i_w denotes the index of the selected token, and the symbol ∧ denotes a normalized state.
Thereafter, in S124, the soft mask may be applied to the patch features. For example, the embedding system 10 may apply the soft mask to the patch features through an operation such as element-wise multiplication. As a result, only the patch features associated with the specific token may be selectively weakened.
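The soft mask generation and application of S123 and S124 may be sketched as follows, assuming min-max normalization as the normalization scheme (a hypothetical choice) and hypothetical function names:

```python
import numpy as np

def soft_mask_patches(patch_features, reflected_map, token_index):
    """Build and apply a soft mask for one selected text token.

    The row of the gradient-reflected attention map for the selected
    token is min-max normalized to [0, 1]; subtracting it from 1 yields
    a mask that weakens, rather than removes, the associated patches.
    """
    row = reflected_map[token_index]
    span = row.max() - row.min()
    norm = (row - row.min()) / span if span > 0 else np.zeros_like(row)
    mask = 1.0 - norm                       # soft mask (M_soft)
    # Element-wise multiplication selectively weakens patch features
    # that are strongly associated with the selected token.
    return patch_features * mask[:, None]
```

A patch with the highest attention value is fully suppressed, while weakly related patches are left nearly intact, preserving the image context.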
S36 has been described so far with reference to
Referring back to
Meanwhile, in some embodiments, the embedding system 10 may further calculate another type of loss, i.e., an image-text contrastive learning (ITC) loss, by performing an ITC task. The embedding system 10 may further update the weight parameters of the encoders based on the ITC loss, which may also be reflected in the total loss. Consequently, semantic alignment between image embeddings (i.e., patch features) and text embeddings (i.e., token features) may be strengthened, as will be described later with reference to
Moreover, in some embodiments, the embedding system 10 may calculate another type of loss, i.e., a masked-language modeling (MLM) loss, by performing an MLM task. Thereafter, the embedding system 10 may update the weight parameters of the encoders based on the MLM loss, which may also be reflected in the total loss. Consequently, embeddings (e.g., text embeddings or joint embeddings) that are faithful to contextual information of input text may be generated, as will be described later with reference to
Furthermore, in some embodiments, embedding learning may be performed. For example, the embedding system 10 may update the weight parameters of the encoders by concurrently performing the aforementioned tasks, as illustrated in
Referring back to
Meanwhile, although not explicitly illustrated in
The multimodal embedding method according to some embodiments of the present disclosure has been described so far with reference to
Moreover, a gradient indicating the influence of an attention map (generated in an attention layer of the multimodal encoder 90) on the matching score of a positive pair may be calculated. Attention values corresponding to the specific token may be extracted from an attention map with the calculated gradient reflected therein, and a soft mask may be generated based on the extracted attention values. In this case, a refined soft mask capable of selectively weakening only the features of the image sample that are associated with the specific token may be generated. Since this approach does not require fine-grained annotation information (e.g., object region information), self-supervised learning may be enabled, and the cost of preparing paired datasets may be considerably reduced.
An ITC task-based multimodal embedding learning method according to some embodiments of the present disclosure will hereinafter be described with reference to
Referring to
Thereafter, the embedding system 10 may calculate a feature (or embedding) similarity 159 between at least some of the patch features 153 and at least some of the token features 157.
For example, the embedding system 10 may calculate a feature similarity 159 between patch and token features 154 and 158 that correspond to a CLS token, using a similarity operation (e.g., multiplication, cosine similarity, etc.).
In another example, the embedding system 10 may calculate a feature similarity 159 between multiple patch features 153 and multiple token features 157 using similarity operations.
In another example, the embedding system 10 may calculate a representative feature for the patch features 153 and a representative feature for the token features 157 and may calculate a feature similarity 159 between the calculated representative features using similarity operations.
In another example, the embedding system 10 may calculate a first feature similarity (i.e., an image-to-text feature similarity) by performing the similarity operations on the patch features 154 first and then on the token features 158 and may calculate a second feature similarity (i.e., a text-to-image feature similarity) by performing the similarity operations in the opposite order (in a case where similarity operations do not satisfy the commutative law). The embedding system 10 may then calculate the feature similarity 159 using the weighted sum of the first and second feature similarities.
In another example, the embedding system 10 may calculate the feature similarity 159 using various combinations of the aforementioned examples.
If target features that are subject to similarity operations, i.e., the patch features 154 and the token features 158, have different sizes or different numbers of dimensions or are not suitable for the similarity operations, the embedding system 10 may modify the sizes (or the numbers of dimensions) of the target features (e.g., change the sizes of the target features to be the same) using a projection layer (e.g., a linear projection layer) before performing the similarity operations. In this case, the weight parameters of the projection layer may also be updated based on an ITC loss.
Thereafter, the embedding system 10 may calculate an ITC loss based on the feature similarity 159. For example, for a positive pair, the embedding system 10 may produce a smaller ITC loss for a greater feature similarity. On the contrary, for a negative pair, the embedding system 10 may produce a smaller ITC loss for a smaller feature similarity 159.
In some embodiments, the embedding system 10 may calculate an ITC loss (e.g., a type of focal loss) by reflecting a focal weight in the feature similarity 159. The focal weight may be determined to be smaller for a greater feature similarity 159 and larger for a smaller feature similarity 159. In this manner, a smaller weight may be assigned to easier sample pairs and a greater weight may be assigned to more challenging sample pairs. For example, the embedding system 10 may calculate an ITC loss, as indicated by Equation (3):
where L*ITC denotes the ITC loss with the focal weight reflected therein, B denotes the batch size, Pv2t and Pt2v denote the image-to-text and text-to-image feature similarities, respectively, (1-Pv2t) and (1-Pt2v) denote the focal weights for the image-to-text and text-to-image feature similarities Pv2t and Pt2v, respectively, and γ denotes a factor for adjusting the focal weights.
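A focal-weighted contrastive loss of the kind the variables above describe can be sketched as follows. The exact form of Equation (3) is not reproduced in the text, so this is a hedged illustration assuming the standard focal-loss shape, with (1-P)^γ damping the log-likelihood of each positive pair and the result averaged over the batch and both directions.

```python
import numpy as np

def focal_itc_loss(p_v2t, p_t2v, gamma=2.0):
    """Focal-weighted ITC loss over a batch of B positive pairs.

    p_v2t, p_t2v: matching probabilities of each positive pair in the
    image-to-text and text-to-image directions (values in (0, 1)).
    The (1 - p)**gamma focal weight shrinks the contribution of easy
    pairs (p near 1) and emphasizes challenging ones (p near 0).
    """
    p_v2t, p_t2v = np.asarray(p_v2t), np.asarray(p_t2v)
    loss = -((1 - p_v2t) ** gamma * np.log(p_v2t)
             + (1 - p_t2v) ** gamma * np.log(p_t2v))
    return float(loss.mean() / 2)  # average over batch and both directions

# a confident (easy) batch yields a much smaller loss than a hard one
easy = focal_itc_loss([0.95, 0.90], [0.92, 0.90])
hard = focal_itc_loss([0.30, 0.40], [0.35, 0.40])
```

Raising `gamma` (the γ factor above) increases how aggressively easy pairs are down-weighted; `gamma=0` recovers a plain contrastive log-loss.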
Thereafter, the embedding system 10 may update the image encoder 40 and the text encoder 60 based on the ITC loss. Consequently, semantic alignment between the patch features 153 (or image embeddings), which are generated through the image encoder 40, and the token features 157 (or text embeddings), which are generated through the text encoder 60, may be enhanced. As a result, the embedding performance for multimodal data may be further improved.
The ITC task-based multimodal embedding learning method according to some embodiments of the present disclosure has been described so far with reference to
Referring to
Thereafter, the embedding system 10 may generate joint embeddings 166 by inputting the patch features 165 and the token features 164 into the multimodal encoder 90. As mentioned earlier, the joint embeddings 166 may include embeddings for the corresponding tokens 162 of the text sample.
Thereafter, the embedding system 10 may predict a value 168 of the mask token 163 (i.e., the original token 162 before substitution) by inputting (or feeding) the embedding 167 corresponding to the mask token 163 into a token prediction layer 161. The token prediction layer 161 may be configured to output confidence scores for predefined tokens (i.e., token-specific confidence scores) and may be implemented as, for example, a fully connected layer, but the present disclosure is not limited thereto.
Thereafter, the embedding system 10 may calculate an MLM loss indicating the difference between the token prediction result and the ground truth, using a classification loss function such as cross-entropy loss, and may update the weight parameters of the image encoder 40, the text encoder 60, and the multimodal encoder 90 based on the MLM loss.
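The token prediction and MLM loss steps above can be sketched as follows. The single fully connected layer, the dimensions, and the vocabulary size are illustrative assumptions (the disclosure notes the prediction layer is not limited to a fully connected layer).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def mlm_step(mask_embedding, w_pred, b_pred, target_id):
    """Predict the masked token from its joint embedding and score it.

    A fully connected layer produces one confidence score per
    predefined token; the MLM loss is the cross-entropy between the
    prediction and the ground-truth (original) token.
    """
    logits = mask_embedding @ w_pred + b_pred   # (vocab_size,)
    probs = softmax(logits)
    predicted_id = int(np.argmax(probs))        # most confident token
    loss = float(-np.log(probs[target_id]))     # cross-entropy loss
    return predicted_id, loss

rng = np.random.default_rng(0)
embed_dim, vocab_size = 16, 100                 # hypothetical sizes
w = rng.normal(size=(embed_dim, vocab_size))
b = np.zeros(vocab_size)
emb = rng.normal(size=embed_dim)                # embedding of the mask token
pred, loss = mlm_step(emb, w, b, target_id=42)
```

In training, the gradient of this loss would be backpropagated through the multimodal encoder and into both unimodal encoders, as described above.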
The MLM task-based multimodal embedding learning method according to some embodiments of the present disclosure has been described so far with reference to
Multimodal embedding learning based on the aforementioned tasks will hereinafter be described with reference to
As illustrated in
Also, the embedding system 10 may calculate an MLM loss 175-3 (“LMLM”) by performing an MLM task based on a joint embedding 176-1 corresponding to a mask token.
Thereafter, the embedding system 10 may generate a soft mask 173-2 for a specific token using the aforementioned method and may apply the soft mask 173-2 to the patch features 173-1. As a result, only the patch features 173-1 that are associated with the specific token may be selectively weakened.
Thereafter, the embedding system 10 may generate joint embeddings 176-2 by inputting the masked patch features 173-1 and the token features 173-2 into the multimodal encoder 90. Thereafter, the embedding system 10 may calculate the second ITM loss 175-4 by performing an ITM task again. The second ITM loss 175-4 may effectively prevent each deep learning model (e.g., the image encoder 40 or the multimodal encoder 90) from focusing solely on discriminative regions of an input image and instead may guide each deep learning model to comprehensively understand various features (information) of the input image and generate embeddings.
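Applying the soft mask before the second ITM pass can be sketched as a simple elementwise scaling, shown below. The per-patch scaling scheme and the example mask values are illustrative assumptions; the key property is that patches tied to the specific token are attenuated while the rest pass through nearly unchanged.

```python
import numpy as np

def apply_soft_mask(patch_features, soft_mask):
    """Scale each patch feature by its soft-mask value in [0, 1].

    Patches strongly associated with the chosen token receive small
    mask values, so only those features are selectively weakened;
    unrelated patches are left (almost) intact. The masked features
    would then be fed to the multimodal encoder for the second ITM
    pass, forcing it to match the text without the most
    discriminative image regions.
    """
    return patch_features * soft_mask[:, None]

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))             # 4 patch features, dim 8
soft_mask = np.array([0.1, 0.9, 1.0, 0.95])   # patch 0 relates to the token

masked = apply_soft_mask(patches, soft_mask)
```

Because the weakening is soft rather than a hard zeroing, the ITM task becomes harder without discarding the masked regions entirely.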
While not explicitly illustrated in
In some embodiments, the embedding system 10 may substitute a specific token of the text sample 172 with a mask token and may then perform ITM and MLM tasks using the soft mask 173-2 for the same token. Alternatively, the embedding system 10 may perform an ITC task using the patch features 173-1 and token features 173-2 with the soft mask 173-2 applied thereto for the same token. In these cases, semantic alignment between image embeddings (i.e., patch features) and text embeddings (i.e., token features) may be further improved.
Multimodal embedding learning based on various tasks has been described so far with reference to
Referring to
The processor 181 may control the overall operations of the components of the computing device 180. The processor 181 may be configured to include at least one of a central processing unit (CPU), a micro-processor unit (MPU), a micro controller unit (MCU), a graphics processing unit (GPU), and any other known form of processor in the field to which the present disclosure pertains. The processor 181 may perform computations for at least one application or program for executing operations/methods according to some embodiments of the present disclosure. The computing device 180 may be equipped with one or more processors.
The memory 182 may store various data, commands, and/or information. The memory 182 may load the computer program 186 from the storage 185 to execute the operations/methods according to some embodiments of the present disclosure. The memory 182 may be implemented as a volatile memory such as a random-access memory (RAM), but the present disclosure is not limited thereto.
The bus 183 may provide communication functionality among the components of the computing device 180. The bus 183 may be implemented in various forms, including an address bus, a data bus, and a control bus.
The communication interface 184 may support both wired and wireless Internet communication for the computing device 180. Additionally, the communication interface 184 may also support various other communication methods. For this purpose, the communication interface 184 may be configured to include a communication module that is well known in the field to which the present disclosure pertains.
The storage 185 may temporarily store at least one computer program 186. The storage 185 may be configured to include a non-volatile memory (such as a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory), a hard disk, a removable disk, or any other well-known computer-readable medium in the field to which the present disclosure pertains.
The computer program 186 may include one or more instructions that, upon being loaded into the memory 182, direct the processor 181 to perform the operations/methods according to some embodiments of the present disclosure. In other words, by executing the loaded instructions, the processor 181 may perform the operations/methods according to some embodiments of the present disclosure.
For example, the computer program 186 may include instructions to perform the following operations: generating a plurality of patch features for an image sample through an image encoder; generating a plurality of token features for a text sample through a text encoder; softly masking patch features associated with a specific token of the text sample; generating joint embeddings by inputting the masked patch features and the token features into a multimodal encoder; and updating the multimodal encoder by performing an ITM task based on the joint embeddings, wherein the image sample and text sample form a positive pair. In this example, the embedding system 10 may be implemented by the computing device 180.
In some embodiments, the computing device 180 may refer to a virtual machine implemented based on cloud technology. For example, the computing device 180 may be a virtual machine operating on one or more physical servers within a server farm. In this example, at least some of the components of the computing device 180, i.e., the processor 181, the memory 182, and the storage 185, may be implemented as virtual hardware, and the communication interface 184 may be implemented as a virtual networking element such as a virtual switch.
An exemplary computing device 180 that may implement the embedding system 10 has been described so far with reference to
Various embodiments of the present disclosure and their effects have been described with reference to
According to the aforementioned and other embodiments of the present disclosure, each deep learning model for multimodal embedding (e.g., an image encoder, a text encoder, or a multimodal encoder) may be trained by softly masking features of an image sample that are associated with a specific token (or word) and performing an image-text matching (ITM) task. In this case, each deep learning model may be effectively prevented from focusing solely on discriminative regions of an input image, enabling the creation of embeddings that encompass various information (or contextual information) on the input image. That is, each deep learning model may comprehensively understand the input image and may thereby generate embeddings. Furthermore, as the use of soft masking may increase the difficulty of the ITM task, the multimodal embedding performance of each deep learning model may be further enhanced. Additionally, as the multimodal embedding performance (or accuracy) of each deep learning model is improved, the performance of various multimodal tasks (e.g., image-to-text retrieval, text-to-image retrieval, visual question answering, image captioning, etc.) may also be significantly improved.
Also, a gradient indicating the influence of an attention map (generated in the attention layer of the multimodal encoder) on the matching score of a positive pair may be calculated. Then, attention values corresponding to the specific token are extracted from the attention map with the gradient reflected therein, and a soft mask may be generated based on the extracted (or inverted) attention values. In this case, a sophisticated soft mask capable of selectively weakening the features of the image sample that are associated with the specific token may be generated. Since this approach does not require fine-grained annotation information (e.g., object region information), self-supervised learning may be enabled, and the cost of preparing paired datasets may be considerably reduced.
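The gradient-weighted attention procedure above can be sketched as follows. The specific operations (keeping only positive gradient-weighted attention, normalizing to [0, 1], then inverting) are assumptions about one plausible realization in the spirit of gradient-based attention attribution; the disclosure does not fix these exact steps.

```python
import numpy as np

def soft_mask_from_attention(attention, gradient):
    """Build a soft mask from token-specific, gradient-weighted attention.

    attention: attention values, for one text token, over the image
    patches (extracted from the multimodal encoder's attention map);
    gradient: the gradient of the positive pair's matching score with
    respect to those attention values. High gradient-weighted
    attention marks patches tied to the token, so the values are
    inverted: strongly attended patches receive small mask values and
    are weakened when the mask is applied to the patch features.
    """
    relevance = np.maximum(attention * gradient, 0)   # keep positive influence
    relevance = relevance / (relevance.max() + 1e-8)  # normalize to [0, 1]
    return 1.0 - relevance                            # invert into a soft mask

attn = np.array([0.70, 0.10, 0.15, 0.05])  # token attends mostly to patch 0
grad = np.array([0.90, 0.20, 0.10, 0.30])
mask = soft_mask_from_attention(attn, grad)
```

Because both the attention map and the gradient come from the model itself on an ordinary image-text pair, no object-region annotation is needed, which is what enables the self-supervised setup described above.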
Also, by performing an image-text contrastive learning (ITC) task, semantic alignment between image embeddings (i.e., patch features) and text embeddings (i.e., token features) may be enhanced, leading to further improvement in the multimodal embedding performance of each deep learning model.
Also, by calculating a contrastive learning loss based on a focal weight, each deep learning model (e.g., the text encoder or the image encoder) may focus more on challenging sample pairs during training. Consequently, the multimodal embedding performance of each deep learning model may be further enhanced.
Also, by performing a masked-language modeling (MLM) task, embeddings (e.g., text embeddings, joint embeddings, etc.) faithful to contextual information of input text may be easily generated.
However, the technical concepts of the present disclosure are not limited to the effects set forth herein, and other effects not explicitly mentioned may be readily understood by those skilled in the art to which the present disclosure pertains from the foregoing description.
The technical features of the present disclosure described so far may be embodied as computer-readable code on a computer-readable medium. The computer-readable medium may be, for example, a removable recording medium (a CD, a DVD, a Blu-ray disc, a USB storage device, or a removable hard disk) or a fixed recording medium (a ROM, a RAM, or a computer-equipped hard disk). The computer program recorded on the computer-readable medium may be transmitted to another computing device via a network such as the Internet and installed in the other computing device, thereby being used in the other computing device.
Although operations are shown in a specific order in the drawings, it should not be understood that desired results may be obtained when the operations must be performed in the specific order or sequential order or when all of the operations must be performed. In certain situations, multitasking and parallel processing may be advantageous. According to the above-described embodiments, it should not be understood that the separation of various configurations is necessarily required, and it should be understood that the described program components and systems may generally be integrated together into a single software product or be packaged into multiple software products.
In concluding the detailed description, those skilled in the art will appreciate that many variations and modifications may be made to the example embodiments without substantially departing from the principles of the present disclosure. Therefore, the disclosed example embodiments of the disclosure are used in a generic and descriptive sense only and not for purposes of limitation.
| Number | Date | Country | Kind |
|---|---|---|---|
| 10-2023-0025792 | Feb 2023 | KR | national |