This disclosure generally relates to methods that generate visual relationship graphs that identify relationships between objects depicted in an image. More specifically, but not by way of limitation, this disclosure relates to using visual-relationship probes to generate graph structures that identify dependencies between the data objects depicted in the image.
Visual relationship models that describe object relationships in images have become increasingly important for high-level computer vision (CV) tasks that need complex reasoning. The visual relationship models are often organized in a structured graph representation called a scene graph, where nodes represent objects and edges represent relationships between objects. Recently, there has been significant progress in applying such scene graphs to various CV reasoning tasks such as image captioning, image retrieval, and visual reasoning.
Despite the progress, current visual relationship models still rely on manually annotated relationship labels. As the number of objects represented by the visual relationship models increases, the number of relationships between the objects becomes even greater. It is thus difficult to collect enough manual annotations to sufficiently represent important but less frequently observed relationships. Consequently, current visual relationship models tend to focus on modeling only a few relationships that are derived from a large number of manual annotations. Although conventional techniques have attempted to use external knowledge databases to help enrich visual relationships, the total number of relationships remains relatively low.
Self-supervised natural language processing (NLP) systems have been used to build contextualized language models of text corpuses without manual intervention. The removal of human annotators from the training phase has enabled training on large unlabeled datasets and led to significant advances in NLP performance. These self-supervised algorithms have also brought advances in vision-language (VL) pre-training tasks. Existing VL techniques concatenate visual objects and the corresponding sentences as one input and apply a transformer module to learn contextualized multi-modal representations in a self-supervised manner. The existing VL techniques, however, rely heavily on the multi-head attention layers or attention distributions to identify implicit relations between objects. Each of the multi-head attention layers may have a distinct behavior without providing much context on how a particular object relates to another object. It is thus challenging to generate a visual relationship model based on the multi-head attention layers. In some instances, existing VL techniques generate an inaccurate visual relationship model due to difficulties in identifying implicit relationships between objects.
Certain embodiments involve using visual-relationship probes to generate graph structures that identify dependencies between the data objects depicted in the image. For example, a vision-language modeling application identifies an inter-modality representation derived from a data object of a plurality of multimodal data objects, in which the data object represents a region depicted in an image or a token characterizing at least part of the image. The vision-language modeling application generates a graph structure of the data object by processing the inter-modality representation. The graph structure identifies one or more dependencies of the data object to other multimodal data objects, in which the one or more dependencies were used to derive the inter-modality representation.
These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
Features, embodiments, and advantages of the present disclosure are better understood when the following Detailed Description is read with reference to the accompanying drawings.
Certain embodiments described herein can address one or more of the problems identified above by generating graph structures (e.g., visual relationship models) that accurately identify relationships between data objects (e.g., regions of an image). The visual relationship models are then used to perform various vision-and-language tasks, such as image retrieval, visual reasoning, and image captioning.
In an illustrative example, a vision-language (VL) modeling application defines a set of regions in an image. For instance, if the image depicts three different objects (e.g., a container, a piece of hot dog, and a glass of water), the VL modeling application defines three regions, such that each region represents one of the objects depicted in the image. For each region, the VL modeling application generates an input embedding representative of the region. The input embedding identifies one or more visual characteristics of the region and a position of the region within the image. In the current example, an input embedding of a region representing the hot dog identifies a size or shape of the hot dog (e.g., a token embedding) and that the hot dog is located within the container (e.g., a position embedding). In some instances, the input embedding of the region includes an identifier usable to distinguish the region from other regions of the image (e.g., a segment embedding).
Continuing with this example, the VL modeling application applies a first transformer encoder to the input embedding to generate an intra-modality representation of the region. The intra-modality representation identifies an image object depicted in the region, in which the image object can be identified based on the first transformer encoder processing one or more other regions of the set of regions. For example, a Bidirectional Encoder Representations from Transformers (BERT) encoder receives the input embedding of the region and identifies the image object depicted in the region (e.g., bread) based on how the image object is associated with image objects depicted in other regions of the image (e.g., toppings, a napkin).
In this example, the VL modeling application applies a second transformer encoder to the intra-modality representation of the region to generate an inter-modality representation of the region. The inter-modality representation identifies that the region corresponds to one or more tokens that describe the image object depicted in the region. The tokens are derived from processing a natural-language text sequence. In the above example, the text sequence is an image caption that states “A container with a hot dog next to a tall glass of water.” The second transformer encoder generates the inter-modality representation of the region depicting the bread and identifies that the image object depicted in the region corresponds to the tokens “hot” and “dog.” By associating the image with the caption text, the VL modeling application can generate a graph structure that accurately identifies regions of the image (“hot dog”), in contrast to processing the image alone (“bread”).
The VL modeling application generates a graph structure that represents one or more dependencies between the region and the one or more other regions in the image. The graph structure is generated by processing the inter-modality representation of the region. In some instances, the dependencies indicate that the inter-modality representation of the region was derived in part by processing the one or more other regions. Continuing with the above example, the graph structure indicates that the inter-modality representation of the region was identified as “hot dog” based on a first dependency with a second image region identified as “topping” and a second dependency with a third image region identified as “tray.”
In some embodiments, a computer system uses the graph structure to perform various VL tasks. By identifying the dependencies that were used to derive the inter-modality representation, the graph structure can accurately convey information and improve performance of subsequent vision-and-language tasks (e.g., image retrieval, image captioning, visual reasoning). For example, a search engine provides a user interface for retrieving images with a query image as input. In addition to the query image, the search engine takes the graph structure as an additional input. Specifically, the search engine provides the query image to the VL modeling application, which processes the query image to generate the graph structure. The VL modeling application transmits the generated graph structure back to the search engine. The search engine then utilizes spatial relationships and dependencies between image objects of the query image for image-based search, thereby retrieving images having similar spatial relationships. Results generated by using the graph structure are more accurate than results generated by existing image-retrieval techniques using the same query image as input.
Certain embodiments described herein thus improve vision-language systems by using self-supervised techniques to identify explicit dependencies in visual objects or textual entities. The generation of such graph structure addresses issues with existing vision-language systems, which suffer from inefficiency (e.g., manual labeling), insufficient information (e.g., lack of relationship data), and heavily-constrained input (e.g., input requires a combination of image and text). By leveraging different aspects of transformer encoders, including masked language modeling and contrastive learning, image objects can be accurately identified while visual relationships between different objects can be discovered. The discovered relationships provide valuable information for performing subsequent vision-language tasks with high accuracy.
Various types of VL tasks can achieve better performance by leveraging implicit relationships obtained with the self-supervised relationship probing (SSRP) framework. By performing image retrieval operations using the implicit visual relationships identified using the SSRP framework, image-based search engines can provide users with higher-quality results that take into account the visual relationships contained in query images. As a result, an effective visual search is performed, thereby helping users find their desired images. With respect to image captioning, more accurate and robust descriptions of images can be obtained with the implicit visual relationships generated by the VL modeling application. This can help blind users (through text-to-speech conversion) or visually impaired users to ‘see’ their surrounding environments better.
Further, certain embodiments described herein improve existing visual relationship or vision-language modeling by implementing an unsupervised or semi-supervised approach to model visual relationships between regions of the image. This is different from existing visual relationship models that heavily rely on fully-supervised, human-annotated labels. This addresses a common problem in existing techniques: manually annotating visual relationships is a highly subjective process in which different human annotators may annotate image objects with different information. Thus, the self-supervised technique described herein removes subjectivity and discovers implicit relations between data objects without requiring any annotations or labels.
Finally, certain embodiments described herein improve existing pretraining models that require vast amounts of data for self-supervised training. In contrast, the transformer encoders and the relationship probe are configured specifically to train effectively with augmented data that can be quickly generated and easily integrated into the self-supervision objectives.
“Modality” refers to a certain type of information and/or the representation format in which information is stored. In some instances, modality includes audio type, text type, image type, tactile type, and other sensory data (e.g., smell, taste) types, each of which characterizes a particular data object.
“Intra-modality encoding” refers to a transformer-encoding process that transforms an input embedding to a contextual representation (e.g., an intra-modality representation) based on a relationship between a data object of a given modality and other data objects associated with the same modality. For example, the intra-modality encoding generates the intra-modality representation of a text token, based on its definition and position with respect to other text tokens in an input text sequence (e.g., a caption).
“Inter-modality encoding” refers to a transformer-encoding process that transforms a first contextual representation (e.g., intra-modality representation) to another contextual representation (e.g., an inter-modality representation) based on a relationship between a data object of a given modality and data objects associated with different modalities. For example, the inter-modality encoding generates the inter-modality representation of a text token, based on a relationship between the text token and one or more regions of an image associated with the input text sequence.
“Transformer encoder” refers to a machine-learning model that transforms a given sequence of elements, such as the sequence of words in a sentence, into another sequence. The transformer encoders involve semi-supervised learning including unsupervised pretraining followed by supervised fine-tuning. Pretraining is performed on a much larger dataset than fine-tuning.
“Relationship probing” refers to a visual and textual probing process that identifies relationships between image objects in an image or tokens in text data. In particular, the relationship probing uses the inter-modality representations to generate a graph structure (e.g., a latent relationship graph) that indicates relationships between the data objects within the same modality. In some instances, the relationships can be depicted as a node-edge graph structure, which can be overlaid on an input image. The graph structure can also be used as input for various vision-language tasks, such as image captioning, visual reasoning, and image retrieval.
“Data augmentation” refers to a regularization technique that is used to avoid overfitting when training machine-learning models. Data augmentation artificially boosts the range and number of training examples (e.g., training images) by transforming existing examples to create additional examples. For example, data augmentation is used to rotate, stretch, and reflect each training image to produce many variants, possibly yielding enough labeled data to improve training of a particular machine-learning model.
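As a non-limiting illustration, the rotate, stretch, and reflect transformations mentioned above can be sketched as follows. The representation of an image as a nested list of pixel values and the function names are assumptions made for this example only; a practical system would typically operate on image tensors.

```python
from typing import List

# Hypothetical stand-in: a grayscale image stored as rows of pixel values.
Image = List[List[int]]


def reflect_horizontal(image: Image) -> Image:
    """Mirror the image left-to-right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def stretch_horizontal(image: Image, factor: int = 2) -> Image:
    """Stretch the image horizontally by repeating each pixel `factor` times."""
    return [[pixel for pixel in row for _ in range(factor)] for row in image]


def augment(image: Image) -> List[Image]:
    """Derive several variants from a single training image."""
    return [
        reflect_horizontal(image),
        rotate_90_clockwise(image),
        stretch_horizontal(image),
    ]
```

Each variant can be added to the training set alongside the original image, so that one example yields several.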
“Contrastive learning” refers to a technique for identifying similar and dissimilar objects (e.g., images) for a machine-learning model. For example, a machine learning model is trained to classify between similar and dissimilar images. In some instances, the contrastive learning is performed by learning generic representations of images on an unlabeled dataset, and then the machine-learning model can be fine-tuned with a small amount of labeled images to achieve good performance for a given classification task. The generic representations for the machine-learning model are learned by simultaneously maximizing agreement between differently transformed views of the same image and minimizing agreement between transformed views of different images.
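The objective of maximizing agreement between views of the same image while minimizing agreement with views of different images can be illustrated with a simple loss function. The specific loss below (a softmax over temperature-scaled cosine similarities) is an assumption chosen for illustration; this disclosure does not prescribe a particular formula, and the function names and the temperature value are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    anchor: Vector,
    positive: Vector,
    negatives: Sequence[Vector],
    temperature: float = 0.5,
) -> float:
    """Loss is low when the anchor agrees with the positive view and
    disagrees with the negative views (views of different images)."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(
        math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives
    )
    return -math.log(pos / (pos + neg))
```

Under this sketch, a representation of an augmented view that stays close to the representation of the original image produces a smaller loss than one that drifts toward the representation of a different image.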
The VL modeling application 102 then uses an input-embedding generator 112 to generate an image input embedding for each data object of the data objects 104. With respect to each of the image regions 108, the input-embedding generator 112 encodes one or more visual characteristics of the region and a position of the region within the image. In some instances, the input embedding of the region includes an identifier usable to distinguish the region from other regions of the image (e.g., a segment embedding). For the text tokens 110, the input-embedding generator encodes, for a given token, a definition of the token (e.g., a token embedding) and a position of the token within a text sequence (e.g., a position embedding).
The VL modeling application 102 then applies components of the SSRP framework 114 to process the input embeddings and generate the graph structure 106. The SSRP framework 114 includes an intra-modality encoder 116, an inter-modality encoder 118, and a relationship probe 120. The intra-modality encoder 116 transforms an input embedding to a contextual representation (e.g., an intra-modality representation) based on a relationship between a data object of a given modality and other data objects associated with the same modality. The intra-modality representation is used to predict an identity of its corresponding data object. For example, the intra-modality encoding generates the intra-modality representation of an image region, based on an image object depicted in the image region 108 as well as a relation of the image region relative to other image regions of the image regions 108.
The inter-modality encoder 118 transforms a first contextual representation (e.g., an intra-modality representation) to another contextual representation (e.g., an inter-modality representation) based on a relationship between a data object of a given modality and data objects associated with different modalities. For example, the inter-modality encoding generates the inter-modality representation of an image region 108, based on its associations with the text tokens 110. Similar to the intra-modality representation, the inter-modality representation is used to predict an identity of its corresponding data object, but the predicted identity of the inter-modality representation is more accurate than that of the intra-modality representation.
The relationship probe 120 identifies relationships between data objects 104 by processing the inter-modality representations to generate the graph structure 106 (e.g., a latent relationship graph) that indicates such relationships. In this example, the graph structure 106 is depicted as a node-edge graph structure, in which nodes represent the image regions 108 and edges identify dependencies between the image regions 108. For example, a dependency identifies an amount of contribution of each image region towards the classification of an image region as a “tennis racquet”. In some instances, the graph structure 106 is overlaid on an input image that includes the image regions 108.
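As a non-limiting illustration of how a node-edge graph structure could be derived, the sketch below assumes that the relationship probe yields a symmetric matrix of pairwise distances between image regions and extracts a minimum spanning tree from that matrix using Prim's algorithm. Using a minimum spanning tree is an assumption made for this example; other edge-selection rules (e.g., thresholding) could be used instead. The function name and example values are hypothetical.

```python
from typing import List, Tuple


def minimum_spanning_tree(distances: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the edges (i, j), with i < j, of a minimum spanning tree.

    `distances[i][j]` is the probe distance between region i and region j;
    smaller values are treated as stronger dependencies.
    """
    n = len(distances)
    if n == 0:
        return []
    in_tree = {0}  # start growing the tree from region 0
    edges: List[Tuple[int, int]] = []
    while len(in_tree) < n:
        best = None  # (distance, node inside the tree, node outside the tree)
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or distances[i][j] < best[0]):
                    best = (distances[i][j], i, j)
        _, i, j = best
        edges.append((min(i, j), max(i, j)))
        in_tree.add(j)
    return sorted(edges)


# Hypothetical distances between three regions (indices 0, 1, and 2).
example_distances = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
example_edges = minimum_spanning_tree(example_distances)
```

The resulting edge list can be drawn over the input image by connecting the centers of the corresponding regions.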
The graph structure 106 can also be used as input for various vision-language tasks, such as image captioning, visual reasoning, and image retrieval. Continuing with the above example, a visual-reasoning application uses the input image and the graph structure 106 to determine that the text sequence “a man reaches out to try to hit a tennis ball” matches the input image over another image depicting a person celebrating a point in a tennis match (not shown).
At step 202, the VL modeling application identifies a data object from a set of data objects. The data object can be a region of an image. In some instances, the data object is a token identified from a text sequence. The data object can be generated by performing data augmentation on an original data object. For example, data augmentation is used to rotate, stretch, and reflect the original data object to derive many variants, including the data object. In some instances, the data augmentation is performed on one or more parts of the input data or the input data as a whole.
At step 204, the VL modeling application generates an input embedding representative of the data object. For example, the input embedding encodes one or more visual characteristics of the region and a position of the region within the image. In some instances, the input embedding of the region includes an identifier usable to distinguish the region from other regions of the image (e.g., a segment embedding). Additionally or alternatively, the input embedding encodes, for a given token, a definition of the token and a position of the token within a text sequence. With respect to generating the input embedding corresponding to a token, the VL modeling application inserts the special tokens [CLS] and [SEP] before and after the text sequence that includes the token, and uses a tokenizer to split the text sequence.
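The construction of a text input embedding can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for an actual tokenizer, the embedding tables are hypothetical lookup structures, and the token, position, and segment embeddings are combined by element-wise summation as in BERT-style models, which is not mandated by this disclosure.

```python
from typing import Dict, List

Vector = List[float]


def build_token_sequence(text: str) -> List[str]:
    """Insert [CLS] before and [SEP] after the tokenized text sequence."""
    # Whitespace splitting is a stand-in for a real (sub)word tokenizer.
    return ["[CLS]"] + text.lower().split() + ["[SEP]"]


def input_embeddings(
    tokens: List[str],
    token_table: Dict[str, Vector],
    position_table: List[Vector],
    segment_vector: Vector,
) -> List[Vector]:
    """Sum the token, position, and segment embeddings for each token."""
    embeddings = []
    for position, token in enumerate(tokens):
        token_vec = token_table[token]
        position_vec = position_table[position]
        embeddings.append(
            [t + p + s for t, p, s in zip(token_vec, position_vec, segment_vector)]
        )
    return embeddings
```

For an image region, an analogous sketch would replace the token embedding with a visual feature vector and the position embedding with an encoding of the region's location within the image.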
At step 206, the VL modeling application applies an intra-modality encoding module to the input embedding to generate an intra-modality representation of the data object. The intra-modality encoding module transforms the input embedding to the intra-modality representation based on a relationship of the data object with other data objects associated with the same modality. For example, intra-modality encoding generates the intra-modality representation of a text token, based on its position with respect to other text tokens in an input text sequence (e.g., a caption). In another example, the intra-modality representation identifies an image object depicted in the region, in which the image object can be identified based on the first transformer encoder processing one or more other regions of the set of regions. In some instances, a Bidirectional Encoder Representations from Transformers (BERT) encoder receives the input embedding of the region and identifies the image object depicted in the region.
At step 208, the VL modeling application applies an inter-modality encoding module to the intra-modality representation and generates an inter-modality representation of the data object. The inter-modality encoding module transforms the intra-modality representation to an inter-modality representation based on a relationship between the data object and data objects associated with different modalities. For example, the inter-modality encoding generates the inter-modality representation of a text token, based on one or more regions of an image that are associated with the input text sequence of the text token. In another example, the inter-modality representation identifies that the region corresponds to one or more tokens that describe the image object depicted in the region. As stated above, the tokens are derived from processing the text sequence.
At step 210, the VL modeling application 102 applies a relationship probing module to generate a graph structure that represents one or more dependencies between the data objects. For example, the graph structure identifies one or more dependencies between a plurality of regions of an image. The dependencies can indicate contribution of image regions towards the classification of a corresponding image region. The graph structure is generated by processing the inter-modality representations of the data objects. In some instances, the relationships can be depicted as a node-edge graph structure, which can be overlaid on an input image. The graph structure can also be used as input for various vision-language tasks, such as image captioning, visual reasoning, and image retrieval.
At step 212, the VL modeling application 102 uses the graph structure to perform a VL operation. The VL operation can be a VL understanding task (e.g., visual reasoning, visual question answering) or a VL generation task (e.g., image captioning). Depending on the type of VL operation, the intra-modality encoding module, the inter-modality encoding module, and the relationship probing module can be further trained using fine-tuning.
Structured representations of images according to visual relationships are beneficial for many vision and vision-language applications. However, current human-annotated visual relationship datasets suffer from the long-tailed predicate distribution problem, which limits the potential of visual relationship models. To increase efficiency of generating visual relationship datasets, a self-supervised technique is needed, such that the self-supervised technique implicitly learns the visual relationships without relying on any ground-truth visual relationship annotations. The VL modeling application improves existing VL techniques by using: 1) intra- and inter-modality encodings to respectively model relationships within each modality separately and jointly; and 2) relationship probing, which seeks to discover dependencies between modalities that are represented in the graph structure. By leveraging masked language modeling, contrastive learning, and dependency tree distances for self-supervision, the VL modeling application can learn object features that contribute to the implicit visual relationships. The graph structure can be used in various VL tasks that benefit from improved visual relationship understanding.
It has been demonstrated that visual relationships between objects can help improve performance on many CV tasks. Existing VL techniques assume a known explicit graph structure, and limit the graph to the most frequently occurring predicate categories while ignoring others that do not have enough labeled examples. Relaxing this assumption, some techniques transfer the object representations learned with predicate functions to rare predicates in few-shot scene graph generation. Other techniques capture the relations via attention mechanisms. However, unlike object detectors that are trained on unambiguous and objectively defined object class labels, visual relationships are subjective and it is difficult to exhaustively annotate all possible relationships between objects. By contrast, the VL modeling application identifies implicit visual relationships between regions of images using the accompanying captions, but without explicitly defined or labeled visual relationship classes (e.g., predicate labels).
In addition, pretraining machine-learning models can be used to solve various VL problems. The pretraining techniques generally employ BERT-like objectives to learn cross-modal representations from visual region features and word embeddings. Self- and cross-attention mechanisms are also used to learn joint representations that are appropriately contextualized in both modalities. Existing VL pretraining techniques rely heavily on massive visual-linguistic corpora. Moreover, although huge multi-modal training datasets enable pretraining techniques to learn good representations for downstream multi-modal VL tasks, they usually do not benefit visual tasks that deal with only a single visual modality during inference. The VL modeling application overcomes this problem by generating implicit visual object relationships even with only visual inputs during inference, while benefiting greatly from the cross-modality learning objectives during training.
In some instances, the VL modeling application utilizes BERT-based network pretraining to learn a rich set of intermediate representations of both semantic and syntactic information and unearth the representations of dependency grammar relations in text (e.g., caption). Additionally or alternatively, the VL modeling application recovers dependency parse trees that have not been encountered during training. As such, the VL modeling application uses BERT to find visual relationships between image regions without explicitly training on relationship annotations.
In some embodiments, the VL modeling application implements a self-supervised relationship probing (SSRP) framework to identify dependencies between objects from the model's representation space. The SSRP framework is implemented with the following assumptions: (1) when images are slightly modified, the relative visual relationships of objects depicted in those images remain unchanged; and (2) relationships between objects mentioned in image descriptions are visually observable in the corresponding image. The VL modeling application includes three modules, each consisting of a set of layers. In a first transformer encoding module, implicit intra-modal relationships are modeled using transformer encoders (e.g., a BERT encoder). In a second transformer encoding module, cross-modal learning is performed to identify implicit relationship information across different types of modalities. In a third relationship probing module, a relationship probe network is used so that the relationships between visual entities (e.g., image regions) and linguistic entities (e.g., text tokens) are represented explicitly as latent variables. In some instances, the three modules are trained using self-supervision, with a first stage relying on masked language modeling to train the first two modules, and a second stage relying on contrastive learning and linguistic dependency trees as supervisory signals to train the relationship probe network.
The VL modeling application uses the SSRP framework to find dependencies in visual objects or textual entities and to address issues with existing visual relationship models. First, the SSRP framework implements self-supervision rather than explicit supervision. Second, the SSRP framework explicitly models relationships as latent variables. Third, the SSRP framework leverages cross-modal learning but allows a single modality as input at prediction time. Various example experiments were presented to demonstrate that the VL modeling application can benefit both vision and VL understanding tasks.
(a) Types of the SSRP Framework
A difference among the three SSRP variants 302, 304, and 306 lies in the inter-modality encoding process. SSRPShare 302 shares the inter-modality encoder fInterVS across images and sentences. SSRPVisual 304 adopts an inter-modality encoder fInterV→S in which visual features unidirectionally attend to language features. The notation S→V indicates that textual features attend to visual features for inter-modality encoding. SSRPCross 306 uses a cross-attention encoder fInterV↔S in which features of one modality attend to features of the other modality and vice versa. In some instances, the three SSRP variants can be pretrained, fine-tuned, and used to support different downstream tasks. Note that SSRPCross can be used to support visual-textual multi-modal downstream tasks such as Visual Question Answering (VQA) tasks, while SSRPShare and SSRPVisual can be used to process not only multi-modal downstream tasks but also single-modal visual tasks such as image captioning.
(b) Components of the SSRP Framework
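The difference between unidirectional and bidirectional (cross) attention can be illustrated with a simplified, single-head attention function. This sketch is one possible reading of the variants described above and rests on several assumptions: learned query, key, and value projections, multiple heads, and residual connections are omitted, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention: each query row becomes an
    attention-weighted average of the context rows."""
    dim = len(context[0])
    scale = math.sqrt(dim)
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        output.append(
            [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        )
    return output


def unidirectional(visual: Matrix, textual: Matrix) -> Tuple[Matrix, Matrix]:
    """Only the visual features are updated by attending to the textual features."""
    return attend(visual, textual), textual


def bidirectional(visual: Matrix, textual: Matrix) -> Tuple[Matrix, Matrix]:
    """Each modality is updated by attending to the other modality."""
    return attend(visual, textual), attend(textual, visual)
```

In the unidirectional case the textual features pass through unchanged, whereas in the bidirectional case both sets of features are recomputed from the other modality.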
Input Embeddings.
The input for the three SSRP pretraining models includes both visual and textual elements, where the former is defined as image regions-of-interest (RoIs) in an image and the latter is defined as the tokens in a caption. Given an image I, an image-processing application applies a convolutional neural network (e.g., a Faster-RCNN) to the image to detect RoIs V={v1, . . . , vN
Intra-Modality Encoding.
The VL modeling application uses a first transformer encoder to perform intra-modality encoding, thereby generating a model that identifies the intra-relations of the encoded representations in one modality via self-attention, similar to BERT. Specifically, the VL modeling application randomly masks out v\i and w\j with a fixed probability (e.g., 15%), and feeds the masked image input embeddings {v1, . . . , v\i, . . . , vN
Inter-Modality Encoding.
The VL modeling application uses a second transformer encoder to perform inter-modality encoding, thereby generating a model that identifies cross-modality relationships between image and textual entities. The three SSRP pretraining models use different inter-modality encoding schemes as illustrated in
Relationship Probing.
The VL modeling application uses relationship probing to model the implicit relationship among visual or textual entities. Specifically, the VL modeling application generates a latent relationship graph for the objects in an image and a latent relationship graph for the tokens in a caption. In particular, the latent relationship graph structures are generated based on the unmasked contextual object representations v1:N
$$d_{B_u}(u_i, u_j)^2 = \left(B_u(u_i - u_j)\right)^T \left(B_u(u_i - u_j)\right)$$
where u∈{v, w}, i and j are the object/token indices, and Bu are the parameters for the probe layer. As discussed further below, the learning goal of a structural probe is to determine the edge distances between all pairs of nodes, in which the nodes correspond to image regions or tokens of the respective graph structures. The outputs of the visual probe and the textual probe layer are respectively the distance matrices Rv=(dB
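The probe distance defined above can be illustrated with a minimal sketch. The function name `probe_distance_matrix`, the toy representations, and the toy probe matrix `B` are assumptions made for illustration only; they are not taken from this disclosure.

```python
def probe_distance_matrix(reps, B):
    """Return squared probe distances d_B(u_i, u_j)^2 for all pairs of rows in reps.

    reps: list of representation vectors u_1..u_N (hypothetical toy inputs).
    B:    probe matrix given as a list of rows (hypothetical toy parameters).
    """
    # Project each representation once. Because the probe is linear,
    # B(u_i - u_j) equals B u_i - B u_j.
    projected = [[sum(b * x for b, x in zip(row, u)) for row in B] for u in reps]
    n = len(projected)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            diff = [a - b for a, b in zip(projected[i], projected[j])]
            # Squared Euclidean norm of the projected difference.
            dist[i][j] = sum(d * d for d in diff)
    return dist


reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[2.0, 0.0], [0.0, 1.0]]
R = probe_distance_matrix(reps, B)
```

The resulting matrix is symmetric with a zero diagonal, which is what one would expect of a matrix of pairwise distances between graph nodes.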
(a) Architecture
A BERT model uses Masked Language Modeling (MLM), a self-supervised pretraining objective that allows a transformer encoder to encode a sequence from both directions simultaneously. Specifically, for an input sequence S=(w1, . . . , wN) of N tokens, BERT first randomly masks out 15% of the tokens and then predicts the masked tokens in the output. The masked tokens in the input sequence are represented by a special symbol [MASK] and fed into a multi-layer transformer encoder. For example, let Hl=(h1, . . . , hN) be the encoded features at the l-th transformer layer, with H0 being the input layer. The features at the (l+1)-th layer are obtained by applying a transformer block defined as:
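$$H^{l+1} = \mathrm{LN}\left(\mathrm{LN}\left(H^l + f^l_{\text{Self-Att}}(H^l)\right) + f^l_{\mathrm{FF}}\left(\mathrm{LN}\left(H^l + f^l_{\text{Self-Att}}(H^l)\right)\right)\right)$$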
where LN stands for layer normalization, fSelf-Attl(⋅) is a multi-headed self-attention sub-layer, and fFFl(⋅) is a feed-forward sub-layer composed of two fully-connected (FC) layers, each sub-layer being wrapped in a residual connection followed by an LN. The token representations in the final layer are used to predict the masked tokens independently.
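The masking step of the MLM objective described above can be sketched as follows. This is a simplified illustration: it masks each token independently with probability 0.15 and always substitutes the [MASK] symbol, whereas practical BERT implementations also sometimes keep or randomly replace the selected token. The function name `mask_tokens` and the example sentence tokens are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, targets); targets maps masked positions to original tokens."""
    if rng is None:
        rng = random.Random()
    masked, targets = [], {}
    for idx, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the encoder input
            targets[idx] = tok    # remember what the encoder must predict
        else:
            masked.append(tok)
    return masked, targets


tokens = "a dog standing in the grass near a flying frisbee".split()
masked, targets = mask_tokens(tokens, rng=random.Random(0))
```

The encoder would receive `masked` as input, and the prediction loss would be computed only at the positions recorded in `targets`.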
(b) Implementation
Decoder 420 may also include a stack of N layers 422. In addition to the two sub-layers in each encoder layer 412 described above, each layer 422 in decoder 420 may include a third sub-layer that performs multi-head attention over the output of the encoder stack. Similar to layers 412 in encoder 410, residual connections around each of the sub-layers may be used in layers 422 in decoder 420, followed by layer normalization. The self-attention sub-layer in the decoder stack may be modified (labeled as “masked multi-head attention”) to mask inputs to the decoder from future time steps and prevent positions from attending to subsequent positions. The masking, combined with offsetting the output embeddings by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i. Decoder 420 may generate one word at a time from left to right. The first word generated at a layer may be based on the final representation of the encoder (offset by 1 position). Every word predicted subsequently may attend to the previously generated words at that layer of the decoder and the final representation of the encoder.
An attention function may map a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. A query vector q encodes the word/position that is paying attention. A key vector k encodes the word to which attention is being paid. The key vector k and the query vector q together determine the attention score between the respective words. The output is computed as a weighted sum of values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
The scaled dot-product attention can be written as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
where Q is the matrix of queries packed together, and K and V are the matrices of keys and values packed together. The scaled dot-product attention computes the dot-products (attention scores) of the queries with all keys (“MatMul”), divides each element of the dot-products by a scaling factor $\sqrt{d_k}$ (“scale”), applies a softmax function to obtain the weights for the values, and then uses the weights to determine a weighted sum of the values.
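The sequence of operations described above (MatMul, scale, softmax, weighted sum) can be sketched in plain Python. The function names and the toy matrices are illustrative assumptions; a practical implementation would use batched tensor operations.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors; returns one output row per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # "MatMul" and "scale": dot product with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

In this toy case the query matches the first key more closely, so the output leans toward the first value vector.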
When only a single scaled dot-product attention 530 is used to calculate the weighted sum of the values, it can be difficult to capture various different aspects of the input. For instance, in the sentence “I like cats more than dogs,” one may want to capture the fact that the sentence compares two entities, while retaining the actual entities being compared. To address this issue, the transformer 400 uses the multi-head self-attention sub-layer to allow the encoder and decoder to see the entire input sequence all at once. To learn diverse representations, the multi-head attention applies different linear transformations to the values, keys, and queries for each attention head, where different weight matrices may be used for the multiple attention heads and the results of the multiple attention heads may be concatenated together.
$W_i^K$ with dimensions $d_{model} \times d_k$
$W_i^Q$ with dimensions $d_{model} \times d_k$
$W_i^V$ with dimensions $d_{model} \times d_v$
The outputs of the multiple scaled dot-product attentions are concatenated, resulting in a matrix of dimensions $d_i \times (h \times d_v)$, where $d_i$ is the length of the input sequence. Afterwards, a linear layer with weight matrix $W^O$ of dimensions $(h \times d_v) \times d_e$ is applied to the concatenation result, leading to a final result of dimensions $d_i \times d_e$:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \quad (5)$$
where de is the dimension of the token embedding. Multi-head attention allows a network to jointly attend to information from different representation subspaces at different positions. The multi-head attention may be performed using a tensor operation, which may be split into multiple sub-operations (e.g., one for each head) and performed in parallel by multiple computing engines as described above.
In the example shown in
As indicated above, data augmentation is used to avoid overfitting when training machine-learning models. Data augmentation artificially boosts the range and number of training examples (e.g., training images) by transforming existing examples to create additional examples. In some instances, data augmentation is used to rotate, stretch, and reflect each training image to produce many variants, possibly yielding enough labeled data to improve training of machine-learning models, including the transformer encoders and the relationship probe used by the VL modeling application.
Data augmentation is additionally performed on a text sequence 808 that accompanies the input image 802 (e.g., a caption). In this example, the text sequence 808 describes one or more objects depicted in the input image 802, such as “A dog standing in the grass near a flying Frisbee.” For this sentence-level data augmentation, transformer-based neural machine translation models pretrained on WMT19 can be used to perform back-translation and generate a set of augmented text data 810. Back-translation refers to translating a document from an original source language (e.g., English) to a target language (e.g., German) and then back to the original source language. For ground-truth dependency trees, each sentence is parsed with the dependency parser provided by Stanza.
To train the transformer encoders and the relationship probe, a training system employs two learning stages. The training system can be a separate system from the VL modeling application, which trains and provides the machine-learning models to be used by the VL modeling application. In the first stage, the training system trains the transformer encoders (e.g., the BERT encoders) including the intra-modality encoders and the inter-modality encoders to obtain the contextual object representations v1:N
(a) Transformer Encoders
Masked Language Modeling with RoI Feature Reconstruction.
The training system trains transformer encoders, such as the BERT encoders, with the MLM objective to predict a masked RoI feature vi and a masked token wj given their surroundings I\i and S\j. In some instances, the training system includes a smooth L1 reconstruction loss for grounding of visual features. The following loss function is defined:
ℒMLM=−[log p(vi|I\i,S̃)+log p(wj|S\j,Ĩ)−Σi smoothL
where Ĩ and S̃ are the image regions and input words with random masking, g(⋅) outputs the unmasked visual feature, p(vi|I\i,S̃) and p(wj|S\j,Ĩ) are respectively the predicted probabilities for the target object label and word given the masked inputs, and I and S are sampled from the training set. The symbols v and w are used to represent both the visual features and the label/word for simplicity.
Image-Text Matching.
An additional loss function is added to perform the instance-level alignment between an image and its corresponding text sequence. Both positive (y=1) and negative (y=0) image-sentence pairs are sampled and the model learns to align with a binary cross-entropy loss:
$$\mathcal{L}_{\mathrm{Match}} = -\left[y \log p(f_{\mathrm{align}}) + (1 - y)\log\left(1 - p(f_{\mathrm{align}})\right)\right]$$
where p(falign) is the output probability of a binary classifier and falign is the visual-textual alignment representation. For SSRPShare and SSRPVisual, falign is computed as galign (
The overall training loss for the first-stage pretraining thus becomes: $\mathcal{L}_{\mathrm{Stage1}} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{Match}}$.
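As a minimal sketch of the image-text matching term and of how the two first-stage terms are summed, the following assumes that a classifier has already produced an alignment probability for an image-sentence pair. The function names and the numeric MLM loss value are hypothetical placeholders.

```python
import math


def match_loss(p_align, y):
    """Binary cross-entropy for one image-sentence pair.

    p_align: classifier probability that the pair is aligned (hypothetical input).
    y:       1 for a matched pair, 0 for a mismatched pair.
    """
    eps = 1e-12  # clamp to avoid log(0); an implementation detail assumed here
    p = min(max(p_align, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def stage1_loss(mlm_loss, matching_loss):
    """First-stage objective: sum of the MLM term and the matching term."""
    return mlm_loss + matching_loss


example_mlm = 2.3  # placeholder value standing in for a computed MLM loss
total = stage1_loss(example_mlm, match_loss(0.9, 1))
```

A confident correct prediction (for example, 0.9 for a matched pair) contributes a small matching loss, while a confident wrong prediction contributes a large one.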
(b) Relationship Probes
In the second stage, the relationship probe layers are learned via a probe loss $\mathcal{L}_{\mathrm{ProbeS}}$ and a contrastive loss $\mathcal{L}_{\mathrm{CL\text{-}all}}$, where the former ensures that the learned textual relationships Rw are structurally consistent with a dependency tree and the latter ensures that the learned relationships Rv and Rw remain stable across different data augmentations.
For text data, the training system uses a pre-parsed dependency tree w for each sentence to guide the textual relationship probe learning with ProbeS, which is defined as:
where (wi, wj) is the distance between tokens wi and wj in the dependency tree w.
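The following sketch illustrates one way a textual probe loss of this kind could be computed. It assumes, as in commonly used structural probes, that the loss is the mean absolute difference between the tree distance and the squared probe distance over all token pairs; the exact form and normalization used in this disclosure are not reproduced here. The head-index encoding of the dependency tree and all function names are hypothetical.

```python
from collections import deque


def tree_distances(heads):
    """All-pairs path lengths in a dependency tree.

    heads[i] is the index of the head of token i, or -1 for the root
    (an assumed encoding of a pre-parsed dependency tree).
    """
    n = len(heads)
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    dist = [[0] * n for _ in range(n)]
    for src in range(n):
        seen = {src}
        queue = deque([(src, 0)])
        while queue:  # breadth-first search from each token
            node, d = queue.popleft()
            dist[src][node] = d
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return dist


def probe_loss(tree_dist, probe_sq_dist):
    """Mean absolute gap between tree distances and squared probe distances."""
    n = len(tree_dist)
    total = sum(abs(tree_dist[i][j] - probe_sq_dist[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)


heads = [1, 2, -1]  # "the" -> "dog" -> "barks" (root)
gold = tree_distances(heads)
```

A probe whose squared distances reproduce the tree distances exactly would yield a loss of zero under this assumed formulation.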
For the contrastive loss, the training system utilizes stochastic data augmentation techniques to transform an original image (or sentence) into semantics-preserving data samples, and treats them as positive pairs; see
where 1[k≠i]∈{0,1} is an indicator function, i,jx,y=((ix
Note that XCL is invariant to the order of sample indices (i,j) and thus is included just once in CL-all.
In this stage, the overall training objective is: $\mathcal{L}_{\mathrm{Stage2}} = \mathcal{L}_{\mathrm{ProbeS}} + \mathcal{L}_{\mathrm{CL\text{-}all}}$.
After the training system trains the machine-learning models, including the two transformer encoders and the relationship probe, the training system may fine-tune the models so that they are configured to perform particular VL tasks. As used herein, fine-tuning includes performing a secondary optimization that adjusts the parameters of the trained transformer encoders and the relationship probe to solve a new set of problems. In other words, fine-tuning refits the weights of a model trained without supervision so that the model performs a supervised task.
(a) Visual Reasoning
The transformer encoders and the relationship probe are fine-tuned to solve visual reasoning tasks. Visual reasoning refers to a problem in which the machine-learning model is trained to determine whether a natural language caption is true about a pair of photographs. For example, the transformer encoders and the relationship probe are fine-tuned to solve tasks presented in Natural Language for Visual Reasoning 2 (NLVR2) datasets. The NLVR2 includes two language grounding datasets containing natural language sentences grounded in images. As stated above, visual reasoning requires the machine-learning models to determine whether the natural language statement S is true about an image pair (Ii, Ij).
To fine-tune the machine-learning models, the training system feeds alignment representations of the two images and probed relationships to a binary classifier:
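$$p(I_i, I_j, S) = \mathrm{Sigmoid}\left(f_{\mathrm{FC}}\left(f_{\mathrm{FC+GeLU+LN}}([q_i; q_j])\right)\right)$$
$$q_k = f_{\mathrm{FC+GeLU+LN}}\left([f_{\mathrm{align}}^k; R_k^v; R_k^w]\right), \quad k \in \{i, j\}$$
$$f_{\mathrm{align}}^k, R_k^v, R_k^w = \mathrm{SSRP}(I_k, S)$$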
where falignk, Rkv, and Rkw are outputs of SSRP(Ik, S), Sigmoid is defined as the sigmoid activation function of the binary classifier, and faligni and falignj are the visual-textual alignment representations for Ii, S and Ij, S, respectively.
For baseline models that do not consider relationships, the predicted probability is calculated as:
$$p(I_i, I_j, S) = \mathrm{Sigmoid}\left(f_{\mathrm{FC}}\left(f_{\mathrm{FC+GeLU+LN}}([f_{\mathrm{align}}^i; f_{\mathrm{align}}^j])\right)\right)$$
During fine-tuning, the models are optimized with binary cross-entropy loss functions.
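The classifier head used for visual reasoning can be sketched as below. This is an illustrative sketch only: each input feature vector is assumed to be the flattened concatenation of an alignment representation and the probed relationship matrices for one image, the two images are assumed to share the same projection weights, layer normalization is applied without learned gain or bias, and all parameter names and values are hypothetical.

```python
import math


def fc(x, W, b):
    """Fully-connected layer: W is a list of rows, b a list of biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def gelu(x):
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def fc_gelu_ln(x, W, b):
    """FC, then GeLU, then LN."""
    return layer_norm(gelu(fc(x, W, b)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nlvr2_probability(feat_i, feat_j, params):
    """feat_k stands in for the concatenation [f_align^k; R_k^v; R_k^w]."""
    q_i = fc_gelu_ln(feat_i, params["W_q"], params["b_q"])
    q_j = fc_gelu_ln(feat_j, params["W_q"], params["b_q"])
    hidden = fc_gelu_ln(q_i + q_j, params["W_h"], params["b_h"])  # [q_i; q_j]
    logit = fc(hidden, params["W_out"], params["b_out"])[0]
    return sigmoid(logit)


params = {
    "W_q": [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]], "b_q": [0.0, 0.1],
    "W_h": [[0.1, 0.2, -0.3, 0.4], [-0.2, 0.1, 0.3, 0.2]], "b_h": [0.0, 0.0],
    "W_out": [[0.7, -0.4]], "b_out": [0.05],
}
feat_i = [1.0, 0.5, -0.5]
feat_j = [0.2, -1.0, 0.8]
p = nlvr2_probability(feat_i, feat_j, params)
```

During fine-tuning, the probability produced by such a head would be compared against the true/false label with a binary cross-entropy loss.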
(b) Visual Question Answering
The transformer encoders and the relationship probe are fine-tuned to solve Visual Question Answering (VQA) tasks. VQA refers to a vision-language task that aims to answer questions based on an image. A VQA system takes an image input and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as the output.
VQA requires the trained machine-learning models to answer a natural language question Q related to an image I. For example, the machine-learning models are used to solve problems presented in the VQA v2.0 dataset, which includes open-ended questions about a set of images. The training system thus fine-tunes the transformer encoders and the relationship probe on the train split and evaluates them on the test-standard split. Note that VQA is based on the COCO image corpus, but the questions have never been seen by the model during training. During fine-tuning, the region features and the given question are fed into the model, which outputs pooled features that are fed to a classifier for answer prediction:
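$$p(I, Q) = \mathrm{Sigmoid}\left(f_{\mathrm{FC}}\left(f_{\mathrm{FC+GeLU+LN}}(q)\right)\right)$$
$$q = f_{\mathrm{FC+GeLU+LN}}\left([f_{\mathrm{align}}; R^v; R^w]\right)$$
$$f_{\mathrm{align}}, R^v, R^w = \mathrm{SSRP}(I, Q) \quad (2)$$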
During training, we fine-tune the model using the cross-entropy loss.
(c) Image Captioning
The transformer encoders and the relationship probe are fine-tuned for image captioning tasks. Image captioning refers to the process of generating a textual description from an image by analyzing the objects and actions depicted in the image. For image captioning, the training system fine-tunes only the image branch of SSRPVisual, and feeds the unmasked image features into SSRPVisual. In particular, the training system extracts refined contextualized visual representations v1:N
(d) Image Retrieval
For image retrieval, the training system feeds the unmasked image features into SSRPVisual and obtains the refined contextualized visual representations along with the implicit visual relationships. Image retrieval tasks aim to find images similar to a query image within an image dataset. For example, a search engine may search by image features to perform the image retrieval task.
For the Obj.+Rel. technique 1004, the training system uses the relationship-enhanced visual features obtained with
(a) Training
Pretraining Corpus.
To increase the amount of training data, the pretraining phase uses combined pretraining datasets such as Conceptual Captions (CC), Stony Brook University (SBU) captions, Microsoft® Common Objects in Context (MSCOCO), the VQA dataset, Question Answering on Image Scene Graphs (GQA), Visual Genome (VG), BooksCorpus (BC), and English Wikipedia (EW). For this particular experiment, pretraining data is aggregated from the train (113k) and validation (5k) splits of MSCOCO. Specifically, with each MSCOCO image associated with five independent caption annotations, MSCOCO provides an aligned VL dataset of 591K image-and-sentence pairs on 118K distinct images. Table 1 summarizes the corpus used by different pretraining methods.
Data Augmentation.
As an alternative to combining existing VL datasets, the pretraining corpus was expanded via data augmentation on both images and sentences, as shown in Table 2. For data augmentation on images, horizontal flipping (HFlip) was applied at the image level, and several augmentations were applied at the RoI feature level, including HFlip, rotations (90°, 180°, and 270°), and bounding box jittering (with scale factors selected from the range of [0.8, 1.2]). For text sequences, the training data was enriched through two pretrained back-translators: English-German-English (En-De-En) and English-Russian-English (En-Ru-En). These augmentation strategies generated significantly more training samples: 1.65M at RoI level and 1.77M at sentence level, while largely preserving the semantic information.
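Two of the box-level augmentations named above can be sketched as follows. The text does not specify how the scale factor is applied during jittering, so this sketch assumes that a single factor drawn from [0.8, 1.2] scales the width and height of the box about its center. Boxes are assumed to be (x1, y1, x2, y2) tuples in pixel coordinates, and the function names are hypothetical.

```python
import random


def hflip_box(box, image_width):
    """Mirror a bounding box horizontally within an image of the given width."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def jitter_box(box, rng, low=0.8, high=1.2):
    """Rescale a box about its center by a random factor in [low, high]."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    scale = rng.uniform(low, high)  # assumed: one factor for width and height
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


box = (10, 20, 30, 40)
flipped = hflip_box(box, image_width=100)
jittered = jitter_box((10.0, 20.0, 30.0, 40.0), random.Random(0))
```

Both transformations keep the box covering roughly the same object, which is consistent with the goal of preserving semantic information while producing additional samples.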
Pretraining Setting.
The three SSRP variants were trained with the augmented training data, as described above. The numbers of layers for the intra-modality encoders fIntraS↔S and fIntraV↔V were set to 9 and 5, respectively, and the number of layers for each of the inter-modality encoders fInterVS, fInterS→V and fInterV↔S was set to 5. For each transformer block, the hidden size was set to 768 and the number of heads was set to 12. To keep the sizes of the relationship matrices the same, the maximum numbers of words and objects were both set to 36.
Pretraining was divided into two stages. In stage 1, the Stage1 training loss was used. At each iteration, input words and RoIs were randomly masked with a probability of 0.15. All models were initialized with BERT pretrained weights, and the respective pretraining corpus is listed in Table 2. For cross-modality matching, each sentence was replaced with a mismatched one with a probability of 0.5. An Adam optimizer with a linear learning-rate schedule was used with a peak learning rate of 1e−4. The training was run on four Tesla V100 GPUs with a batch size of 128 for 10 epochs. After stage 1, the parameters of the intra-modality and inter-modality encoders were frozen, and the relationship probes were trained with the Stage2 training loss. The syntactic dependency tree for each sentence was generated. All variants of SSRP were trained for 30 epochs with Adam, a batch size of 512, and a learning rate of 5e−5.
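A linear learning-rate schedule with a peak of 1e−4 could take the following form. The text does not state whether a warmup phase was used, so the optional `warmup_steps` argument, the function name, and the step counts in the example are assumptions for illustration.

```python
def linear_schedule(step, total_steps, peak_lr=1e-4, warmup_steps=0):
    """Learning rate at a given step under a linear schedule.

    With warmup_steps > 0 the rate rises linearly from 0 to peak_lr, then
    decays linearly to 0 at total_steps. With warmup_steps == 0 it starts at
    peak_lr and decays linearly to 0.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical run of 1000 optimizer steps with 100 warmup steps.
rates = [linear_schedule(s, 1000, warmup_steps=100) for s in (0, 50, 100, 550, 1000)]
```

The rate reaches its peak exactly at the end of warmup and falls to zero at the final step, so later updates make progressively smaller changes to the weights.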
Fine-Tuning Tasks.
The pretrained models were fine-tuned to handle multiple downstream tasks: three VL understanding tasks (NLVR2, VQA, and GQA) and a generation task (image captioning), following the standard fine-tuning settings for downstream tasks. For VL understanding tasks, linearly-fused probed relationships and the visual-textual alignment prediction falign were used as features. For image captioning, the Up-Down framework was used and the refined object features learned by SSRPVisual were incorporated. The captioning model was first trained with a cross-entropy loss and then with a reinforcement learning loss.
(b) Results
The experiment first performed ablation experiments over a few design choices of the present embodiments on NLVR2. The experiment then showed the comparison results on VQA, GQA and image captioning tasks.
Effect of Data Augmentation.
Table 3 shows the ablation study results. For the ‘Raw’ setting, the experiment pretrained the machine-learning models only on the original corpus, while in the ‘Aug.’ setting, the experiment augmented the original corpus with the augmentation techniques mentioned in Table 2. It is evident that the data augmentation strategy indeed improves the performance of all three models. Note that data augmentation was used only during pretraining, but not during fine-tuning.
Effect of Attention.
Comparing the three variants that use different attention settings in Table 3, we observe that SSRPCross performs the best, and SSRPVisual is better than SSRPShare. This confirms the benefits of the cross-attention structures that enable the features of one modality to attend to the other.
Effect of Relationship Probing.
To analyze the effectiveness of the visual and textual relationships learned via pretraining, the experiment concatenated the visual-textual alignment representation falign and relationships (Rel.) to form a relationship-aware feature vector for answer prediction. As seen from Table 3, using language relationships Rw leads to better results than using visual relationships Rv. This is due to the available dependency tree for supervising the language model during training, while the visual relationships are learned in a completely self-supervised way. Combining visual and textual relationships achieves the best results. Based on the results, SSRPCross (75.71) outperforms LXMERT (74.9) and VisualBERT (67.4) on NLVR2 dev-set, demonstrating that the probed relationships are beneficial for the reasoning task.
Results on VQA and GQA.
Table 4 shows the performance of SSRPCross on VQA and GQA. SSRPCross outperforms ViLBERT and VisualBERT, while being highly competitive with the best method, which is trained with considerably larger training corpora.
Results on Image Captioning.
Unlike the recent VL pretraining methods, which cannot be applied to single-modality vision tasks such as image captioning due to the cross attention used in pretraining, SSRPShare and SSRPVisual models do not have such a limitation. Thus, the experiment applied the stronger model SSRPVisual to image captioning using its refined object features and the learned implicit visual relationships. Table 5 shows the quantitative results, where SSRPVisual outperforms the baselines, indicating that the learned relationship-aware image representations can benefit image captioning. Note that the online results of BUTD are achieved with model ensemble, while we use a single model.
To further verify the benefits of implicit visual relationships in single-modality visual tasks, the experiment performed image retrieval on MSCOCO with SSRPVisual then compared results of the image retrieval with other techniques.
The VL modeling application thus uses self-supervised visual relationship probing techniques to implicitly learn visual relationships without training on ground-truth relationship annotations. The VL modeling application transfers the textual relationships from image descriptions to image objects and explores the visual relationships by maximizing the agreement between differently augmented images via contrastive learning. Through relationship probes, it has been demonstrated that relationship structures in images and sentences emerge with the application of well-designed distance and contrastive learning objectives.
Current representation learning models such as BERT and the like follow a similar structure. Probing the implicit knowledge that these models capture about language and vision can be particularly advantageous for improving the accuracy of downstream tasks such as visual reasoning and image retrieval. Self-supervised relationship probing is a step in that direction and can be used for grounding the relationships expressed in language.
As described above, the VL modeling application uses SSRP, which is a self-supervised relationship probing method for visual and textual relationship extraction. SSRP can be used to enrich the existing scene graph generation methods and to complete the missing relationships between objects. The visual relationships generated by the VL modeling application could be applied to a wide range of vision and vision-language applications including image captioning, image retrieval, object detection, visual question answering, visual reasoning, and visual-textual cross-modal retrieval.
Any suitable computing system or group of computing systems can be used for performing the operations described herein. For example,
The example of
The memory device 1304 includes any suitable non-transitory computer-readable medium for storing data, program code, or both. A computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable instructions or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, a memory chip, a ROM, a RAM, an ASIC, optical storage, magnetic tape or other magnetic storage, or any other medium from which a processing device can read instructions. The instructions could include processor-specific instructions generated by a compiler or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript.
The computing system 1300 could also include a number of external or internal devices, such as a display device 1310, or other input or output devices. For example, the computing system 1300 is shown with one or more input/output (“I/O”) interfaces 1308. An I/O interface 1308 can receive input from input devices or provide output to output devices. One or more buses 1306 are also included in the computing system 1300. Each bus 1306 communicatively couples one or more components of the computing system 1300 to each other or to an external component.
The computing system 1300 executes program code that configures the processing device 1302 to perform one or more of the operations described herein. The program code includes, for example, code implementing the VL modeling application 102 or other suitable applications that perform one or more operations described herein. The program code can be resident in the memory device 1304 or any suitable computer-readable medium and can be executed by the processing device 1302 or any other suitable processor. In some embodiments, all modules in the VL modeling application 102 are stored in the memory device 1304, as depicted in
In some embodiments, the computing system 1300 also includes a network interface device 1312. The network interface device 1312 includes any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks. Non-limiting examples of the network interface device 1312 include an Ethernet network adapter, a modem, and/or the like. The computing system 1300 is able to communicate with one or more other computing devices (e.g., a computing device that receives inputs for VL modeling application 102 or displays outputs of the VL modeling application 102) via a data network using the network interface device 1312.
An input device 1314 can include any device or group of devices suitable for receiving visual, auditory, or other suitable input that controls or affects the operations of the processing device 1302. Non-limiting examples of the input device 1314 include a touchscreen, stylus, a mouse, a keyboard, a microphone, a separate mobile computing device, etc. An output device 1316 can include any device or group of devices suitable for providing visual, auditory, or other suitable sensory output. Non-limiting examples of the output device 1316 include a touchscreen, a monitor, a separate mobile computing device, etc.
Although
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter could be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multi-purpose microprocessor-based computer systems accessing stored software that programs or configures the computing system from a general purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages could be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein can be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied—for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values could, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, could readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude the inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.