The present invention is based on and claims foreign priority to Chinese patent application No. 202211110308.2 filed Sep. 13, 2022, the entire content of which is incorporated herein by reference.
The present disclosure relates to the technical field of visual dialogue, and specifically to a visual dialogue method and system.
With the rapid development of deep neural networks (DNNs), the integration of vision and language has also received increasing attention, for example in image/video caption generation and visual question answering. A visual dialogue is an extension of visual question answering. Different from visual question answering, which includes only a single round of question answering and lacks consistent and continuous interaction, the visual dialogue mainly researches a plurality of rounds of continuous question answering. The visual dialogue aims to give an accurate answer based on an input image, an input question, and a previous dialogue history. Visual dialogue can be used in many fields, such as AI-based blind assistance, robots, and voice assistants.
Visual and text content needs to be understood comprehensively and deeply to answer a current question accurately. A basic solution is to use an encoder to extract global features of an image, a question, and a historical dialogue, and then fuse these features into a joint representation to infer a final answer. However, this operation may lead to information redundancy and cannot address visual co-reference resolution. Researchers subsequently proposed many visual dialogue methods based on an attention mechanism or a graph model to mine necessary visual content and resolve co-reference. However, these methods focus almost exclusively on the internal interaction of the various inputs and cannot deal with a complex scene. A complex scene mainly involves: 1) many complex interactions between foreground objects; and 2) a noisy background that may interfere with the foreground and confuse visual reasoning. Although some knowledge-based methods have been proposed more recently, they all rely on a single source of knowledge and provide limited improvement in reasoning capability. In addition, they are also less effective in parsing long and difficult sentences.
The present disclosure is intended to provide a visual dialogue method and system to parse long or complex questions and corresponding answers and to handle visual scenes with complex interactions between entities to achieve a more accurate dialogue.
In order to resolve the above technical problem, the present disclosure adopts the following technical solutions:
The present disclosure provides a visual dialogue method, including:
Optionally, the preprocessing text data and image data in the original input data to obtain a text feature sequence and a visual feature sequence respectively includes:
Optionally, the potential knowledge searcher includes an aggregation operation unit, a Bi-LSTM unit, and a similarity calculation unit. The aggregation operation unit is configured to obtain the visual feature sequence. The Bi-LSTM unit is configured to obtain text data in the text corpus, and the similarity calculation unit is configured to calculate the similarity between the text data and the visual data to obtain the text sequence knowledge.
Optionally, the performing data fusion on the text feature sequence, the visual feature sequence, the text sequence knowledge, and the sparse scene graph to obtain a data fusion result includes:
Optionally, the obtaining a first attention result by using a question-guided first attention module based on the text feature sequence and the new question includes:
Optionally, the performing sentence-level attention guidance on the text feature sequence by using the new question to obtain an attention feature is implemented in the following manner:
where $\tilde{s}_z$ represents the attention feature; $\alpha_r^z$ represents a weight coefficient, and $\alpha_r^z=\mathrm{softmax}(W_1(f_q^z(q_t)\circ f_p^z(z_r))+b_1)$; $z_r$ represents a sentence feature of an $r$th round of dialogue; $W_1$ and $b_1$ represent a first learnable parameter and a first offset, respectively; $f_q^z$ and $f_p^z$ represent nonlinear transformation layers; $\circ$ represents an element-wise multiplication operation; $q_t$ represents a sentence-level question feature; and $r$ represents a quantity of rounds of dialogue.
The step of filtering the attention feature by using a sigmoid activation function is implemented in the following manner:
$\tilde{Z}=\mathrm{gate}_z\circ[q_t,\tilde{s}_z]$
where $\tilde{Z}$ represents the sentence-level sequential representation of the potential knowledge; $\mathrm{gate}_z$ represents a gating function, and $\mathrm{gate}_z=\sigma(W_2[q_t,\tilde{s}_z]+b_2)$; $\sigma$ represents the sigmoid activation function; $W_2$ represents a second learnable parameter; $b_2$ represents a second offset; $q_t$ represents the sentence-level question feature; and $\tilde{s}_z$ represents the attention feature.
Optionally, the step of obtaining a word-level sequential representation of the potential knowledge by calculating a dot-product of an attention feature and the sigmoid activation function based on a word-level question feature of the new question is implemented in the following manner:
$e^w=\mathrm{gate}_z\circ[u_q,\tilde{w}_z]$
where $e^w$ represents the word-level sequential representation of the potential knowledge; $u_q$ represents the word-level question feature; $\tilde{w}_z$ represents a word-level attention feature obtained by weighting the word-level knowledge features with an attention weight coefficient derived from $t_{r,j}^w$, where $t_{r,j}^w=f_q^w(u_q)^T f_p^w(u_{r,j}^Z)$; $f_q^w$ and $f_p^w$ represent nonlinear transformation layers; $T$ represents a matrix transpose operation; $u_{r,j}^Z$ represents a word-level feature of the text sequence knowledge; $w_{r,j}^Z$ represents a word embedding feature of the text feature sequence; $j$ represents a word index; and $r$ represents a quantity of rounds of dialogue.
The present disclosure further provides a visual dialogue system, including:
Optionally, the data fusion subsystem includes:
Optionally, the potential knowledge searcher includes an aggregation operation unit, a Bi-LSTM unit, and a similarity calculation unit. The aggregation operation unit is configured to obtain the visual feature sequence. The Bi-LSTM unit is configured to obtain text data in the text corpus, and the similarity calculation unit is configured to calculate the similarity between the text data and the visual data to obtain the text sequence knowledge.
The present disclosure has the following beneficial effects.
The principles and features of the present disclosure are described below with reference to the accompanying drawings. The listed embodiments are only used to explain the present disclosure, rather than to limit the scope of the present disclosure.
The present disclosure provides a visual dialogue method. As shown in
S1: Original input data is obtained, where the original input data includes current image data and a new question, and the new question is related to the current image data.
In a visual dialogue, original input data includes current image data $I$ and its text description $C$, dialogue history $H_t=\{C,(Q_1,A_1),(Q_2,A_2),\ldots,(Q_{t-1},A_{t-1})\}$ with respect to the current image data $I$, and new question $Q_t$. By default, the text description $C$ is part of the dialogue history and plays an initialization role in the first round of question answering. The visual dialogue obtains the answer to the new question by ranking the candidate answers $A_t$.
S2: Text data and image data in the original input data are preprocessed to obtain a text feature sequence and a visual feature sequence, respectively. Specifically including:
Herein, each word in the new question $Q_t$ is first embedded into $\{w_1^q, w_2^q, \ldots, w_L^q\}$ by using a pre-trained global vector (GloVe) model, where $L$ represents a total quantity of words in the new question $Q_t$. Then, the embedded word feature of each word is input into a Bi-LSTM network, and a feature of a last hidden layer is selected as a sentence-level question feature $q_t$. In addition, the dialogue history $H_t$ and the candidate answers $A_t$ are also encoded by using the GloVe model and different Bi-LSTM networks to generate a historical feature $h$ and a candidate answer feature $a_t$.
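By way of illustration only, the following minimal PyTorch sketch shows how a question could be embedded with pre-loaded GloVe vectors and encoded with a Bi-LSTM whose final hidden states serve as the sentence-level feature; the module and dimension names are illustrative and not taken from the disclosure.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embeds a tokenized question with GloVe vectors and encodes it with a Bi-LSTM."""

    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 512):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pre-trained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.bilstm = nn.LSTM(glove_weights.size(1), hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, L) word indices of the question Q_t.
        word_emb = self.embed(token_ids)              # (batch, L, 300)
        word_feats, (h_n, _) = self.bilstm(word_emb)  # word-level features u_q
        # Concatenate the final hidden states of both directions as q_t.
        q_t = torch.cat([h_n[-2], h_n[-1]], dim=-1)   # (batch, 2 * hidden_dim)
        return word_feats, q_t

# Usage: encoders with the same structure (but separate weights) can encode the
# dialogue history H_t and the candidate answers A_t in the same way.
glove = torch.randn(10000, 300)  # placeholder for real GloVe weights
encoder = QuestionEncoder(glove)
u_q, q_t = encoder(torch.randint(0, 10000, (2, 12)))
```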
The image data is encoded by using a Faster R-CNN to obtain the visual feature sequence.
Herein, a Faster R-CNN model obtained through pre-training on the Visual Genome dataset is used to extract object-level features of the image data $I$ and encode them as a visual feature sequence $V=\{V_1, V_2, \ldots, V_n\}$, where $n$ represents a quantity of objects detected in each image.
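For illustration, the following sketch substitutes torchvision's COCO-pretrained Faster R-CNN for the Visual Genome pre-trained detector described above and pools region-of-interest features as the object-level sequence $V$; the function name and the choice of 36 objects are assumptions.

```python
import torch
import torchvision

# A rough stand-in for the Visual Genome pre-trained detector used in the
# disclosure: torchvision's detector is substituted here purely for illustration,
# and its pooled box features serve as the object-level visual sequence V.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def extract_object_features(image: torch.Tensor, top_n: int = 36) -> torch.Tensor:
    """Returns pooled ROI features for the first top_n region proposals."""
    with torch.no_grad():
        images, _ = detector.transform([image])
        feature_maps = detector.backbone(images.tensors)
        proposals, _ = detector.rpn(images, feature_maps)
        box_feats = detector.roi_heads.box_roi_pool(
            feature_maps, proposals, images.image_sizes)
        box_feats = detector.roi_heads.box_head(box_feats)  # (num_proposals, 1024)
    return box_feats[:top_n]  # visual feature sequence V

V = extract_object_features(torch.rand(3, 480, 640))
```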
S3: A text corpus is constructed by using a VisDial dataset.
The VisDial dataset is a frequently used technical means in the art. Therefore, it is also common in the art to use the VisDial dataset to construct the text corpus. Details are not specifically described in the present disclosure.
S4: Text sequence knowledge is obtained by using a potential knowledge searcher based on the visual feature sequence and the text corpus.
Optionally, the potential knowledge searcher includes an aggregation operation unit, a Bi-LSTM unit, and a similarity calculation unit. The aggregation operation unit is configured to obtain the visual feature sequence, the Bi-LSTM unit is configured to obtain text data in the text corpus, and the similarity calculation unit is configured to calculate a similarity between the text data and the visual data to obtain the text sequence knowledge.
The potential knowledge searcher aims to find the Top-K sentences most similar to the current image data from the corpus S. These sentences are considered potential sequence knowledge. It is worth noting that the corpus is composed of the text description of each image in the most popular visual dialogue dataset, VisDial v1.0, and contains 12.3K sentences in total. In addition, it is verified through subsequent experiments that an optimal effect can be achieved by setting K to 10. More specifically, the searcher uses global representations of the image and of the sentences in the corpus to complete the search.
Furthermore, each sentence in the corpus S is embedded in the same way as the new question. Then, dot-product attention and L2 normalization are applied to aggregate the word embeddings into a single vector that is used as a global text representation.
After obtaining the object-level feature $V$ of the image $I$ by using the Faster R-CNN, the present disclosure performs an aggregation operation to form an enhanced visual global representation. In order to find the Top-K sentences from the corpus that approximate the image most closely, dot products between the visual global representation and the global representation of each sentence are used to measure their similarities. Then, the potential knowledge searcher uses the similarities to sort the sentences in descending order to obtain the Top-10 sentences that approximate the image most closely, namely, $Z=\{Z_1, Z_2, \ldots, Z_{10}\}$. After that, the Top-10 sentences are embedded in the same way as the current question to generate sentence features $z=\{z_1, z_2, \ldots, z_{10}\}$. In order to obtain more fine-grained knowledge, word-level knowledge features $u_r^z=\{u_{r,i}^z\}_{i=1}^{N}$ are also extracted.
Finally, sentences with high similarity are retrieved by calculating a similarity between a text global feature and a visual global feature and are taken as text sequence knowledge corresponding to the image for subsequent answer reasoning.
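A minimal sketch of such a search is given below. It assumes mean pooling with L2 normalization as the aggregation operation (the disclosure does not fix the exact aggregation) and uses dot products between the global representations to rank the corpus sentences; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def search_potential_knowledge(object_feats: torch.Tensor,
                               sentence_feats: torch.Tensor,
                               top_k: int = 10):
    """Retrieves the top-k corpus sentences most similar to the image.

    object_feats:   (n, d) object-level visual features V from Faster R-CNN.
    sentence_feats: (num_sentences, d) aggregated global sentence representations.
    """
    # Aggregate object features into a single global visual representation.
    visual_global = F.normalize(object_feats.mean(dim=0), dim=-1)  # (d,)
    text_global = F.normalize(sentence_feats, dim=-1)              # (S, d)
    # Dot products between global representations measure image-sentence similarity.
    similarities = text_global @ visual_global                     # (S,)
    top_scores, top_idx = similarities.topk(top_k)
    return top_idx, top_scores

# Usage: the indices select the sentences Z = {Z_1, ..., Z_10} from the corpus,
# which are then re-embedded with GloVe and a Bi-LSTM as in the question encoding.
idx, scores = search_potential_knowledge(torch.randn(36, 512), torch.randn(12300, 512))
```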
S5: A sparse scene graph is constructed based on the visual feature sequence.
The present disclosure generates a scene graph of the image by using a Neural Motifs model obtained through pre-training on the Visual Genome dataset. The scene graph is a structured representation of the semantic relationships between objects in the image. The scene graph is composed of two parts: a) a set $V=\{V_1, V_2, \ldots, V_n\}$, which is an object-level representation of the image and also serves as the nodes of the scene graph; and b) a set $R=\{r_1, r_2, \ldots, r_m\}$, which represents binary relationships between objects and specifically forms the edges of the scene graph. Each relationship $r_k$ is a triple of the form $\langle V_i, rel_{i\to j}, V_j\rangle$, consisting of a start node $V_i\in V$, an end node $V_j\in V$, and a visual relationship $rel_{i\to j}$. The present disclosure only detects the Top-M relationships of each scene graph to reduce the impact of redundant and invalid information.
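The following sketch illustrates one possible in-memory representation of such a sparse scene graph, assuming the scene-graph generator (e.g. a Neural Motifs model) already provides relationship triples and confidence scores; the class and function names are illustrative.

```python
import torch
from dataclasses import dataclass

@dataclass
class SparseSceneGraph:
    """Sparse scene graph: object nodes plus the Top-M relationship triples."""
    nodes: torch.Tensor  # (n, d) object-level features V_1..V_n
    edges: list          # list of (i, j, rel_feature) triples <V_i, rel_i->j, V_j>

def build_sparse_scene_graph(nodes: torch.Tensor,
                             triples: list,
                             scores: torch.Tensor,
                             top_m: int = 20) -> SparseSceneGraph:
    """Keeps only the Top-M highest-scoring relationships predicted by the
    scene-graph generator, discarding the redundant rest."""
    keep = scores.topk(min(top_m, len(triples))).indices.tolist()
    sparse_edges = [triples[k] for k in keep]
    return SparseSceneGraph(nodes=nodes, edges=sparse_edges)

# Usage with dummy predictions: each triple is (subject index, object index, relation feature).
dummy_triples = [(0, 1, torch.randn(512)), (1, 2, torch.randn(512)), (2, 0, torch.randn(512))]
graph = build_sparse_scene_graph(torch.randn(36, 512), dummy_triples,
                                 torch.tensor([0.9, 0.2, 0.7]), top_m=2)
```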
S6: Data fusion is performed on the text feature sequence, the visual feature sequence, the text sequence knowledge, and the sparse scene graph to obtain a data fusion result. This step specifically includes the following steps:
A first attention result is obtained by using a question-guided first attention module based on the text feature sequence and the new question.
Specifically, sentence-level attention guidance is performed on the text feature sequence by using the new question to obtain an attention feature.
Optionally, the question feature $q_t$ can guide the attention over the retrieved sentences in the following manner:
where $\tilde{s}_z$ represents the attention feature; $\alpha_r^z$ represents a weight coefficient, and $\alpha_r^z=\mathrm{softmax}(W_1(f_q^z(q_t)\circ f_p^z(z_r))+b_1)$; $W_1$ and $b_1$ represent a first learnable parameter and a first offset of a neural network, respectively; $z_r$ represents a sentence feature of an $r$th round of dialogue; $f_q^z$ and $f_p^z$ represent nonlinear transformation layers; $\circ$ represents an element-wise multiplication operation; $q_t$ represents the sentence-level question feature; and $r$ represents the quantity of rounds of dialogue.
The attention feature is filtered by using a sigmoid activation function to obtain a sentence-level sequential representation of potential knowledge.
The attention feature is filtered in the following manner by using the sigmoid activation function:
$\tilde{Z}=\mathrm{gate}_z\circ[q_t,\tilde{s}_z]$
where $\tilde{Z}$ represents the sentence-level sequential representation of the potential knowledge; $\mathrm{gate}_z$ represents a gating function, and $\mathrm{gate}_z=\sigma(W_2[q_t,\tilde{s}_z]+b_2)$; $\sigma$ represents the sigmoid activation function; $W_2$ represents a second learnable parameter; $b_2$ represents a second offset; $q_t$ represents the sentence-level question feature; and $\tilde{s}_z$ represents the attention feature.
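A minimal sketch of this sentence-level question-guided attention and gating is given below. It assumes the attention feature $\tilde{s}_z$ is the weighted sum of the sentence features under the coefficients $\alpha_r^z$, which the formulas above imply but do not state explicitly; all module names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceLevelGuidedAttention(nn.Module):
    """Question-guided attention over retrieved sentences followed by a sigmoid gate,
    following alpha = softmax(W1(f_q(q_t) * f_p(z_r)) + b1) and
    gate_z = sigmoid(W2[q_t, s~_z] + b2)."""

    def __init__(self, dim: int):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # f_q^z
        self.f_p = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # f_p^z
        self.w1 = nn.Linear(dim, 1)                               # W1, b1
        self.w2 = nn.Linear(2 * dim, 2 * dim)                     # W2, b2

    def forward(self, q_t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # q_t: (batch, dim) sentence-level question feature.
        # z:   (batch, R, dim) sentence features z_r of the retrieved knowledge.
        scores = self.w1(self.f_q(q_t).unsqueeze(1) * self.f_p(z))  # (batch, R, 1)
        alpha = F.softmax(scores, dim=1)                            # weight coefficients
        s_tilde = (alpha * z).sum(dim=1)                            # attention feature s~_z
        concat = torch.cat([q_t, s_tilde], dim=-1)                  # [q_t, s~_z]
        gate = torch.sigmoid(self.w2(concat))                       # gate_z
        return gate * concat                                        # Z~, sentence-level representation

Z_tilde = SentenceLevelGuidedAttention(dim=512)(torch.randn(2, 512), torch.randn(2, 10, 512))
```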
A word-level sequential representation of the potential knowledge is obtained by calculating a dot-product of an attention feature and the sigmoid activation function based on a word-level question feature of the new question.
The first attention result is obtained based on the attention feature and the word-level sequential representation of the potential knowledge.
Optionally, the step of obtaining a word-level sequential representation of the potential knowledge by calculating a dot-product between an attention feature and the sigmoid activation function based on a word-level question feature of the new question is implemented in the following manner:
$e^w=\mathrm{gate}_z\circ[u_q,\tilde{w}_z]$
where $e^w$ represents the word-level sequential representation of the potential knowledge; $u_q$ represents the word-level question feature; $\tilde{w}_z$ represents a word-level attention feature obtained by weighting the word-level knowledge features with an attention weight coefficient derived from $t_{r,j}^w$, where $t_{r,j}^w=f_q^w(u_q)^T f_p^w(u_{r,j}^Z)$; $f_q^w$ and $f_p^w$ represent the nonlinear transformation layers; $T$ represents a matrix transpose operation; $u_{r,j}^Z$ represents a word-level feature of the text sequence knowledge; $w_{r,j}^Z$ represents a word embedding feature of the text feature sequence; $j$ represents a word index; and $r$ represents the quantity of rounds of dialogue.
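A corresponding sketch for the word-level branch is given below; the softmax normalization of the scores $t_{r,j}^w$ and the use of a pooled word-level question feature are assumptions, since the aggregation formulas are not reproduced above, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelGuidedAttention(nn.Module):
    """Word-level attention over the text sequence knowledge guided by the
    word-level question feature, using t = f_q(u_q)^T f_p(u^Z) as the score."""

    def __init__(self, dim: int):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # f_q^w
        self.f_p = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # f_p^w
        self.gate = nn.Linear(2 * dim, 2 * dim)                   # gating parameters

    def forward(self, u_q: torch.Tensor, u_z: torch.Tensor) -> torch.Tensor:
        # u_q: (batch, dim) pooled word-level question feature.
        # u_z: (batch, K, dim) word-level features of the retrieved knowledge sentences.
        t = torch.einsum('bd,bkd->bk', self.f_q(u_q), self.f_p(u_z))  # scores t_{r,j}
        beta = F.softmax(t, dim=-1).unsqueeze(-1)                     # attention weights
        w_tilde = (beta * u_z).sum(dim=1)                             # word-level attention feature w~_z
        concat = torch.cat([u_q, w_tilde], dim=-1)                    # [u_q, w~_z]
        return torch.sigmoid(self.gate(concat)) * concat              # e^w

e_w = WordLevelGuidedAttention(512)(torch.randn(2, 512), torch.randn(2, 120, 512))
```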
A second attention result is obtained by using a question-guided second attention module based on the text sequence knowledge.
A method for obtaining the second attention result is similar to that for obtaining the first attention result, and details are not described herein again.
The first attention result and the second attention result are cascaded to obtain a cascading result.
A third attention result is obtained by using a knowledge-guided attention module based on the visual feature sequence and the second attention result.
After obtaining the text sequence knowledge and graph knowledge related to the question, the original visual features extracted by the Faster R-CNN still need to be further aligned. All these features should be integrated in a reasonable way so that a correct answer can be obtained through decoding. In the present disclosure, a knowledge-guided attention mechanism is used to complete semantic alignment between the image and the potential text knowledge. In addition, an attention-based fusion mechanism is also applied to effectively integrate the various features.
Given the word-level potential knowledge feature $e^w$, the knowledge-guided attention mechanism is used to perform a calculation on the word-level potential knowledge feature and the visual feature $V$ to align cross-modal semantics. Specifically, the visual feature $V$ is queried by calculating a dot-product of an attention feature and a multi-layer perceptron (MLP) based on the potential knowledge feature $e^w$ to generate the object-level feature $\tilde{V}_o$ most relevant to the knowledge. A specific formula is as follows:
where $f_q^v$ and $f_p^v$ represent the nonlinear transformation layers, which are used to embed representations from different modalities into the same embedding space; $r_{i,j}^v$ represents a correlation coefficient matrix; $e_i^w$ represents a potential text knowledge feature output by the text sequence knowledge module; $V_j$ represents an object feature in the input image; and $n_q$ represents a maximum quantity of words in the text sequence.
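For illustration, the following sketch implements one plausible form of this knowledge-guided attention; the softmax over objects and the mean pooling over knowledge words are assumptions, since the specific formula is not reproduced here, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeGuidedAttention(nn.Module):
    """Aligns visual object features with word-level potential knowledge:
    correlation r_{i,j} = f_q(e_i^w)^T f_p(V_j), then knowledge-attended objects."""

    def __init__(self, dim: int):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # f_q^v, text side
        self.f_p = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # f_p^v, visual side

    def forward(self, e_w: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # e_w: (batch, n_q, dim) word-level potential knowledge features.
        # v:   (batch, n, dim) object-level visual features from Faster R-CNN.
        r = torch.einsum('bqd,bnd->bqn', self.f_q(e_w), self.f_p(v))  # correlation matrix
        attn = F.softmax(r, dim=-1)                                   # attend over objects
        v_knowledge = torch.einsum('bqn,bnd->bqd', attn, v)           # knowledge-aligned objects
        return v_knowledge.mean(dim=1)                                # object-level feature V~_o

v_o = KnowledgeGuidedAttention(512)(torch.randn(2, 20, 512), torch.randn(2, 36, 512))
```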
In existing methods, all features are directly concatenated to form a joint representation to infer an answer, which often leads to inefficient interaction. The present disclosure adopts a soft attention mechanism to generate the graph knowledge feature $\tilde{V}_o$. In this way, attention can be paid to the question, the historical dialogue, the text knowledge, and other valid information. Finally, all the features are further fused and input to an answer decoder to infer a final answer.
Graph convolution is performed on the sparse scene graph to obtain a graph convolution result.
Specifically, given the object feature $V_i$, neighborhood features $V_j$ ($0\le j\le m$, where $m$ represents a quantity of neighbor nodes of the object $V_i$), and their relationship features $rel_{i\to j}$, the module first fuses the neighborhood features and the relationship features as the relationship content of the object $V_i$. The formula is as follows:
$\alpha_{ij}=\mathrm{softmax}(W_3[V_j, rel_{i\to j}]+b_3)$
Then, a gate mechanism is used to fuse the original object feature $V_i$ and its relationship content $\tilde{v}_i$, as follows:
$\mathrm{gate}_i^v=\sigma(W_4[V_i,\tilde{v}_i]+b_4)$
$\tilde{v}_i^g=W_5(\mathrm{gate}_i^v\circ[V_i,\tilde{v}_i])+b_5$
where $\tilde{v}_i^g$ represents an updated scene graph representation; $\alpha_{ij}$ represents the weight coefficient; $W_1$ to $W_5$ and $b_1$ to $b_5$ represent learnable parameters and offsets with different values; and $\mathrm{gate}_i^v$ represents the gating function.
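The following sketch illustrates one gated graph-convolution step consistent with the formulas above; the projection used to aggregate $[V_j, rel_{i\to j}]$ into the relationship content $\tilde{v}_i$ is an assumption, since that intermediate formula is not reproduced here, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConvolution(nn.Module):
    """One gated graph-convolution step over the sparse scene graph, following
    alpha_ij = softmax(W3[V_j, rel_ij] + b3), gate_i = sigmoid(W4[V_i, v~_i] + b4),
    v~_i^g = W5(gate_i * [V_i, v~_i]) + b5."""

    def __init__(self, dim: int):
        super().__init__()
        self.w3 = nn.Linear(2 * dim, 1)         # W3, b3: scores neighbor-relation pairs
        self.proj = nn.Linear(2 * dim, dim)     # projects [V_j, rel] to the object dimension (assumed)
        self.w4 = nn.Linear(2 * dim, 2 * dim)   # W4, b4: gate parameters
        self.w5 = nn.Linear(2 * dim, dim)       # W5, b5: output projection

    def forward(self, v_i: torch.Tensor, v_neighbors: torch.Tensor,
                rel: torch.Tensor) -> torch.Tensor:
        # v_i: (dim,) center object; v_neighbors, rel: (m, dim) neighbors and relations.
        pair = torch.cat([v_neighbors, rel], dim=-1)   # [V_j, rel_{i->j}], (m, 2*dim)
        alpha = F.softmax(self.w3(pair), dim=0)        # alpha_ij, (m, 1)
        v_rel = (alpha * self.proj(pair)).sum(dim=0)   # relationship content v~_i, (dim,)
        concat = torch.cat([v_i, v_rel], dim=-1)       # [V_i, v~_i], (2*dim,)
        gate = torch.sigmoid(self.w4(concat))          # gate_i^v
        return self.w5(gate * concat)                  # updated node representation v~_i^g

gcn = GatedGraphConvolution(512)
v_updated = gcn(torch.randn(512), torch.randn(4, 512), torch.randn(4, 512))
```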
The data fusion result is obtained by using an attention-based fusion module based on the cascading result, the third attention result, and the graph convolution result.
S7: A correct answer to the new question is obtained by using a decoder based on the data fusion result.
The 100 candidate answers are represented as $a_t=\{a_1, a_2, \ldots, a_{100}\}$. A discriminative decoder performs a dot-product operation between $a_t$ and an output feature of the encoder. Then a softmax operation is performed to generate a categorical distribution over the candidate answers. Finally, optimization is performed by minimizing a multi-class cross-entropy loss function between a one-hot encoded label vector $y$ and the categorical distribution $p$. A specific formula is as follows:
Multi-task learning: The visual dialogue model also includes a decoder called a generative decoder. In this decoder, an output of the encoder is fed to an LSTM network to predict an answer sequence. The decoder is optimized by minimizing the negative log-likelihood of the real answer label. The formula is as follows:
The multi-task learning combines the discriminative and generative decoders. During training, the loss functions of the discriminative and generative decoders are added, namely:
$L_M=L_D+L_G$
The answer decoder used in the present disclosure adopts the above multi-task learning mode.
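For illustration, the following sketch combines a discriminative cross-entropy loss and a generative negative log-likelihood loss as $L_M = L_D + L_G$; the tensor shapes and the padding index are assumptions, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def multitask_loss(disc_scores: torch.Tensor, answer_label: torch.Tensor,
                   gen_logits: torch.Tensor, answer_tokens: torch.Tensor,
                   pad_id: int = 0) -> torch.Tensor:
    """Combines the discriminative loss L_D with the generative loss L_G as L_M = L_D + L_G.

    disc_scores:   (batch, 100) dot products between encoder output and candidate features.
    answer_label:  (batch,) index of the ground-truth candidate answer.
    gen_logits:    (batch, T, vocab) LSTM outputs of the generative decoder.
    answer_tokens: (batch, T) token indices of the real answer.
    """
    # L_D: softmax + multi-class cross entropy over the 100 candidates.
    l_d = F.cross_entropy(disc_scores, answer_label)
    # L_G: negative log-likelihood of the real answer sequence, ignoring padding.
    l_g = F.cross_entropy(gen_logits.flatten(0, 1), answer_tokens.flatten(),
                          ignore_index=pad_id)
    return l_d + l_g

loss = multitask_loss(torch.randn(2, 100), torch.tensor([3, 57]),
                      torch.randn(2, 8, 5000), torch.randint(1, 5000, (2, 8)))
```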
The present disclosure further provides a visual dialogue system. As shown in
Optionally, the data fusion subsystem includes:
Optionally, the potential knowledge searcher includes an aggregation operation unit, a Bi-LSTM unit, and a similarity calculation unit. The aggregation operation unit is configured to obtain the visual feature sequence, the Bi-LSTM unit is configured to obtain text data in the text corpus, and the similarity calculation unit is configured to calculate a similarity between the text data and the visual data to obtain the text sequence knowledge.
The foregoing are merely descriptions of preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modifications, equivalent replacements, and improvements made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.