 
                 Patent Application
                     20240420392
                    This application claims the benefit of Korean Patent Application No. 10-2023-0075718, filed Jun. 13, 2023, which is hereby incorporated by reference in its entirety into this application.
The disclosed embodiment relates to scene graph generation for understanding of various types of visual information to which computer vision technology in an Artificial Intelligence (AI) field is applied.
As a method for extracting information from images/videos, there is scene graph generation technology for inferring the interrelationship between objects.
A scene graph, which is configured with nodes and edges in order to effectively extract information from an image and a video scene, represents various objects and the relationship therebetween using the nodes and the edges, so it may be widely used in AI technology based on various types of visual information.
An object of the disclosed embodiment is to propose an encoder structure that enables more efficient scene graph generation.
A method for scene graph generation according to an embodiment may include extracting a first feature map from an input image, extracting a second feature map that is based on a mask for the shape of an object within a bounding box using the first feature map, generating a third feature map by combining the first feature map and the second feature map, and generating a scene graph by predicting the relationship between objects from the third feature map.
Here, extracting the second feature map may include generating multiple candidate bounding boxes by applying a convolution layer to the first feature map and generating multiple masks for shapes of objects within the multiple candidate bounding boxes using the first feature map.
Here, generating the multiple candidate bounding boxes may include applying a convolution layer of a first group to the first feature map and performing classification of the bounding box using a binary classifier; and predicting the bounding box by applying a convolution layer of a second group to the first feature map.
Here, predicting the bounding box may comprise setting offsets in multiple directions based on the center point of the object and estimating the location and size of the bounding box.
Here, generating the multiple candidate bounding boxes may comprise adjusting confidence of the predicted bounding box based on a confidence score for the classification of the bounding box and centeredness indicating the degree of matching between the center of the predicted bounding box and that of ground truth (GT).
Here, generating the masks may include extracting a region corresponding to the bounding box from the first feature map and warping the region into a feature map having a preset first resolution, acquiring a convolution feature map by applying a convolution layer to a warped feature map acquired as a result of warping, generating a max-pooled feature map and an average-pooled feature map by performing max pooling and average pooling on the convolution feature map, acquiring an attention map by combining the max-pooled feature map and the average-pooled feature map and applying a nonlinear function to the combination of the max-pooled feature map and the average-pooled feature map, acquiring an up-sampling result having a second resolution higher than the first resolution by performing up-sampling on a result of multiplying the attention map and the convolution feature map, and generating the mask by performing binary classification on the up-sampling result.
Here, extracting the first feature map may include extracting multiple feature maps for respective layers in a backbone network, forming a feature pyramid for fusing information of the multiple feature maps for the respective layers by adding the extracted multiple feature maps for the respective layers in reverse order, and extracting the first feature map having multi-resolution for the image using the feature pyramid.
An apparatus for scene graph generation according to an embodiment may include a backbone network for extracting a first feature map from an input image, an encoder for extracting a second feature map that is based on a mask for the shape of an object within a bounding box using the first feature map and generating a third feature map by combining the first feature map and the second feature map, and a decoder for generating a scene graph by predicting the relationship between objects from the third feature map.
Here, the encoder may include a bounding box generation unit for generating multiple candidate bounding boxes by applying a convolution layer to the first feature map and a mask generation unit for generating multiple masks for shapes of objects within the multiple candidate bounding boxes using the first feature map.
Here, the bounding box generation unit may apply a convolution layer of a first group to the first feature map, perform classification of the bounding box using a binary classifier, and predict the bounding box by applying a convolution layer of a second group to the first feature map.
Here, the bounding box generation unit may set offsets in multiple directions based on the center point of the object and estimate the location and size of the bounding box.
Here, the bounding box generation unit may adjust confidence of the predicted bounding box based on a confidence score for the classification of the bounding box and centeredness indicating the degree of matching between the center of the predicted bounding box and that of ground truth (GT).
Here, the mask generation unit may extract a region corresponding to the bounding box from the first feature map, warp the region into a feature map having a preset first resolution, acquire a convolution feature map by applying a convolution layer to a warped feature map acquired as a result of warping, generate a max-pooled feature map and an average-pooled feature map by performing max pooling and average pooling on the convolution feature map, combine the max-pooled feature map and the average-pooled feature map, acquire an attention map by applying a nonlinear function to the combination of the max-pooled feature map and the average-pooled feature map, multiply the attention map and the convolution feature map, acquire an up-sampling result having a second resolution higher than the first resolution by performing up-sampling on a result of multiplication, and generate the mask by performing binary classification on the up-sampling result.
Here, the backbone network may extract multiple feature maps for respective layers, form a feature pyramid for fusing information of the multiple feature maps for the respective layers by adding the extracted multiple feature maps for the respective layers in reverse order, and extract the first feature map having multi-resolution for the image using the feature pyramid.
A method for image encoding according to an embodiment may include generating multiple candidate bounding boxes by applying a convolution layer to a first feature map extracted from an input image, extracting a second feature map that is based on multiple masks for shapes of objects within the multiple candidate bounding boxes using the first feature map, and generating a third feature map by combining the first feature map and the second feature map.
Here, generating the multiple candidate bounding boxes may include applying a convolution layer of a first group to the first feature map and performing classification of a bounding box using a binary classifier; and predicting the bounding box by applying a convolution layer of a second group to the first feature map.
Here, predicting the bounding box may comprise setting offsets in multiple directions based on the center point of an object and estimating the location and size of the bounding box.
Here, generating the multiple candidate bounding boxes may comprise adjusting confidence of the predicted bounding box based on a confidence score for the classification of the bounding box and centeredness indicating the degree of matching between the center of the predicted bounding box and that of ground truth (GT).
Here, extracting the second feature map may include extracting a region corresponding to the bounding box from the first feature map and warping the region into a feature map having a preset first resolution, acquiring a convolution feature map by applying a convolution layer to a warped feature map acquired as a result of warping, generating a max-pooled feature map and an average-pooled feature map by performing max pooling and average pooling on the convolution feature map, acquiring an attention map by combining the max-pooled feature map and the average-pooled feature map and applying a nonlinear function to the combination of the max-pooled feature map and the average-pooled feature map, acquiring an up-sampling result having a second resolution higher than the first resolution by performing up-sampling on a result of multiplying the attention map and the convolution feature map, and generating the mask by performing binary classification on the up-sampling result.
The above and other objects, features, and advantages of the present disclosure will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
    
    
    
    
    
    
    
    
    
    
The advantages and features of the present disclosure and methods of achieving them will be apparent from the following exemplary embodiments to be described in more detail with reference to the accompanying drawings. However, it should be noted that the present disclosure is not limited to the following exemplary embodiments, and may be implemented in various forms. Accordingly, the exemplary embodiments are provided only to disclose the present disclosure and to let those skilled in the art know the category of the present disclosure, and the present disclosure is to be defined based only on the claims. The same reference numerals or the same reference designators denote the same elements throughout the specification.
It will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements are not intended to be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element discussed below could be referred to as a second element without departing from the technical spirit of the present disclosure.
The terms used herein are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless differently defined, all terms used herein, including technical or scientific terms, have the same meanings as terms generally understood by those skilled in the art to which the present disclosure pertains. Terms identical to those defined in generally used dictionaries should be interpreted as having meanings identical to contextual meanings of the related art, and are not to be interpreted as having ideal or excessively formal meanings unless they are definitively defined in the present specification.
  
Referring to 
  
Referring to 
The backbone 1 is a ResNet-based neural network, and extracts features from still images or video scenes input thereto.
The encoder 2 comprises six encoders, and the features input from the backbone 1 pass through the six encoders 20, whereby a feature map is generated.
The transformer-based encoder 2 has the same structure as “Detection With Transformer”, which is commonly used in existing object detection models, and generates estimates using factorization. In this way, only part of the model, rather than the entire model, is improved using an iterative refinement method. Also, positional information of each pixel may be added to the generated feature map along with positional encoding.
The feature map is input to the decoder 3, and a scene graph is generated by predicting a subject, an object, and a predicate through the iterative refinement process.
  
Referring to 
An input image may be, for example, a ranch landscape image, as illustrated in 
The backbone network 110 may extract a first feature map from the input image. That is, the backbone network 110 may receive an image for object detection, that is, the input image, and extract a feature map having multi-resolution for the input image.
Here, various neural networks including a ResNet-based neural network or a VoVNet-based neural network, each being configured with multiple convolution layers, may be used in the backbone network 110. The backbone network 110 may extract feature maps having multiple resolutions or scales through operations, for example, up-sampling or down-sampling.
The feature maps for respective layers, which have different resolutions or scales and which are generated by the backbone network 110, may have different pieces of feature information depending on the layers, and a feature pyramid for fusing the different pieces of feature information is formed, after which the first feature map having multi-resolution may be extracted based on the feature pyramid.
According to an embodiment, the backbone network 110 may extract multiple feature maps for respective layers, form a feature pyramid for fusing information of the multiple feature maps for the respective layers by adding the extracted multiple feature maps for the respective layers in reverse order, and extract a first feature map having multi-resolution for the image using the feature pyramid.
The encoder 120 may extract a second feature map that is based on a mask for the shape of an object within a bounding box using the first feature map and generate a third feature map by combining the first feature map and the second feature map.
The decoder 130 predicts the relationship between objects from the third feature map output from the encoder 120, thereby generating a scene graph. That is, the decoder 130 may generate the scene graph by predicting a subject, an object, and a predicate through an iterative refinement process.
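As a minimal sketch of the output described above, the following Python snippet assembles a scene graph (nodes and labeled edges) from subject-predicate-object triplets. The `Triplet` type, the `build_scene_graph` helper, and the example triplets are hypothetical and serve only as illustration; they are not taken from the disclosed decoder.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """One predicted relationship (hypothetical container for illustration)."""
    subject: str
    predicate: str
    obj: str


def build_scene_graph(triplets):
    """Collect nodes and labeled directed edges from the given triplets."""
    nodes = set()
    edges = defaultdict(list)  # subject -> list of (predicate, object)
    for t in triplets:
        nodes.add(t.subject)
        nodes.add(t.obj)
        edges[t.subject].append((t.predicate, t.obj))
    return sorted(nodes), dict(edges)


# Illustrative triplets only; they do not come from the source.
example = [
    Triplet("horse", "standing on", "grass"),
    Triplet("person", "riding", "horse"),
]
nodes, edges = build_scene_graph(example)
```

Here each object becomes a node and each predicate becomes a directed edge from the subject to the object.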
  
Referring to 
Accordingly, the first feature map having multi-resolution is extracted using the feature pyramid 13, and may then be provided to the encoder 120.
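The reverse-order addition that forms the feature pyramid may be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: all levels are assumed to share one channel count, each level is assumed to have half the resolution of the previous one, and nearest-neighbour up-sampling is assumed. The names `upsample2x` and `top_down_pyramid` are hypothetical.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def top_down_pyramid(features):
    """Fuse feature maps starting from the deepest (coarsest) layer.

    `features` is ordered shallow -> deep. Equal channel counts are an
    assumption; a real network would typically align channels first.
    """
    fused = [features[-1]]
    for feat in reversed(features[:-1]):
        # Add the up-sampled coarser map to the next finer map.
        fused.append(feat + upsample2x(fused[-1]))
    return fused[::-1]  # restore shallow -> deep order


c3 = np.ones((4, 8, 8))
c4 = np.ones((4, 4, 4))
c5 = np.ones((4, 2, 2))
p3, p4, p5 = top_down_pyramid([c3, c4, c5])
```

With all-ones inputs, each finer level accumulates the information of every coarser level, which is the fusion effect the pyramid is meant to provide.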
However, the present disclosure is not limited to the method described with reference to 
  
Referring to 
The bounding box generation unit 121 may generate multiple candidate bounding boxes by applying a convolution layer to the first feature map.
Here, the bounding box generation unit 121 may apply the convolution layer of a first group to the first feature map and perform classification of the bounding boxes using a binary classifier.
Here, the convolution layer of the first group may include multiple convolution layers, receive the first feature map extracted by the backbone network 110, and provide the output thereof as the input for the task of performing classification of bounding boxes.
Also, the bounding box generation unit 121 may predict a bounding box by applying the convolution layer of a second group to the first feature map.
Here, the convolution layer of the second group may include multiple convolution layers, receive the first feature map extracted by the backbone network 110, and provide the output thereof as the input for the task of performing prediction of a bounding box.
Here, according to an embodiment, the bounding box generation unit 121 may set offsets in multiple directions based on the center point of an object and estimate the location and size of a bounding box (centerness).
Here, according to an embodiment, the bounding box generation unit 121 may adjust the confidence of the predicted bounding box based on a confidence score for the classification of the bounding box and centeredness indicating the degree of matching between the center of the predicted bounding box and that of ground truth (GT).
Here, according to an embodiment, the mask generation unit 122 may extract a region corresponding to the bounding box from the feature map and warp the same into a feature map having a preset resolution, acquire a convolution feature map by applying a convolution layer to a warped feature map acquired as the result of warping, perform max pooling and average pooling on the convolution feature map, and combine the max-pooled feature map and the average-pooled feature map.
Also, the mask generation unit 122 acquires an attention map by applying a nonlinear function to the combination of the max-pooled feature map and the average-pooled feature map, multiplies the attention map and the convolution feature map, and performs binary classification for the multiplication result, thereby generating a mask.
According to an embodiment, object detection and segmentation based on points are performed without using a predefined anchor box requiring a high computational load and high memory usage, whereby efficiency may be achieved in terms of the computational load and memory usage. Furthermore, real-time object detection and segmentation may be realized in various fields including robots, drones, autonomous vehicles, and the like based on a platform having low computing power (e.g., an embedded platform).
  
Referring to 
When the convolution layer CG1 of the first group, among the convolution layers CG1 and CG2 of the multiple groups, is applied to the input feature 21, the result may be provided as the input of the task for performing classification 23 of a bounding box. The task for performing classification 23 of a bounding box may classify a bounding box from the input feature 21 using, for example, a binary classifier.
Meanwhile, when the convolution layer CG2 of the second group, among the convolution layers CG1 and CG2 of the multiple groups, is applied to the input feature 21, the result may be provided as the input of the task for performing prediction 25 of a bounding box. The task for performing prediction 25 of a bounding box may set offsets in multiple directions based on the center point of an object and estimate the location and size of a bounding box.
For example, the task for performing prediction 25 of a bounding box may set offsets in four directions, including top (T), bottom (B), left (L), and right (R) directions, based on the center point of an object and estimate the location and size of the bounding box classified to surround the object.
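The four-direction offsets can be turned into a box location and size as in the following sketch. It assumes image coordinates with the y axis pointing downward and non-negative offsets; the function name `decode_box` is hypothetical.

```python
def decode_box(cx, cy, top, bottom, left, right):
    """Convert a point and its T/B/L/R offsets into a box and its size.

    Returns ((x_min, y_min, x_max, y_max), (width, height)).
    Assumes the y axis points downward, as in image coordinates.
    """
    box = (cx - left, cy - top, cx + right, cy + bottom)
    size = (left + right, top + bottom)
    return box, size


box, size = decode_box(50.0, 40.0, top=10.0, bottom=20.0, left=15.0, right=5.0)
```

Because the box is derived directly from a point and its offsets, no predefined anchor box is needed.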
Also, the task for performing prediction 25 of a bounding box may adjust the confidence of the predicted bounding box based on the confidence score for the classification of the bounding box and the centeredness 27, and the centeredness 27 may indicate the degree of matching between the center of the predicted bounding box and that of GT.
In this way, the bounding box generation unit 121 may predict or determine the bounding box for performing object detection based on a point (that is, a center point) without using a predefined anchor box.
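The source does not give a formula for the centeredness or for the confidence adjustment. The sketch below therefore assumes the centerness definition commonly used in anchor-free detectors (the square root of the product of the left/right and top/bottom offset ratios) and assumes the adjusted confidence is the product of the classification score and the centeredness. Both function names are hypothetical, and all offsets are assumed to be strictly positive.

```python
import math


def centeredness(top, bottom, left, right):
    """Assumed centerness: 1.0 at the exact center, smaller near the edges."""
    lr = min(left, right) / max(left, right)
    tb = min(top, bottom) / max(top, bottom)
    return math.sqrt(lr * tb)


def adjusted_confidence(cls_score, top, bottom, left, right):
    """Assumed adjustment: scale the classification score by the centeredness."""
    return cls_score * centeredness(top, bottom, left, right)


centered = adjusted_confidence(0.8, top=10, bottom=10, left=5, right=5)
off_center = adjusted_confidence(0.8, top=1, bottom=4, left=1, right=4)
```

Under these assumptions, a box predicted from a point far from the object center receives a lower confidence than one predicted from the center, even when both have the same classification score.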
However, the present disclosure is not limited to the method described with reference to 
  
Referring to 
Here, using the feature map extracted by the backbone network 110, the mask generation unit 122 may generate a mask 38 for the shape of an object within the bounding box predicted by the bounding box generation unit 121.
To this end, the mask generation unit 122 may extract a region corresponding to the bounding box from the feature map and warp the same into a feature map having a preset resolution, for example, the resolution of 14×14. The mask generation unit 122 may acquire a convolution feature map 32 by applying a convolution layer to the warped feature map 31 and generate a max-pooled feature map 33a and an average-pooled feature map 33b by performing max pooling and average pooling on the convolution feature map 32.
Subsequently, the mask generation unit 122 may generate a combined feature map 34 by combining the max-pooled feature map 33a and the average-pooled feature map 33b and acquire an attention map 35 by applying a nonlinear function, e.g., a sigmoid function, to the combined feature map 34.
Subsequently, the mask generation unit 122 multiplies the attention map 35 and the convolution feature map 32, acquires an up-sampling result 37, e.g., an up-sampling result having the resolution of 28×28, by performing up-sampling on the multiplication result 36, and performs binary classification on the up-sampling result, thereby generating a mask 38.
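The attention-guided mask steps described above can be sketched with NumPy as follows. Several details are assumptions because the source does not specify them: pooling is taken along the channel axis, the two pooled maps are combined by addition, up-sampling is nearest-neighbour, and the binary classification is a per-pixel linear projection followed by a sigmoid and a 0.5 threshold. The warping and the convolution layer are omitted, so the input stands in for the convolution feature map 32; `mask_from_roi` and `weights` are hypothetical names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mask_from_roi(conv_feat, weights, threshold=0.5):
    """Produce a binary mask from one region's convolution feature map.

    conv_feat: (C, 14, 14) array standing in for the convolution feature map.
    weights:   (C,) hypothetical per-channel weights of a pixel-wise classifier.
    """
    max_pooled = conv_feat.max(axis=0, keepdims=True)    # (1, 14, 14)
    avg_pooled = conv_feat.mean(axis=0, keepdims=True)   # (1, 14, 14)
    combined = max_pooled + avg_pooled                   # assumed combination
    attention = sigmoid(combined)                        # attention map
    attended = attention * conv_feat                     # broadcast over channels
    upsampled = attended.repeat(2, axis=1).repeat(2, axis=2)  # (C, 28, 28)
    logits = np.tensordot(weights, upsampled, axes=([0], [0]))  # (28, 28)
    return sigmoid(logits) > threshold                   # binary classification


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 14, 14))
mask = mask_from_roi(feat, rng.standard_normal(8))
```

The resulting mask has the second resolution (28×28), twice the first resolution (14×14) of the warped region.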
However, the present disclosure is not limited to the method described with reference to 
  
Referring to 
Here, extracting the second feature map may include generating multiple candidate bounding boxes by applying a convolution layer to the first feature map at step S320 and generating multiple masks for the shapes of objects within the multiple candidate bounding boxes using the first feature map at step S330.
Here, generating the multiple candidate bounding boxes at step S320 may include applying the convolution layer of a first group to the first feature map and performing classification of a bounding box using a binary classifier; and predicting the bounding box by applying the convolution layer of a second group to the first feature map.
Here, predicting the bounding box may comprise setting offsets in multiple directions based on the center point of the object and estimating the location and size of the bounding box.
Here, generating the multiple candidate bounding boxes may comprise adjusting the confidence of the predicted bounding box based on a confidence score for the classification of the bounding box and centeredness indicating the degree of matching between the center of the predicted bounding box and that of ground truth (GT).
Here, generating the masks at step S330 may include extracting a region corresponding to the bounding box from the first feature map and warping the same into a feature map having a preset first resolution, acquiring a convolution feature map by applying a convolution layer to a warped feature map acquired as the result of warping, generating a max-pooled feature map and an average-pooled feature map by performing max pooling and average pooling on the convolution feature map, acquiring an attention map by combining the max-pooled feature map and the average-pooled feature map and applying a nonlinear function to the combination of the max-pooled feature map and the average-pooled feature map, acquiring an up-sampling result having a second resolution higher than the first resolution by multiplying the attention map and the convolution feature map and performing up-sampling on the multiplication result, and generating the mask by performing binary classification on the up-sampling result.
Here, extracting the first feature map at step S310 may include extracting multiple feature maps for respective layers from a backbone network, forming a feature pyramid for fusing information of the multiple feature maps for the respective layers by adding the extracted multiple feature maps for the respective layers in reverse order, and extracting the first feature map having multi-resolution for the image using the feature pyramid.
Hereinbelow, scene graph generation performance according to an embodiment will be described.
Table 1 and Table 2 show a result of comparing the performance of three models based on recall (Recall@20/50/100), mean recall (Mean Recall@20/50/100), and average precision (AP@50/75).
Here, the recall measures the capability of finding True Positives (TP) among all ground-truth instances (TP+FN), and may be defined as shown in Equation (1) below:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (1)$$

In Equation (1), a True Positive (TP) means that a model predicts a label and the label correctly matches the ground truth, and a False Negative (FN) means that a model does not predict a label although the label is part of the ground truth.
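The recall described above (true positives divided by the sum of true positives and false negatives) may be computed as in the following sketch. The `recall_at_k` helper illustrates one common reading of Recall@K, in which only the K highest-ranked predicted triplets are counted; that reading, the function names, and the example triplets are assumptions, not details from the source.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN); returns 0.0 when there are no ground-truth items."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    truth = set(ground_truth)
    tp = len(top_k & truth)
    fn = len(truth) - tp
    return recall(tp, fn)


# Illustrative triplets only; they do not come from the source.
predictions = [("a", "on", "b"), ("c", "near", "d"), ("e", "has", "f")]
truth = [("a", "on", "b"), ("e", "has", "f")]
r_at_2 = recall_at_k(predictions, truth, 2)
r_at_3 = recall_at_k(predictions, truth, 3)
```

In this example, raising K from 2 to 3 admits the second ground-truth triplet, so the recall rises from 0.5 to 1.0.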
The average precision may be calculated as a weighted average of precisions, which are the rates of correct detection among all detection results at respective threshold values.
When the recall and mean recall are compared between the models, model B, which changes only the backbone, exhibits lower performance than the existing model (model A): its recall and mean recall decrease by 8.32/9.84/10.75 and 4.05/4.77/4.24, respectively. This indicates that model B does not show biased performance improvement in head class predicate results, compared to model A.
Model C, proposed in the present disclosure, changes both the backbone and the encoder, and its recall and mean recall increase by 13.36/13.40/13.45 and 4.61/4.75/3.60, respectively, compared to model B. In the case of AP, both model B and model C show improved performance compared to model A. Unlike its recall, the AP of model B increases by 0.81/0.99/0.85, and the AP of model C increases by 9.60/14.57/10.00 and 8.79/13.58/9.14 compared to model A and model B, respectively. Consequently, when the proposed method is applied to both the backbone and the encoder of the existing scene graph research model, rather than changing only the backbone, a greater performance improvement is exhibited in the three quantitative metrics.
  
The apparatus for scene graph generation according to an embodiment may be implemented in a computer system 1000 including a computer-readable recording medium.
The computer system 1000 may include one or more processors 1010, memory 1030, a user-interface input device 1040, a user-interface output device 1050, and storage 1060, which communicate with each other via a bus 1020. Also, the computer system 1000 may further include a network interface 1070 connected with a network 1080. The processor 1010 may be a central processing unit or a semiconductor device for executing a program or processing instructions stored in the memory 1030 or the storage 1060. The memory 1030 and the storage 1060 may be storage media including at least one of a volatile medium, a nonvolatile medium, a detachable medium, a non-detachable medium, a communication medium, or an information delivery medium, or a combination thereof. For example, the memory 1030 may include ROM 1031 or RAM 1032.
According to the disclosed embodiment, an encoder structure enabling efficient scene graph generation is proposed, whereby effective inference and understanding of the interrelationship between objects in scene information at a more detailed level may be realized. Accordingly, relationship information in scenes may be used in a smart city, smart manufacturing, and a smart environment as an application domain in which various intelligent agents are widely used.
Although embodiments of the present disclosure have been described with reference to the accompanying drawings, those skilled in the art will appreciate that the present disclosure may be practiced in other specific forms without changing the technical spirit or essential features of the present disclosure. Therefore, the embodiments described above are illustrative in all aspects and should not be understood as limiting the present disclosure.
| Number | Date | Country | Kind | 
|---|---|---|---|
| 10-2023-0075718 | Jun 2023 | KR | national |