This document generally relates to image search, and more particularly to text-to-image searches using neural networks.
An image retrieval system is a computer system for searching and retrieving images from a large database of digital images. The rapid increase in the number of photos taken by smart devices has incentivized further development of text-to-photo retrieval techniques to efficiently find a desired image among a massive number of photos.
Disclosed are devices and methods for performing text-to-image searches. The disclosed techniques can be applied in various embodiments, such as mobile devices or cloud-based photo album services.
In one example aspect, a method for training an image search system is disclosed. The method includes obtaining classified features of an image using a neural network; determining, based on the classified features, local information that indicates a correlation between the classified features; and determining, based on the classified features, global information that indicates a correspondence between the classified features and one or more semantic categories. The method also includes deriving, based on a target semantic representation associated with the image, a semantic representation of the image by combining the local information and the global information.
In another example aspect, a method for performing an image search is disclosed. The method includes receiving a textual search term from a user, determining a first semantic representation of the textual search term, and determining differences between the first semantic representation and multiple semantic representations that correspond to multiple images. Each of the multiple semantic representations is determined based on combining local and global information of a corresponding image. The local information indicates a correlation between features of the corresponding image, and the global information indicates a correspondence between the features of the corresponding image and one or more semantic categories. The method also includes retrieving one or more images as search results in response to the textual search term based on the determined differences.
In another example aspect, a mobile device includes a processor, a memory including processor executable code, and a display. The processor executable code upon execution by the processor configures the processor to implement the described methods. The display, coupled to the processor, is configured to display search results to the user.
These and other features of the disclosed technology are described in the present document.
Smartphones nowadays can capture a large number of photos. The sheer amount of image data poses a challenge to photo album designs as a user may have gigabytes of photos stored on his or her phone and even more on a cloud-based photo album service. It is thus desirable to provide a search function that allows retrieval of the photos based on simple keywords (that is, text-to-image search) instead of forcing the user to scroll back and forth to find a photo showing a particular object or a person. However, unlike existing images on the Internet that provide rich metadata, user-generated photos typically include little or no meta information, making it more difficult to identify and/or categorize objects or people in the photos.
Currently, there are two common approaches to perform text-to-image searches. The first approach is based on learning using deep convolutional neural networks. The output layer of the neural network can have as many units as the number of classes of features in the image. However, as the number of classes grows, the distinction between classes blurs. It thus becomes difficult to obtain sufficient numbers of training images for uncommon target objects, which impacts the accuracy of the search results.
The second approach is based on image classification. The performance of image classification has recently witnessed rapid progress due to the establishment of large-scale hand-labeled datasets. Many efforts have been dedicated to extending deep convolutional networks for single/multi-label image recognition. For image search applications, the search engine directly uses the labels (or the categories), predicted by the trained classifier, as the indexed keywords for each photo. During the search stage, exact keyword matching is performed to retrieve photos having the same label as the user's query. However, this type of search is limited to predefined keywords. For example, users can get related photos using the query term “car” (which is one of the default categories in the photo album system) but may fail to obtain any results using the query term “vehicle,” even though “vehicle” is a synonym of “car.”
Techniques disclosed in this document can be implemented in various image search systems to allow users to search through photos based on semantic correspondence between the textual keywords and the photos, without requiring an exact match of the labels or categories. In this manner, the efficiency and accuracy of the image searches are improved. For example, users can use a variety of search terms, including synonyms or even brand names, to obtain desired search results. The search systems can also achieve higher accuracy by leveraging both local and global information present in the image datasets.
In some embodiments, the search system 100 includes a feature extractor 102 that can extract image features from the input images. The search system 100 also includes an information combiner 104 that combines global and local information in the extracted features and a multi-task learning module 106 to perform multi-label classification and semantic embedding at the same time.
In some embodiments, the feature extractor 102 can be implemented using a Convolutional Neural Network (CNN) that performs image classification on an input dataset. For example, Squeeze-and-Excitation ResNet 152 (SE-ResNet152), a CNN trained for the image classification task on the ImageNet dataset, can be leveraged as the feature extractor of the search system. The feature maps from the last convolutional layer of the CNN are provided as the input for the information combiner 104.
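By way of a non-limiting illustration, the following Python sketch shows how such a backbone can expose the feature maps of its last convolutional layer. A plain torchvision ResNet-152 is used here as a stand-in for SE-ResNet152, and the module name FeatureExtractor and the input resolution are illustrative assumptions rather than part of the disclosed embodiments.

```python
import torch
import torchvision.models as models

class FeatureExtractor(torch.nn.Module):
    """Backbone CNN exposing the feature maps of its last convolutional layer."""
    def __init__(self):
        super().__init__()
        # Stand-in backbone; the described system uses SE-ResNet152 pretrained on ImageNet.
        backbone = models.resnet152(weights=None)  # ImageNet weights would be loaded in practice
        # Drop the global pooling and fully connected layers to keep spatial feature maps.
        self.body = torch.nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images):
        # images: (batch, 3, H, W) -> feature maps: (batch, 2048, H/32, W/32)
        return self.body(images)

extractor = FeatureExtractor().eval()
with torch.no_grad():
    feature_maps = extractor(torch.randn(1, 3, 224, 224))
print(feature_maps.shape)  # torch.Size([1, 2048, 7, 7])
```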
In some embodiments, inputs to the information combiner 104 are split into two streams: one stream for local/spatial information and the other stream for global information.
The local information provides correlation of spatial features within one image. Human visual attention allows us to focus on a certain region of an image while perceiving the surrounding image as background. Similarly, more attention is given to certain groups of words (e.g., verbs and corresponding nouns) while less attention is given to the rest of the words in the sentence (e.g., adverbs and/or prepositions). Attention in deep learning thus can be understood as a vector of importance weights. For example, a Multi-Head Self-Attention (MHSA) module can be used for local information learning. The MHSA module implements a multi-head self-attention operation, which assigns weights to indicate how much attention the current feature pays to the other features and obtains a representation that includes context information via a weighted summation. It is noted that while the MHSA module is provided herein as an example, other attention-based learning mechanisms, such as content-based attention or self-attention, can be adopted for local/spatial learning as well.
In the MHSA module, each point of the feature map can be projected into several Key, Query, and Value sub-spaces (which is referred to as “Multi-Head”). The module can learn the correlation by leveraging the dot product of the Key and Query vectors. The output correlation scores from the dot product of Key and Query are then activated by an activation function (e.g., a Softmax or Sigmoid function). The weighted encoding feature maps are obtained by multiplying the correlation scores with the Value vectors. The feature maps from all the sub-spaces are then concatenated and projected back to the original space as the input of a spatial attention layer. The mathematical equations of MHSA can be defined as follows:

Attention(Q, K, V)=σ(Q×KT)×V Eq. (1)

MHSA=Concat(head1, . . . , headh)×WO, where headi=Attention(Q×WiQ, K×WiK, V×WiV) Eq. (2)

Here, σ is the activation function (e.g., a Softmax or Sigmoid function) and WO is the weight of the back-projection from the multi-head sub-spaces to the original space. Eq. (1) is the definition of attention and Eq. (2) defines the Multi-Head Self-Attention operation.
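The following sketch illustrates one way the MHSA operation of Eqs. (1) and (2) could be applied to CNN feature maps, treating each spatial position as a token. It relies on PyTorch's built-in multi-head attention (which applies a scaled Softmax attention internally), and the channel count and number of heads are assumptions.

```python
import torch
import torch.nn as nn

class LocalMHSA(nn.Module):
    """Multi-head self-attention over the spatial positions of a CNN feature map."""
    def __init__(self, channels=2048, num_heads=8):
        super().__init__()
        # nn.MultiheadAttention projects each token into per-head Query/Key/Value
        # sub-spaces and applies the back-projection W^O after concatenation (Eq. (2)).
        self.mhsa = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads,
                                          batch_first=True)

    def forward(self, feature_maps):
        b, c, h, w = feature_maps.shape
        # Treat each spatial location as one token: (batch, h*w, channels).
        tokens = feature_maps.flatten(2).transpose(1, 2)
        # Correlation scores from Query-Key dot products weight the Value vectors (Eq. (1)).
        attended, _ = self.mhsa(tokens, tokens, tokens)
        return attended  # (batch, h*w, channels)

local = LocalMHSA()
mhsa_tokens = local(torch.randn(1, 2048, 7, 7))
print(mhsa_tokens.shape)  # torch.Size([1, 49, 2048])
```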
The spatial attention layer can enhance the correlation of feature patterns and the corresponding labels. For example, the weighted feature maps from the MHSA layer can be mapped to a score vector using the spatial attention layer. The weighted vectors (e.g., context vectors) thus include both intra-relationships between different objects and inter-relationships between objects and labels. The spatial attention layer can be described as follows:
SPAttention=σ(MHSA×WSP) Eq. (3)
Context=(SPAttention·MHSA) Eq. (4)
Here, σ is the activation function (e.g., Softmax or Sigmoid function) and WSP is the weight of the spatial attention layer. The context vector in Eq. (4) can also be called a weighted encoding attention vector.
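A minimal sketch of the spatial attention layer of Eqs. (3) and (4) is given below. The shape of WSP (one score per spatial position) and the use of a weighted sum over positions are one plausible reading of the equations and are assumptions rather than the only possible implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Scores each spatial position (Eq. (3)) and pools the MHSA features (Eq. (4))."""
    def __init__(self, channels=2048):
        super().__init__()
        self.w_sp = nn.Linear(channels, 1)   # W_SP: one attention score per position
        self.sigma = nn.Softmax(dim=1)       # activation; a Sigmoid could be used instead

    def forward(self, mhsa_tokens):
        # mhsa_tokens: (batch, positions, channels) from the MHSA layer
        scores = self.sigma(self.w_sp(mhsa_tokens))   # Eq. (3): SPAttention
        context = (scores * mhsa_tokens).sum(dim=1)   # Eq. (4): weighted sum over positions
        return context  # (batch, channels) context (weighted encoding attention) vector

spatial = SpatialAttention()
context = spatial(torch.randn(1, 49, 2048))
print(context.shape)  # torch.Size([1, 2048])
```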
For the global information stream, a global pooling layer can be used to process the outputs of the classification neural network (e.g., the last convolutional layer of the CNN). One advantage of the global pooling layer is that it can enforce correspondences between feature maps and categories. Thus, the feature maps can be easily interpreted as category confidence maps. Another advantage is that overfitting can be avoided at this layer. After the pooling operation, a dense layer with a Sigmoid function can be applied to obtain a global information vector. Each element of the vector can thus be viewed as a probability. The global information vector can be defined as:
Global=σ(GP×WGP) Eq. (5)
Here, σ is the Sigmoid function, GP is the output of the global pooling layer, and WGP is the weight of the dense layer.
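The global information stream of Eq. (5) could be sketched as follows. The output dimension of the dense layer is not specified in the text, so it is kept equal to the channel count here, as an assumption, so that the later element-wise product with the context vector lines up.

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Global pooling followed by a dense layer with a Sigmoid activation (Eq. (5))."""
    def __init__(self, channels=2048):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)         # global pooling over each feature map
        self.dense = nn.Linear(channels, channels)  # W_GP (output size kept at `channels`)
        self.sigmoid = nn.Sigmoid()

    def forward(self, feature_maps):
        gp = self.pool(feature_maps).flatten(1)     # (batch, channels): one value per map
        return self.sigmoid(self.dense(gp))         # each element can be read as a probability

global_branch = GlobalBranch()
global_vec = global_branch(torch.randn(1, 2048, 7, 7))
print(global_vec.shape)  # torch.Size([1, 2048])
```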
The global information and the local attention are then combined jointly to improve the accuracy of the learning and subsequent searches. In some embodiments, an element-wise product (e.g., Hadamard Product) can be used to combine the global and local information. The encoded information vector can be represented as:
Encoded=Global⊙Context Eq. (6)
Here, ⊙ is the Hadamard product. The element-wise product is selected because both global information and spatial attention are from the same feature map. Therefore, the local information (e.g., spatial attention score vector) can be treated as a guide to weigh the global information (e.g., the global weighted vector). For instance, when an image includes labels or categories like “scenery”, “grassland” and “mountain,” the probability of having related elements (e.g., “sheep” and/or “cattle”) in the same image may also be high. However, the spatial attention vector can emphasize “grassland” and “mountain” areas so that the global information provides a higher probability for elements that are a combination of “grassland” and “mountain” while decreasing the probability for “sheep” or “cattle” as no relevant objects are shown in the image.
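A minimal sketch of the combination of Eq. (6), assuming the global and context vectors have already been produced with matching dimensions:

```python
import torch

def combine(global_vec, context_vec):
    """Element-wise (Hadamard) product of global and local information (Eq. (6))."""
    return global_vec * context_vec  # both vectors are derived from the same feature map

encoded = combine(torch.sigmoid(torch.randn(1, 2048)), torch.randn(1, 2048))
print(encoded.shape)  # torch.Size([1, 2048])
```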
The combined information vector obtained from the abovementioned steps is then fed to the multi-task learning module 106 as the input of both the classification layer and the semantic embedding layer. The classification layer can output a vector that has the same dimension as the number of categories of the input dataset, which can also be activated by a Sigmoid function. In some embodiments, a weighted Binary Cross-Entropy (BCE) loss function is implemented for the multi-label classification, which can be presented as follows:
Lossc=a·(Y log(Ỹ))+b·((1−Y)log(1−Ỹ)) Eq. (7)

Here, a and b are the weights for positive and negative samples, respectively. Y and Ỹ are the ground truth labels and the predicted labels, respectively.
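The classification layer and the weighted BCE loss of Eq. (7) might be sketched as follows. The number of categories and the weights a and b are illustrative, and the loss is negated in the sketch so that it can be minimized during training (the sign is implicit in Eq. (7)).

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Multi-label classification layer over the combined information vector."""
    def __init__(self, in_dim=2048, num_categories=80):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_categories)

    def forward(self, encoded):
        return torch.sigmoid(self.fc(encoded))  # one probability per category

def weighted_bce_loss(pred, target, a=1.0, b=1.0, eps=1e-7):
    """Weighted binary cross-entropy of Eq. (7); a and b weight positive/negative samples."""
    pred = pred.clamp(eps, 1 - eps)
    loss = a * (target * torch.log(pred)) + b * ((1 - target) * torch.log(1 - pred))
    return -loss.mean()  # negated so that minimizing the loss fits the ground truth labels

head = ClassificationHead()
pred = head(torch.randn(2, 2048))
target = torch.randint(0, 2, (2, 80)).float()
print(weighted_bce_loss(pred, target, a=2.0, b=1.0))
```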
In some embodiments, for semantic embedding, an image can be randomly selected as the target embedding vector to learn the image-sentence pairs. In some embodiments, a Cosine Similarity Embedding Loss function is used for learning the semantic embedding vectors. For example, the target ground truth embedding vectors, i.e., the target vectors, can be obtained from a pretrained Word2Vec model. The Cosine Similarity Embedding Loss function can be described as:

Losse=1−cos(Z, Z̃) if Z and Z̃ are from the same category, and Losse=max(0, cos(Z, Z̃)−margin) otherwise Eq. (8)

Here, Z and Z̃ are the target word embedding vectors and the generated semantic embedding vectors, respectively, and the margin is a value controlling the allowed dissimilarity, which can be set from [−1, 1]. The Cosine Similarity Embedding Loss function tries to force the embedding vectors to approach the target vector if they are from the same category and to push them further from each other if they are from different categories.
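A sketch of the semantic embedding head and this loss is shown below, assuming 300-dimensional Word2Vec target vectors and using PyTorch's built-in cosine embedding loss as a stand-in for the loss described above.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Projects the combined information vector into the Word2Vec semantic space."""
    def __init__(self, in_dim=2048, embed_dim=300):  # 300-d is typical for Word2Vec
        super().__init__()
        self.fc = nn.Linear(in_dim, embed_dim)

    def forward(self, encoded):
        return self.fc(encoded)

# Built-in cosine embedding loss: pulls matching pairs together and pushes
# non-matching pairs apart, with `margin` controlling the allowed dissimilarity.
criterion = nn.CosineEmbeddingLoss(margin=0.2)

head = EmbeddingHead()
generated = head(torch.randn(4, 2048))      # generated semantic embedding vectors
target = torch.randn(4, 300)                # target Word2Vec label embedding vectors
pair_labels = torch.tensor([1, 1, -1, -1])  # 1 = same category, -1 = different category
print(criterion(generated, target, pair_labels))
```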
At the offline training stage, all photos in a user's photo album can be indexed via the visual-semantic embedding techniques described above. For example, when the user captures a new photo, the system first extracts features of the image and then transforms the features into one or more vectors corresponding to the semantic meanings. At search time, when the user provides a text query, the system computes the corresponding vector of the text query and searches for the images having the closest corresponding semantic vectors. Top-ranked photos are then returned as the search results. Thus, given a set of photos in a photo album and a query term, the search system can locate related images in the photo album that have semantic correspondence with the given text term, even when the term does not belong to any pre-defined categories.
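The search stage could be sketched as a simple nearest-neighbor lookup over the indexed semantic vectors. The index contents, vector dimension, and the source of the query vector (e.g., a pretrained Word2Vec lookup) are hypothetical here.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def search(query_vector, indexed_vectors, top_k=5):
    """Rank indexed photos by cosine similarity of their semantic vectors to the query."""
    scores = [(photo_id, cosine_similarity(query_vector, vec))
              for photo_id, vec in indexed_vectors.items()]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]

# Hypothetical index: photo id -> semantic embedding produced offline by the model.
index = {"IMG_001": np.random.rand(300), "IMG_002": np.random.rand(300)}
query = np.random.rand(300)  # e.g., the Word2Vec vector for the query term "vehicle"
print(search(query, index, top_k=2))
```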
Furthermore, using the disclosed techniques, it is possible to obtain fuzzy search results based on semantically related concepts. For example, piggy banks are not directly related to the term “deposit” but offer a similar semantic meaning.
In some embodiments, the method includes splitting the classified features into a number of streams. For example, the classified features are input to two streams of the information combiner module. The local information is determined based on a first stream, i.e., the stream for local/spatial information, and the global information is determined based on a second stream, i.e., the stream for global information. In some embodiments, the local information is determined based on a multi-head self-attention operation. For example, the local information may be determined by performing the multi-head self-attention operation on the classified features. In some embodiments, the local information is represented as one or more weighted vectors indicating the correlation between the classified features. In some embodiments, the global information is determined based on a global pooling operation. For example, the global information may be determined by performing the global pooling operation on the classified features. In some embodiments, the global information is represented as one or more weighted vectors based on results of the global pooling operation. In some embodiments, the local information and the global information are represented as vectors, and the local information and the global information are combined by performing an element-wise product of the vectors. In some embodiments, the element-wise product refers to a Hadamard product.
In some embodiments, deriving the semantic representation of the image includes determining one or more semantic labels that correspond to the one or more semantic categories based on a first loss function. In some embodiments, the first loss function includes a weighted cross entropy loss function. In some embodiments, the semantic representation of the image is derived based on a second loss function that reduces a difference between the semantic representation and the target semantic representation. In some embodiments, the second loss function includes a Cosine similarity function. In some embodiments, a multi-label classification and a semantic embedding are simultaneously performed by using a multi-task learning module.
In some embodiments, the local information of the corresponding image is determined based on classifying the features of the corresponding image using a neural network and performing a multi-head self-attention operation on the features. In some embodiments, the local information is represented as one or more weighted vectors indicating the correlation between the features. In some embodiments, the global information of the corresponding image is determined based on classifying the features of the corresponding image using a neural network and performing a global pooling operation on the features. In some embodiments, the global information is represented as one or more weighted vectors based on results of the global pooling operation. In some embodiments, the local information and the global information are represented as vectors, and the local information and the global information are combined as an element-wise product of the vectors. In some embodiments, the element-wise product refers to a Hadamard product. In some embodiments, determining the differences between the first semantic representation and the multiple semantic representations includes calculating a Cosine similarity between the first semantic representation and each of the multiple semantic representations. The calculated Cosine similarity is taken as the difference. In some embodiments, one or more images with high semantic similarities are selected as the search results in response to the textual search term, and the one or more images are displayed to the user.
In some embodiments, a non-transitory computer-program storage medium is provided. The computer-program storage medium includes code stored thereon. The code, when executed by a processor, causes the processor to implement the described method.
In some embodiments, an image retrieval system includes one or more processors, and a memory including processor executable code. The processor executable code upon execution by at least one of the one or more processors configures the at least one processor to implement the described methods.
The processor(s) 605 may include central processing units (CPUs) to control the overall operation of, for example, the host computer. The processor(s) 605 can also include one or more graphics processing units (GPUs). In certain embodiments, the processor(s) 605 accomplish this by executing software or firmware stored in memory 610. The processor(s) 605 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs), programmable controllers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or the like, or a combination of such devices.
The memory 610 can be or include the main memory of the computer system. The memory 610 represents any suitable form of random access memory (RAM), read-only memory (ROM), flash memory, or the like, or a combination of such devices. In use, the memory 610 may contain, among other things, a set of machine instructions which, upon execution by processor 605, causes the processor 605 to perform operations to implement embodiments of the presently disclosed technology.
Also connected to the processor(s) 605 through the interconnect 625 is an (optional) network adapter 615. The network adapter 615 provides the computer system 600 with the ability to communicate with remote devices, such as the storage clients, and/or other storage servers, and may be, for example, an Ethernet adapter or Fiber Channel adapter.
The disclosed techniques can allow an image search system to better capture multi-object spatial relationships in an image. The combination of the local and global information in the input image can enhance the accuracy of the derived spatial correlation among features and between features and the corresponding semantic categories. As compared to existing techniques that directly use the summation of vectors of all labels (e.g., categories), where the summed vector can potentially lose its original meaning in the semantic space, the disclosed techniques avoid changing the semantic meaning of each label. The learned semantic embedding vectors thereby include both the visual information of images and the semantic meaning of labels.
Implementations of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible and non-transitory computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing unit” or “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
In some embodiments, a mobile device is provided.
In some embodiments, the method includes the operations as follows. A textual search term from a user is received. A first semantic representation of the textual search term is determined. Differences between the first semantic representation and multiple semantic representations that correspond to the images are determined. Based on the determined differences, one or more images are retrieved as search results in response to the textual search term. Each of the multiple semantic representations is determined based on combining local information and global information of a corresponding image, the global information indicates a correspondence between features of the corresponding image and one or more semantic categories, and the local information indicates a correlation between at least two of the features of the corresponding image.
It is intended that the specification, together with the drawings, be considered exemplary only, where exemplary means an example.
While this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.
Only a few implementations and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.
This application is a continuation of International Application No. PCT/CN2020/128459, filed Nov. 12, 2020, which claims priority to U.S. Application No. 62/939,135, filed Nov. 22, 2019, the entire disclosures of which are incorporated herein by reference.