In present practice, a user typically interacts with a text-based query-processing engine by submitting one or more text-based queries. The text-based query-processing engine responds to this submission by identifying a set of websites containing text that matches the query or queries. Alternatively, in a separate search session, a user may interact with an image-based query-processing engine by submitting a single fully-formed input image as a search query. The image-based query-processing engine responds to this submission by identifying one or more candidate images that match the input image. These technologies, however, are not fully satisfactory. A user may have difficulty expressing his or her search intent in words. For different reasons, a user may have difficulty finding an input image that adequately captures his or her search intent. Image-based searches have other limitations. For example, a traditional image-based query-processing engine does not permit a user to customize an image-based query. Nor does it allow the user to revise a prior image-based query.
A computer-implemented technique is described herein for using both text content and image content to retrieve query results. The technique allows a user to more precisely express his or her search intent compared to the case in which a user submits an image or an instance of text by itself. This, in turn, enables the user to quickly and efficiently identify relevant search results.
According to one illustrative aspect, the user submits an input image at the same time as an instance of text. Alternatively, the user submits the input image and text in different respective dialogue turns of a query session.
According to another illustrative aspect, the technique uses a text analysis engine to identify one or more characteristics of the input text, to provide text information. The technique uses an image analysis engine to identify at least one object depicted by the input image, to provide image information. In a text-based retrieval path, the technique combines the text information with the image information to generate a reformulated text query. For instance, the technique performs this task by replacing an ambiguous term in the input text with a term obtained from the image analysis engine, or by appending a term obtained from the image analysis engine to the input text. The technique then submits the reformulated text query to a text-based query-processing engine. In response, the text-based query-processing engine returns the query results to the user.
Alternatively, or in addition, the technique can use insight extracted from the input text to guide the manner in which it processes the input image. For example, in an image-based retrieval path, the image analysis engine can use an image-based retrieval engine to convert the input image into a latent semantic vector, and then use the latent semantic vector in combination with the text information (produced by the text analysis engine) to provide the query results. Those query results correspond to candidate images that resemble the input image and that match attribute information extracted from the text information.
In another implementation, the technique generates query results based on an image submitted by the user together with information provided by some other mode of expression besides text.
The above-summarized technique can be manifested in various types of systems, devices, components, methods, computer-readable storage media, data structures, graphical user interface presentations, articles of manufacture, and so on.
This Summary is provided to introduce a selection of concepts in a simplified form; these concepts are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The same numbers are used throughout the disclosure and figures to reference like components and features. Series 100 numbers refer to features originally found in
This disclosure is organized as follows. Section A describes a computing system for generating and submitting a plural-mode query to a query-processing engine, and, in response, receiving query results from the query-processing engine. Section B sets forth illustrative methods that explain the operation of the computing system of Section A. And Section C describes illustrative computing functionality that can be used to implement any aspect of the features described in Sections A and B.
As a preliminary matter, the term “hardware logic circuitry” corresponds to one or more hardware processors (e.g., CPUs, GPUs, etc.) that execute machine-readable instructions stored in a memory, and/or one or more other hardware logic units (e.g., FPGAs) that perform operations using a task-specific collection of fixed and/or programmable logic gates. Section C provides additional information regarding one implementation of the hardware logic circuitry. In some contexts, each of the terms “component” and “engine” refers to a part of the hardware logic circuitry that performs a particular function.
In one case, the illustrated separation of various parts in the figures into distinct units may reflect the use of corresponding distinct physical and tangible parts in an actual implementation. Alternatively, or in addition, any single part illustrated in the figures may be implemented by plural actual physical parts. Alternatively, or in addition, the depiction of any two or more separate parts in the figures may reflect different functions performed by a single actual physical part.
Other figures describe the concepts in flowchart form. In this form, certain operations are described as constituting distinct blocks performed in a certain order. Such implementations are illustrative and non-limiting. Certain blocks described herein can be grouped together and performed in a single operation, certain blocks can be broken apart into plural component blocks, and certain blocks can be performed in an order that differs from that which is illustrated herein (including a parallel manner of performing the blocks). In one implementation, the blocks shown in the flowcharts that pertain to processing-related functions can be implemented by the hardware logic circuitry described in Section C, which, in turn, can be implemented by one or more hardware processors and/or other logic units that include a task-specific collection of logic gates.
As to terminology, the phrase “configured to” encompasses various physical and tangible mechanisms for performing an identified operation. The mechanisms can be configured to perform an operation using the hardware logic circuitry of Section C. The term “logic” likewise encompasses various physical and tangible mechanisms for performing a task. For instance, each processing-related operation illustrated in the flowcharts corresponds to a logic component for performing that operation. A logic component can perform its operation using the hardware logic circuitry of Section C. When implemented by computing equipment, a logic component represents an electrical element that is a physical part of the computing system, in whatever manner implemented.
Any of the storage resources described herein, or any combination of the storage resources, may be regarded as a computer-readable medium. In many cases, a computer-readable medium represents some form of physical and tangible entity. The term computer-readable medium also encompasses propagated signals, e.g., transmitted or received via a physical conduit and/or air or other wireless medium, etc. However, the specific term “computer-readable storage medium” expressly excludes propagated signals per se, while including all other forms of computer-readable media.
The following explanation may identify one or more features as “optional.” This type of statement is not to be interpreted as an exhaustive indication of features that may be considered optional; that is, other features can be considered as optional, although not explicitly identified in the text. Further, any description of a single entity is not intended to preclude the use of plural such entities; similarly, a description of plural entities is not intended to preclude the use of a single entity. Further, while the description may explain certain features as alternative ways of carrying out identified functions or implementing identified mechanisms, the features can also be combined together in any combination. Finally, the terms “exemplary” or “illustrative” refer to one implementation among potentially many implementations.
A. Illustrative Computing System
More specifically, at time t1, the user submits an image and an instance of text to a query-processing engine as part of the same turn. At times t2 and t3, the user separately submits two respective instances of text, unaccompanied by image content. At time t4, the user submits another image to the query-processing engine, without any text content. At time t5, the user submits another instance of text, unaccompanied by any image content. And at time t6, the user again submits an image and an instance of text as part of the same turn.
The user may provide text content using different input devices. In one case, the user may supply text by typing the text using any type of key input device. In another case, the user may supply text by selecting a phrase in an existing document. In another case, the user may supply text in speech-based form. A speech recognition engine then converts an audio signal captured by a microphone into text. Any subsequent reference to the submission of text is meant to encompass at least the above-identified modes of supplying text. The user may similarly provide image content using different kinds of input devices, described in greater detail below.
A computing system (described below in connection with
In other examples, a user may perform a plural-mode search operation by combining input image content with any other form of expression, not limited to text. For example, a user may perform a plural-mode search operation by supplementing an input image with any type of gesture information captured by a gesture recognition component. For instance, gesture information may indicate that the user is pointing to a particular part of the input image; this conveys the user's focus of interest within the image. In another case, a user may perform a plural-mode search operation by supplementing an input image with gaze information captured by a gaze capture mechanism. The gaze information may indicate that the user is focusing on a particular part of the input image. However, to simplify explanation, this disclosure will mostly present examples in which a user performs a plural-mode search operation by combining image content with text context.
The user interface presentation 204 allows a user to enter information using two or more forms of expression, on the basis of which the computing system performs a plural-mode search operation. In the example of
The above example is merely illustrative. In other cases, the user can interact with the computing system using other kinds of graphical control elements compared to that depicted in
In the specific example of
Although not shown, a user may alternatively input text by using a key input mechanism provided by the computing device 206. Further, the user interface presentation 204 can include a graphical control element 214 that invites the user to select a preexisting image in a corpus of preexisting images, or to select a preexisting image within a web page or other source document or source content item. The user may choose a preexisting image in lieu of, or in addition to, an image captured in real time by the digital camera. In yet another variation (not shown), the user may opt to capture and submit two or more images of an object in a single session turn.
In response to the submitted text content and image content, a query-processing engine returns query results and displays the query results on a user interface presentation 216 provided by an output device. For instance, the computing device 206 may display the query results below the user interface presentation 204. For the illustrative case in which the query-processing engine performs a text-based web search, the query results summarize websites for stores that purport to sell the product illustrated in the image 210. More specifically, in one implementation, the query results include a list of result snippets. Each result snippet can include a text-based and/or an image-based summary of content conveyed by a corresponding website (or other matching document).
In the above examples, the query-processing engine corresponds to a text-based search engine that performs a text-based web search to provide the query results. The text-based web search leverages insight extracted from the input image. Alternatively, or in addition, the query-processing engine corresponds to an image-based retrieval engine that retrieves images based on the user's plural-mode input information. More specifically, in an image-based retrieval mode, the computing system uses insight from the text analysis (performed on an instance of input text) to assist an image-based retrieval engine in retrieving appropriate images.
In general, the user can more precisely describe his or her search intent by using plural modes of expression. For example, in the example of
One technical merit of the above-described solution is that it allows a user to more efficiently convey his or her search intent to the query-processing engine. For example, in some cases, the computing system allows the user to obtain useful query results in fewer session turns, compared to the case of conducting a pure text-based search or a pure image-based search. This advantage provides a good user experience and makes more efficient use of system resources (insofar as the length of a search session has a bearing on the amount of processing, storage, and communication resources that the computing system will consume in providing the query session).
As summarized above, the query-processing engine 504 can encompass different mechanisms for retrieving query results. In a text-based path, the query-processing engine 504 uses a text-based search engine and/or a text-based question-answering (Q&A) engine to provide the query results. In an image-based path, the query-processing engine 504 uses an image-based retrieval engine to provide the query results. In this case, the image results include images that the image-based retrieval engine deems similar to an input image.
An input capture system 506 provides a mechanism by which a user may provide input information for use in performing a plural-mode search operation. The input capture system 506 includes plural input devices, including, but not limited to, a speech input device 508, a text input device 510, an image input device 512, etc. The speech input device 508 corresponds to one or more microphones in conjunction with a speech recognition component. The speech recognition component can use any machine-learned model to convert an audio signal provided by the microphone(s) into text. For instance, the speech recognition component can use a recurrent neural network (RNN) that is composed of long short-term memory (LSTM) units, a hidden Markov model (HMM), etc. The text input device 510 can correspond to a key input device with physical keys, a “soft” keyboard on a touch-sensitive display device, etc. The image input device 512 can correspond to one or more digital cameras for capturing still images, one or more video cameras, one or more depth camera devices, etc.
Although not shown, the input devices can also include mechanisms for inputting information in a form other than text or image. For example, another input device can correspond to a gesture-recognition component that determines when the user has performed a hand or body movement indicative of a telltale gesture. The gesture-recognition component can receive image information from the image input device 512. It can then use any pattern-matching algorithm or machine-learned model(s) to detect telltale gestures based on the image information. For example, the gesture-recognition component can use an RNN to perform this task. Another input device can correspond to a gaze capture mechanism. The gaze capture mechanism may operate by projecting light onto the user's eyes and capturing the glints reflected from the user's eyes. The gaze capture mechanism can determine the directionality of the user's gaze based on detected glints.
However, to simplify the explanation, assume that a user interacts with the input capture system 506 to capture an instance of text and a single image. Further assume that the user enters these two input items in a single turn of a query session. But as pointed out with respect to
The input capture system 506 can also include a user interface (UI) component 514 that provides one or more user interface (UI) presentations through which the user may interact with the above-described input devices. For example, the UI component 514 can provide the kinds of UI presentations shown in
A text analysis engine 516 performs analysis on the input text provided by the input capture system 506, to provide text information. An image analysis engine 518 performs analysis on the input image provided by the input capture system 506, to provide image information.
More specifically, in one implementation, the image analysis engine 518 can use an image classification component 520 to classify the object(s) in the input image with respect to a set of pre-established object categories. It performs this task using one or more machine-trained classification models. Alternatively, or in addition, an image-based retrieval engine 522 can first use an image encoder component to convert the input image into at least one latent semantic vector, referred to below as a query semantic vector. The image-based retrieval engine 522 then uses the query semantic vector to find one or more candidate images that match the input image. More specifically, the image-based retrieval engine 522 can consult an index provided in a data store 524 that provides candidate semantic vectors associated with respective candidate images. The image-based retrieval engine 522 finds those candidate semantic vectors that are closest to the query semantic vector, as measured by any vector-space distance metric (e.g., cosine similarity). Those nearby candidate semantic vectors are associated with matching candidate images.
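By way of a non-limiting illustration only, the vector-matching operation just described might be sketched as follows in Python. The vector dimensionality, the dictionary-based index, and the function names are hypothetical stand-ins for whatever image encoder component and index the image-based retrieval engine 522 actually employs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two semantic vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def find_matching_candidates(query_vec: np.ndarray, index: dict, top_k: int = 5):
    """Return the top_k candidate image IDs whose indexed semantic vectors lie
    closest to the query semantic vector, as measured by cosine similarity.
    `index` is assumed to map candidate_image_id -> candidate semantic vector."""
    scored = [(cand_id, cosine_similarity(query_vec, cand_vec))
              for cand_id, cand_vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Illustrative usage with random stand-in vectors.
rng = np.random.default_rng(0)
index = {f"candidate_{i}": rng.normal(size=128) for i in range(1000)}
query_semantic_vector = rng.normal(size=128)
print(find_matching_candidates(query_semantic_vector, index, top_k=3))
```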
The text information provided by the text analysis engine 516 can include the original text submitted by the user together with analysis results generated by the text analysis engine 516. The text analysis results can include domain information, intent information, slot information, part-of-speech information, parse-tree information, one or more text-based semantic vectors, etc. The image information supplied by the image analysis engine 518 can include the original image content together with object information generated by the image analysis engine 518. The object information generally describes the objects present in the image. For each object, the object information can specifically include: (1) a classification label and/or any other classification information provided by the image classification component 520; (2) bounding box information provided by the image classification component 520 that specifies the location of the object in the input image; (3) any textual information provided by the image-based retrieval engine 522 (that it extracts from the index in the data store 524 upon identifying matching candidate images); (4) an image-based semantic vector for the object (that is produced by the image-based retrieval engine 522), etc. In addition, the image analysis engine 518 can include an optical character recognition (OCR) component (not shown). The image information can include textual information extracted by the OCR component.
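One possible, purely illustrative way to package the text information and the image information enumerated above is sketched below; the field names and types are assumptions made for the sake of illustration and do not reflect any particular implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TextInformation:
    original_text: str
    domain: Optional[str] = None                 # e.g., "shopping"
    intent: Optional[str] = None                 # e.g., "find_store"
    slots: dict = field(default_factory=dict)    # slot name -> slot value
    pos_tags: List[Tuple[str, str]] = field(default_factory=list)
    semantic_vector: Optional[List[float]] = None  # text-based semantic vector

@dataclass
class DetectedObject:
    label: str                                   # classification label
    bounding_box: Tuple[int, int, int, int]      # (x, y, width, height)
    retrieval_keywords: List[str] = field(default_factory=list)  # from the index
    semantic_vector: Optional[List[float]] = None  # image-based semantic vector

@dataclass
class ImageInformation:
    original_image: bytes
    objects: List[DetectedObject] = field(default_factory=list)
    ocr_text: str = ""                           # text extracted by the OCR component
```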
Although not shown, the computing system 502 can incorporate additional analysis engines in the case in which the user supplies an input expression in a form other than text or image. For example, the computing system 502 can include a gesture analysis engine for providing gesture information based on analysis of a gesture performed by the user.
The computing system 502 can provide query results using either a text-based retrieval path or an image-based retrieval path. In the text-based retrieval path, a query expansion component 526 generates a reformulated text query by using the image information generated by the image analysis engine 518 to supplement the text information provided by the text analysis engine 516. The query expansion component 526 can perform this task in different ways, several of which are described below in connection with
In the above example, the information produced by the image analysis engine 518 supplements the text information provided by the text analysis engine 516. In addition, or alternatively, the work performed by the image analysis engine 518 can benefit from the analysis performed by the text analysis engine 516. For example, consider the scenario of
Alternatively, or in addition, the image classification component 520 and/or the image-based retrieval engine 522 can directly leverage text information produced by the text analysis engine 516. Path 530 conveys this possible influence. For example, the image classification component 520 can use a text-based semantic vector as an additional feature (along with image-based features) in classifying the input image. The image-based retrieval engine 522 can use a text-based semantic vector (along with an image-based semantic vector) to find matching candidate images.
After generating the reformulated text query, the computing system 502 submits it to the query-processing engine 504. In one implementation, the query-processing engine 504 corresponds to a text-based search engine 532. In other cases, the query-processing engine 504 corresponds to a text-based question-answering (Q&A) engine 534. In either case, the query-processing engine 504 provides query results to the user in response to the submitted plural-mode query.
The text-based search engine 532 can use any search algorithm to identify candidate documents (such as websites) that match the reformulated query. For example, the text-based search engine 532 can compute a query semantic vector based on the reformulated query, and then use the query semantic vector to find matching candidate documents. It performs this task by finding nearby candidate semantic vectors in an index (in a data store 536). The text-based search engine 532 can assess the relation between two semantic vectors using any distance metric, such as Euclidean distance, cosine similarity, etc. In other words, in one implementation, the text-based search engine 532 can find matching candidate documents in the same manner that the image-based retrieval engine 522 finds matching candidate images.
In one implementation, the Q&A engine 534 can provide a corpus of pre-generated questions and associated answers in a data store 538. The Q&A engine 534 can find the question in the data store that most closely matches the submitted reformulated query. For instance, the Q&A engine 534 can perform this task using the same technique as the search engine 532. The Q&A engine 534 can then deliver the pre-generated answer that is associated with the best-matching question.
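By way of a non-limiting sketch, the question-matching step might resemble the following, under the assumption that some text-encoding function yields comparable vectors for queries and stored questions; the bag-of-words encoder shown here is merely a stand-in for whatever learned encoder is actually used.

```python
import numpy as np

def encode_text(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a learned text encoder: hash words into a vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word.strip("?.,!")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def answer_query(reformulated_query: str, qa_corpus: dict) -> str:
    """Return the pre-generated answer whose stored question best matches the query."""
    query_vec = encode_text(reformulated_query)
    best_question = max(
        qa_corpus,
        key=lambda question: float(np.dot(query_vec, encode_text(question))))
    return qa_corpus[best_question]

qa_corpus = {
    "can i put this dress in the washing machine":
        "Check the care label; most cotton dresses are machine washable.",
    "where can i buy this jacket":
        "The jacket is sold at several online retailers.",
}
print(answer_query("Can I put this dress in the washer?", qa_corpus))
```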
In the image-based retrieval path, the computing system 502 relies on the image analysis engine 518 itself to generate the query results. In other words, in the text-based retrieval path, the image analysis engine 518 serves a support role by generating image information that assists the query expansion component 526 in reformulating the user's input text. But in the image-based retrieval path, the image-based retrieval engine 522 and/or the image classification component 520 provide an output result that represents the final outcome of the plural-mode search operation. That output result may include a set of candidate images provided by the image-based retrieval engine 522 that are deemed similar to the user's input image, and which also match aspects of the user's input text. Alternatively, or in addition, the output result may include classification information provided by the image classification component 520.
The operation of the image-based retrieval engine 522 in connection with the image-based retrieval path will be clarified below in conjunction with the explanation of
In yet another mode, the computing system 502 can generate query results by performing both a text-based search operation and an image-based search operation. In this case, the query results can combine information extracted from the text-based search engine 532 and the image-based retrieval engine 522.
An optional search mode selector 540 determines what search mode should be invoked when the user submits a plural-mode query. That is, the search mode selector 540 determines whether the text-based search path should be used, or the image-based search path, or both the text-based and image-based search paths. The search mode selector 540 can make this decision based on the text information (provided by the text analysis engine 516) and/or the image information (provided by the image analysis engine 518). In one implementation, for instance, the text analysis engine 516 can include an intent determination component that classifies the intent of the user's plural-mode query based on the input text. The search mode selector 540 can choose the image-based search path when the user's input text indicates that he or she wishes to retrieve images that have some relation to the input image (as when the user inputs the text, “Show me this dress, but in blue”). The search mode selector 540 can choose the text-based search path when the user's input indicates that the user has a question about an object depicted in an image (as when the user inputs the text, “Can I put this dress in the washer?”).
In one implementation, the search mode selector 540 performs its function using a set of discrete rules, which can be formulated as a lookup table. For example, the search mode selector 540 can invoke the text-based retrieval path when the user's input text includes a key phrase indicative of his or her attempt to discover where he can buy a product depicted in an image (such as the phrase “Where can I get this,” etc.). The search mode selector 540 can invoke the image-based retrieval path when the user's input text includes a key phrase indicative of the user's desire to retrieve images (such as the phrase “Show me similar items,” etc.). In another example, the search mode selector 540 generates a decision using a machine-trained model of any type, such as a convolutional neural network (CNN) that operates based on an n-gram representation of the user's input text.
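By way of illustration only, the rule-based variant of the search mode selector 540 could be as simple as the following sketch; the trigger phrases are assumptions drawn from the examples above, not an exhaustive rule set.

```python
TEXT_PATH_PHRASES = ("where can i get", "where can i buy", "can i put")
IMAGE_PATH_PHRASES = ("show me similar", "show me this", "show me more like")

def select_search_mode(input_text: str) -> str:
    """Pick 'text', 'image', or 'both' based on key phrases in the input text."""
    text = input_text.lower()
    wants_text = any(phrase in text for phrase in TEXT_PATH_PHRASES)
    wants_image = any(phrase in text for phrase in IMAGE_PATH_PHRASES)
    if wants_text and wants_image:
        return "both"
    if wants_image:
        return "image"
    if wants_text:
        return "text"
    return "text"  # default path when no rule fires

print(select_search_mode("Show me this dress, but in blue"))  # -> image
print(select_search_mode("Where can I get this jacket?"))     # -> text
```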
In other cases, the search mode selector 540 can take the image information generated by the image analysis engine 518 into account when deciding what mode to invoke. For example, users may commonly apply an image-based search mode for certain kinds of objects (such as clothing items), and a text-based search mode for other kinds of objects (such as storefronts). The search mode selector 540 can therefore apply knowledge about the kind of objects in the input image (which it gleans from the image analysis engine 518) in deciding the likely intent of the user.
The response-generating component 606 provides environment-specific logic for mapping the user's reformulated text query into a response. In the example of
The NLG component 608 maps the output of the response-generating component 606 into output text. For example, the response-generating component 606 can provide output information in parametric form. The NLG component 608 can map this output information into human-understandable output text using a lookup table, a machine-trained model, etc.
The functionality of the computing system 502 can be distributed between the servers 704 and the user computing devices 706 in any manner. In one implementation, the servers 704 implement all functions of the computing system 502 of
In the case of
In
In the case of
In
The above examples in
In yet other examples, the computing system 502 can invoke both the text-based retrieval path and the image-based retrieval path. In that case, the query results may include a mix of result snippets associated with websites and candidate images.
As a general principle, the computing system 502 can intermesh text analysis and image analysis in different ways based on plural factors, including the environment-specific configuration of the computing system 502, the nature of the input information, etc. The examples set forth in
The semantic analysis component 1404 can optionally include a domain determination component, an intent determination component, and a slot value determination component. The domain determination component determines the most probable domain associated with a user's input query. A domain pertains to the general theme to which an input query pertains, which may correspond to a set of tasks handled by a particular application, or a subset of those tasks. For example, the input command “find Mission Impossible” pertains to a media search domain. The intent determination component determines an intent associated with a user's input query. An intent corresponds to an objective that a user likely wishes to accomplish by submitting an input message. For example, a user who submits the command “buy Mission Impossible” intends to purchase the movie “Mission Impossible.” The slot value determination component determines slot values in the user's input query. The slot values correspond to information items that an application needs to perform a requested task, upon interpretation of the user's input query. For example, the command “find Jack Nicholson movies in the comedy genre” includes a slot value “Jack Nicholson” that identifies an actor having the name of “Jack Nicholson,” and a slot value “comedy,” corresponding to a requested genre of movies.
The above-summarized components can use respective machine-trained components to perform their respective tasks. For example, the domain determination component may apply any machine-trained classification model, such as a deep neural network (DNN) model, a support vector machine (SVM) model, and so on. The intent determination component can likewise use any machine-trained classification model. The slot value determination component can use any machine-trained sequence-labeling model, such as a conditional random fields (CRF) model, an RNN model, etc. Alternatively, or in addition, the above-summarized components can use rules-based engines to perform their respective tasks. For example, the intent determination component can apply a rule that indicates that any input message that matches the template “purchase <x>” refers to an intent to buy a specified article, where that article is identified by the value of variable x.
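To make the rules-based alternative concrete, a minimal sketch of the “purchase <x>” template rule mentioned above follows; the regular expression and the intent label are assumptions introduced solely for illustration.

```python
import re

PURCHASE_TEMPLATE = re.compile(r"^(?:purchase|buy)\s+(?P<article>.+)$", re.IGNORECASE)

def classify_intent(input_text: str) -> dict:
    """Apply the 'purchase <x>' rule: if the input matches, the intent is to buy
    the article captured by the variable x; otherwise defer to other rules/models."""
    match = PURCHASE_TEMPLATE.match(input_text.strip())
    if match:
        return {"intent": "buy_item", "slots": {"article": match.group("article")}}
    return {"intent": "unknown", "slots": {}}

print(classify_intent("buy Mission Impossible"))
# {'intent': 'buy_item', 'slots': {'article': 'Mission Impossible'}}
```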
In addition, or alternatively, the semantic analysis component 1404 can include a text encoder component that maps the input text into a text-based semantic vector. The text encoder component can perform this task using a convolutional neural network (CNN). For example, the CNN can convert the text information into a collection of n-gram vectors, and then map those n-gram vectors into a text-based semantic vector.
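The n-gram stage of such a text encoder might be approximated as in the following sketch, in which hashed character trigrams stand in for the learned n-gram vectors and a fixed random projection stands in for the machine-learned CNN layers; both are illustrative assumptions rather than the actual model.

```python
import numpy as np

def trigram_counts(text: str, buckets: int = 256) -> np.ndarray:
    """Hash each character trigram of the input text into a fixed-size count vector."""
    padded = f"  {text.lower()} "
    counts = np.zeros(buckets)
    for i in range(len(padded) - 2):
        counts[hash(padded[i:i + 3]) % buckets] += 1.0
    return counts

rng = np.random.default_rng(0)
projection = rng.normal(size=(256, 64))  # stand-in for learned encoder parameters

def text_semantic_vector(text: str) -> np.ndarray:
    """Map the n-gram representation into a (here, untrained) text-based semantic vector."""
    vec = trigram_counts(text) @ projection
    return vec / (np.linalg.norm(vec) + 1e-12)

print(text_semantic_vector("show me this dress in blue").shape)  # (64,)
```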
The semantic analysis component 1404 can include yet other subcomponents, such as a named entity recognition (NER) component. The NER component identifies the presence of terms in the input text that are associated with objects-of-interest, such as particular people, places, products, etc. The NER component can perform this task using a dictionary lookup technique, a machine-trained model, etc.
An offline index-generating component (not shown) can produce the information stored in the index in the data store 1506. In that process, the index-generating component can use the image encoder component 1502 to compute at least one latent semantic vector for each candidate image. The index-generating component stores these vector(s) in an entry in the index associated with the candidate image. The index-generating component can also store one or more textual attributes pertaining to each candidate image. The index-generating component can extract these attributes from various sources. For instance, the index-generating component can extract label information from a caption that accompanies the candidate image (e.g., for the case in which the candidate image originates from a website or document that includes both an image and its caption). In addition, or alternatively, the index-generating component can use the image classification component 520 to classify objects in the image, from which additional label information can be obtained.
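A compressed, purely illustrative sketch of that offline index-building pass appears below; the image encoder, image classifier, and caption source are passed in as hypothetical callables, since the disclosure does not tie the index-generating component to any particular implementation of them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import numpy as np

@dataclass
class IndexEntry:
    semantic_vector: np.ndarray
    attributes: List[str] = field(default_factory=list)  # caption words, class labels, etc.

def build_index(candidate_images: Dict[str, bytes],
                captions: Dict[str, str],
                encode_image: Callable[[bytes], np.ndarray],
                classify_image: Callable[[bytes], List[str]]) -> Dict[str, IndexEntry]:
    """For each candidate image, store its latent semantic vector together with
    textual attributes gathered from its caption and from the image classifier."""
    index: Dict[str, IndexEntry] = {}
    for image_id, image_bytes in candidate_images.items():
        attributes = captions.get(image_id, "").lower().split()
        attributes += [label.lower() for label in classify_image(image_bytes)]
        index[image_id] = IndexEntry(
            semantic_vector=encode_image(image_bytes),
            attributes=sorted(set(attributes)))
    return index
```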
As noted above, in some contexts, the image-based retrieval engine 522 serves a support role in a text-based retrieval operation. For instance, the image-based retrieval engine 522 provides image information that allows the query expansion component 526 to reformulate the user's input text. In another context, the image-based retrieval engine 522 serves the primary role in an image-based retrieval operation. In that case, the candidate images identified by the image-based retrieval engine 522 correspond to the query results themselves.
A supplemental input component 1508 serves a role that is particularly useful in the image-based retrieval path. This component 1508 receives text information from the text analysis engine 516 and (optionally) image information from the image classification component 520. It maps this input information into attribute information. The image-based search engine 1504 uses the attribute information in conjunction with the latent semantic vector(s) provided by the image encoder component 1502 to find the candidate images. For example, consider the example of
The supplemental input component 1508 can map text information to attribute information using any of a variety of techniques. In one case, the supplemental input component 1508 uses a set of rules to perform this task, which can be implemented as a lookup table. For example, the supplemental input component 1508 can apply a rule that causes it to extract any color-related word in the input text as an attribute. In another implementation, the supplemental input component 1508 uses a machine-trained model to perform its mapping function, e.g., by using a sequence-to-sequence RNN to map input text information into attribute information.
In other cases, the user's input text can reveal the user's focus of interest within an input image that contains plural objects. For example, the user's input image can show the full body of a model. Assume that the user's input text reads “Show me jackets similar to the one this person is wearing.” In this case, the supplemental input component 1508 can provide attribute information that includes the word “jacket.” The image-based search component 1504 can leverage this attribute information to eliminate or demote any candidate image that is not tagged with the word “jacket” in the index. This will operate to exclude images that show only pants, shoes, etc.
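By way of a non-limiting illustration, a rule-based version of the supplemental input component 1508 and of the attribute-based filtering step might look like the following; the color and garment vocabularies, the plural-handling heuristic, and the index layout are assumptions made only for this sketch.

```python
COLOR_WORDS = {"red", "blue", "green", "black", "white", "yellow"}
GARMENT_WORDS = {"jacket", "dress", "pants", "shoes", "hat"}

def extract_attributes(input_text: str) -> set:
    """Rule-based mapping of input text to attribute information (colors, garment types)."""
    vocab = COLOR_WORDS | GARMENT_WORDS
    attributes = set()
    for raw in input_text.split():
        word = raw.strip("?.,!").lower()
        for candidate in (word, word[:-1] if word.endswith("s") else word):
            if candidate in vocab:
                attributes.add(candidate)
    return attributes

def filter_candidates(ranked_candidates, index, attributes):
    """Eliminate candidate images whose indexed tags do not cover the attributes."""
    return [(cand_id, score) for cand_id, score in ranked_candidates
            if attributes.issubset(set(index[cand_id]["tags"]))]

index = {"img_1": {"tags": ["jacket", "blue"]},
         "img_2": {"tags": ["pants", "blue"]}}
ranked = [("img_1", 0.91), ("img_2", 0.88)]
attrs = extract_attributes("Show me jackets similar to the one this person is wearing")
print(attrs)                                    # {'jacket'}
print(filter_candidates(ranked, index, attrs))  # keeps only img_1
```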
Advancing to
In each convolution operation, a convolution component moves an n×m kernel (also known as a filter) across an input image (where “input image” in this general context refers to whatever image is fed to the convolutional component). In one implementation, at each position of the kernel, the convolution component generates the dot product of the kernel values with the underlying pixel values of the image. The convolution component stores that dot product as an output value in an output image at a position corresponding to the current location of the kernel. More specifically, the convolution component can perform the above-described operation for a set of different kernels having different machine-learned kernel values. Each kernel corresponds to a different pattern. In early layers of processing, a convolutional component may apply kernels that serve to identify relatively primitive patterns (such as edges, corners, etc.) in the image. In later layers, a convolutional component may apply kernels that find more complex shapes.
In each pooling operation, a pooling component moves a window of predetermined size across an input image (where the input image corresponds to whatever image is fed to the pooling component). The pooling component then performs some aggregating/summarizing operation with respect to the values of the input image enclosed by the window, such as by identifying and storing the maximum value in the window, generating and storing the average of the values in the window, etc.
A fully-connected component can begin its operation by forming a single input vector. It can perform this task by concatenating the rows or columns of the input image (or images) that are fed to it, to form a single input vector. The fully-connected component then feeds the input vector into a first layer of a fully-connected neural network. Generally, each layer j of neurons in the neural network produces output values z_j given by the formula z_j = f(W_j z_{j-1} + b_j), for j = 2, . . . , N. The symbol j−1 refers to a preceding layer of the neural network. The symbol W_j denotes a machine-learned weighting matrix for the layer j, and the symbol b_j refers to a machine-learned bias vector for the layer j. The activation function f(⋅) can be formulated in different ways, such as a rectified linear unit (ReLU).
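As a small numeric illustration of the layer formula above, the following sketch applies a single fully-connected layer with a ReLU activation; the layer sizes and the random stand-in weights are assumptions rather than learned parameters.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Rectified linear unit, one possible choice for the activation f(.)."""
    return np.maximum(x, 0.0)

def fully_connected_layer(z_prev: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute z_j = f(W_j z_{j-1} + b_j) for a single layer j."""
    return relu(W @ z_prev + b)

rng = np.random.default_rng(0)
z1 = rng.normal(size=32)        # input vector formed by flattening the feature maps
W2 = rng.normal(size=(16, 32))  # stand-in for the machine-learned weighting matrix W_2
b2 = rng.normal(size=16)        # stand-in for the machine-learned bias vector b_2
z2 = fully_connected_layer(z1, W2, b2)
print(z2.shape)                  # (16,)
```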
First, a cropping filter component 2002 (“filter component”) operates in those circumstances in which the user's input text is directed to a specific part of an image, rather than the image as a whole. For example, assume that the user provides the input text “What is this?” Here, it is apparent that the user is interested in the principal object captured by an accompanying image. In another case, assume that the user provides the input text “Where can I buy that jacket he is wearing?” or “Who is that person standing farthest to the left?” The filter component 2002 operates in this case by using the input text to select a particular part of the image information (provided by the image analysis engine 518) that is relevant to the user's current focus of interest, while optionally ignoring the remainder of the image information. For example, upon concluding that the user is interested in a particular person in an image that contains plural people, the filter component 2002 can select the labels, keywords, etc. associated with this person, and ignore the textual information regarding other people in the image.
The filter component 2002 can operate by applying context-specific rules. One such rule determines whether any term in the input text is co-referent with any term in the image information. For example, assume that the user's input text reads “What kind of tree is that?”, while the image shows plural objects, such as a tree, a person, and a dog. Further assume that the image analysis engine 518 recognizes the tree and provides the label “elm tree” in response thereto, along with labels associated with other objects in the image. The filter component 2002 can determine that the word “tree” in the input text matches the word “tree” in the image information. In response to this insight, the filter component 2002 can retain the part of the image information that pertains to the tree, while optionally discarding the remainder that is not relevant to the user's input text.
Another rule determines whether: (1) the image analysis engine 518 identifies plural objects; and (2) the input text includes positional information indicative of the user's interest in part of the input image. If so, then the filter component 2002 can use the positional information to retain the appropriate part of the image information, while discarding the remainder. For example, assume that the input image shows two men standing side-by-side. And assume that the user's input text reads “Who is the man on the left?” In response, the filter component 2002 can select object information pertaining to the person who is depicted on the left side of the image.
Still more complex rules can be used that combine aspects of the above two kinds of rules. For example, the user's text may read “Who is standing to the left of the woman with red hair?” The filter component 2002 can first consult the image information to identify the object corresponding to a woman with red hair. The filter component 2002 can then consult the image information to find the object that is positioned to the left of the woman having red hair.
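The two kinds of rules described above might be combined roughly as in the following sketch; the object records, the positional keywords, and the label-matching heuristic are illustrative assumptions only.

```python
def filter_image_information(input_text: str, objects: list) -> list:
    """Retain only the object records relevant to the user's input text.
    Each record is assumed to carry a 'label' and the left edge 'x' of its bounding box,
    as produced by the image analysis engine."""
    words = {w.strip("?.,!").lower() for w in input_text.split()}

    # Rule 1: keep objects whose label shares a term with the input text (co-reference).
    matched = [obj for obj in objects if words & set(obj["label"].lower().split())]
    candidates = matched or objects

    # Rule 2: if the text contains positional information, narrow the selection further.
    if "left" in words:
        candidates = [min(candidates, key=lambda obj: obj["x"])]
    elif "right" in words:
        candidates = [max(candidates, key=lambda obj: obj["x"])]
    return candidates

objects = [{"label": "elm tree", "x": 40},
           {"label": "man", "x": 120},
           {"label": "man", "x": 260}]
print(filter_image_information("Who is the man on the left?", objects))
# -> [{'label': 'man', 'x': 120}]
```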
A disambiguation component 2004 modifies the text information based on the image information to reduce ambiguity in the text information. As one part of its analysis, the disambiguation component 2004 performs co-reference resolution. It does so by identifying an ambiguous term in the input text, such as a pronoun (e.g., “he,” “she,” “their,” “this,” “it,” “that,” etc.). It then replaces or supplements the ambiguous term with textual information extracted from the image information. For example, per the example of
The disambiguation component 2004 can operate based on a set of context-specific rules. According to one rule, the disambiguation component 2004 replaces an ambiguous term in the input text with a gender-appropriate entity name specified in the image information. In another implementation, the disambiguation component 2004 can use a machine-trained model to determine the best match between an ambiguous term in the input text and plural entity names specified in the image information. A training system can train this model based on a corpus of training examples. Each positive training example pairs an ambiguous term in an instance of input text with an appropriate label associated with an object in an image.
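A minimal rule-based sketch of that pronoun-replacement operation follows; the pronoun/gender table and the entity records are hypothetical and stand in for whatever labels the image analysis engine actually supplies.

```python
PRONOUN_GENDER = {"he": "male", "him": "male", "she": "female", "her": "female"}

def resolve_pronouns(input_text: str, entities: list) -> str:
    """Replace gender-marked pronouns with a gender-appropriate entity name drawn
    from the image information (e.g., a label supplied by the image analysis engine)."""
    resolved = []
    for raw in input_text.split():
        word = raw.strip("?.,!").lower()
        gender = PRONOUN_GENDER.get(word)
        replacement = next((e["name"] for e in entities if e.get("gender") == gender), None)
        resolved.append(replacement if gender and replacement else raw)
    return " ".join(resolved)

entities = [{"name": "the man shown in the image", "gender": "male"}]
print(resolve_pronouns("What movies has he appeared in?", entities))
# -> "What movies has the man shown in the image appeared in?"
```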
The query expansion component 526 can also include one or more additional classification components 2006. Each classification component can receive text information from the text analysis engine 516 and image information from the image analysis engine 518. It can then generate a classification result that depends on insight extracted from the input text and input image. For example, one such classification component can classify the intent of the user's plural-mode query based on both the text information and the image information.
While the query expansion component 526 plays a role in the text-based retrieval path, the supplemental input component 1508 (of
B. Illustrative Processes
In block 2202 of
C. Representative Computing Functionality
The computing device 2502 can include one or more hardware processors 2504. The hardware processor(s) can include, without limitation, one or more Central Processing Units (CPUs), and/or one or more Graphics Processing Units (GPUs), and/or one or more Application Specific Integrated Circuits (ASICs), etc. More generally, any hardware processor can correspond to a general-purpose processing unit or an application-specific processor unit.
The computing device 2502 can also include computer-readable storage media 2506, corresponding to one or more computer-readable media hardware units. The computer-readable storage media 2506 retains any kind of information 2508, such as machine-readable instructions, settings, data, etc. Without limitation, for instance, the computer-readable storage media 2506 may include one or more solid-state devices, one or more magnetic hard disks, one or more optical disks, magnetic tape, and so on. Any instance of the computer-readable storage media 2506 can use any technology for storing and retrieving information. Further, any instance of the computer-readable storage media 2506 may represent a fixed or removable unit of the computing device 2502. Further, any instance of the computer-readable storage media 2506 may provide volatile or non-volatile retention of information.
The computing device 2502 can utilize any instance of the computer-readable storage media 2506 in different ways. For example, any instance of the computer-readable storage media 2506 may represent a hardware memory unit (such as Random Access Memory (RAM)) for storing transient information during execution of a program by the computing device 2502, and/or a hardware storage unit (such as a hard disk) for retaining/archiving information on a more permanent basis. In the latter case, the computing device 2502 also includes one or more drive mechanisms 2510 (such as a hard drive mechanism) for storing and retrieving information from an instance of the computer-readable storage media 2506.
The computing device 2502 may perform any of the functions described above when the hardware processor(s) 2504 carry out computer-readable instructions stored in any instance of the computer-readable storage media 2506. For instance, the computing device 2502 may carry out computer-readable instructions to perform each block of the processes described in Section B.
Alternatively, or in addition, the computing device 2502 may rely on one or more other hardware logic units 2512 to perform operations using a task-specific collection of logic gates. For instance, the hardware logic unit(s) 2512 may include a fixed configuration of hardware logic gates, e.g., that are created and set at the time of manufacture, and thereafter unalterable. Alternatively, or in addition, the other hardware logic unit(s) 2512 may include a collection of programmable hardware logic gates that can be set to perform different application-specific tasks. The latter category of devices includes, but is not limited to Programmable Array Logic Devices (PALs), Generic Array Logic Devices (GALs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), etc.
In some cases (e.g., in the case in which the computing device 2502 represents a user computing device), the computing device 2502 also includes an input/output interface 2516 for receiving various inputs (via input devices 2518), and for providing various outputs (via output devices 2520). Illustrative input devices include a keyboard device, a mouse input device, a touchscreen input device, a digitizing pad, one or more static image cameras, one or more video cameras, one or more depth camera systems, one or more microphones, a speech recognition mechanism, any movement detection mechanisms (e.g., accelerometers, gyroscopes, etc.), and so on. One particular output mechanism may include a display device 2522 and an associated graphical user interface presentation (GUI) 2524. The display device 2522 may correspond to a liquid crystal display device, a light-emitting diode display (LED) device, a cathode ray tube device, a projection mechanism, etc. Other output devices include a printer, one or more speakers, a haptic output mechanism, an archival mechanism (for storing output information), and so on. The computing device 2502 can also include one or more network interfaces 2526 for exchanging data with other devices via one or more communication conduits 2528. One or more communication buses 2530 communicatively couple the above-described units together.
The communication conduit(s) 2528 can be implemented in any manner, e.g., by a local area computer network, a wide area computer network (e.g., the Internet), point-to-point connections, etc., or any combination thereof. The communication conduit(s) 2528 can include any combination of hardwired links, wireless links, routers, gateway functionality, name servers, etc., governed by any protocol or combination of protocols.
The following summary provides a non-exhaustive set of illustrative aspects of the technology set forth herein.
According to a first aspect, one or more computing devices for providing query results are described. The computing device(s) include hardware logic circuitry, itself including: (a) one or more hardware processors that perform operations by executing machine-readable instructions stored in a memory, and/or (b) one or more other hardware logic units that perform operations using a task-specific collection of logic gates. The operations include: receiving an input image from a user in response to interaction by the user with a camera or a graphical control element that allows the user to select an already-existing image; receiving an instance of input text from the user in response to interaction by the user with a text input device and/or a speech input device; identifying at least one object depicted by the input image using an image analysis engine, to provide image information, the image analysis engine being implemented by the hardware logic circuitry; identifying one or more characteristics of the input text using a text analysis engine, to provide text information, the text analysis engine being implemented by the hardware logic circuitry; providing query results based on the image information and the text information; and sending the query results to an output device.
According to a second aspect, the operation of receiving the input image and the operation of receiving the input text occur within a single turn of a query session.
According to a third aspect, relating to the second aspect, the operation of receiving the input image and the operation of receiving the input text occur in response to interaction by the user with a user interface presentation that enables the user to provide the input image and the input text in the single turn.
According to a fourth aspect, the operation of receiving the input image occurs in a first turn of a query session and the operation of receiving the input text occurs in a second turn of the query session, the first turn occurring prior to or after the second turn.
According to a fifth aspect, the query results are provided by a question-answering engine. The operations further include: determining that a dialogue state of a dialogue has been reached in which a search intent of a user remains unsatisfied after one or more query submissions; and prompting the user to submit another input image in response to the operation of determining.
According to a sixth aspect, the operation of identifying one or more characteristics of the input text includes identifying an intent of the user in submitting the text.
According to a seventh aspect, the operation of identifying at least one object in the input image includes using a machine-trained classification model to identify the at least one object.
According to an eighth aspect, relating to the seventh aspect, the operations further include selecting the at least one machine-trained classification model based on the text information provided by the text analysis engine.
According to a ninth aspect, the operation of identifying at least one object includes: mapping the input image into one or more latent semantic vectors; and identifying one or more candidate images that match the input image based on the one or more latent semantic vectors, to provide the image information.
According to a tenth aspect, relating to the ninth aspect, the operation of identifying one or more candidate images is further constrained to find the one or more candidate images based on at least part of the text information provided by the text analysis engine. Here, the query results include the image information itself.
According to an eleventh aspect, the operation of providing includes: modifying the text information based on the image information to produce a reformulated text query; submitting the reformulated text query to a text-based query-processing engine; and receiving, in response to operation of submitting, the query results from the text-based query-processing engine.
According to a twelfth aspect, relating to the eleventh aspect, the operation of modifying includes replacing a first term in the input text with a second term included in the image information, or appending the second term to the input text.
According to a thirteenth aspect, relating to the twelfth aspect, the operations further include using the input text to filter the image information, to select the second term that is used to modify the input text.
According to a fourteenth aspect, the operations further include: selecting a search mode for use in providing the query results based on the image information and/or the text information. The operation of providing provides the query results in a manner that conforms to the search mode.
According to a fifteenth aspect, a computer-implemented method is described for providing query results. The method includes: providing a user interface presentation that enables a user to input query information using two or more input devices; receiving an input image from the user in response to interaction by the user with the user interface presentation; receiving an instance of input text from the user in response to interaction by the user with the user interface presentation, the operation of receiving the input text occurring in a same turn of a query session as the operation of receiving the input image; identifying at least one object depicted by the input image using an image analysis engine, to provide image information; identifying one or more characteristics of the input text using a text analysis engine, to provide text information; providing query results based on the image information and the text information; and sending the query results to an output device.
According to a sixteenth aspect, relating to the fifteenth aspect, the operation of identifying at least one object includes: mapping the input image into one or more latent semantic vectors; and identifying one or more candidate images that match the input image based on the one or more latent semantic vectors, to provide the image information. The operation of identifying one or more candidate images is further constrained to find the one or more candidate images based on at least part of the text information provided by the text analysis engine. Here, the query results include the image information itself.
According to a seventeenth aspect, relating to the fifteenth aspect, the operation of providing includes: modifying the text information based on the image information to produce a reformulated text query; submitting the reformulated text query to a text-based query-processing engine; and receiving, in response to the operation of submitting, the query results from the text-based query-processing engine.
According to an eighteenth aspect, a computer-readable storage medium for storing computer-readable instructions is described. The computer-readable instructions, when executed by one or more hardware processors, perform a method that includes: receiving an input image from a user in response to interaction by the user with a camera or a graphical control element that allows the user to select an already-existing image; receiving additional input information from the user in response to interaction by the user with another input device, the operation of receiving additional input information using a different mode of expression compared to the operation of receiving an input image; identifying at least one object depicted by the input image using an image analysis engine, to provide image information; identifying one or more characteristics of the additional input information using another analysis engine, to provide added information; selecting a search mode for use in providing query results based on the image information and/or the added information; providing the query results based on the image information and the added information in a manner that conforms to the search mode; and sending the query results to an output device.
According to a nineteenth aspect, relating to the eighteenth aspect, the operation of receiving the input image and the operation of receiving the additional input information occur within a single turn of a query session.
According to a twentieth aspect, relating to the eighteenth aspect, the additional input information is text.
A twenty-first aspect corresponds to any combination (e.g., any logically consistent permutation or subset) of the above-referenced first through twentieth aspects.
A twenty-second aspect corresponds to any method counterpart, device counterpart, system counterpart, means-plus-function counterpart, computer-readable storage medium counterpart, data structure counterpart, article of manufacture counterpart, graphical user interface presentation counterpart, etc. associated with the first through twenty-first aspects.
In closing, the functionality described herein can employ various mechanisms to ensure that any user data is handled in a manner that conforms to applicable laws, social norms, and the expectations and preferences of individual users. For example, the functionality can allow a user to expressly opt in to (and then expressly opt out of) the provisions of the functionality. The functionality can also provide suitable security mechanisms to ensure the privacy of the user data (such as data-sanitizing mechanisms, encryption mechanisms, password-protection mechanisms, etc.).
Further, the description may have set forth various concepts in the context of illustrative challenges or problems. This manner of explanation is not intended to suggest that others have appreciated and/or articulated the challenges or problems in the manner specified herein. Further, this manner of explanation is not intended to suggest that the subject matter recited in the claims is limited to solving the identified challenges or problems; that is, the subject matter in the claims may be applied in the context of challenges or problems other than those described herein.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.