With the arrival of the Internet Age, accessing information from sources around the world can be as simple as a few keystrokes and mouse clicks on a networked device. Information such as text, images and video clips can be uploaded to a given database and downloaded from the database through the Internet. When a user desires to obtain certain information from the Internet, the user typically enters a user query via a user interface, such as an Internet browser, on a personal computer, laptop computer, mobile phone, or any other device that is connected to the Internet. The user query is provided to a search engine that conducts a search based on the user query and retrieves results that are displayed to the user for further action.
As the amount of image content on the Internet grows, more and more images are available on the Internet for viewing, commenting, sharing and downloading. To facilitate searching for desired images by users of the Internet, image search engines have been developed. Existing image search engines often provide a separate interface for a user to enter the user query, which typically consists of a textual input entered by the user. The textual input can be entered, for example, by the user keying in text in a user query input box in the interface provided by the image search engine. Alternatively, the textual input can be entered by the user copying a word or phrase from a document, e.g., a web page, and pasting the copied word or phrase into the user query input box. The image search engine then uses the user query to search for and retrieve a set of images in an order ranked according to how closely the text in the user query matches the text associated with each of the retrieved images.
When the user query consists of a word or phrase copied from a document, such as the web page that the user is viewing at the time, the document likely contains contextual information that can help refine the meaning of the user query and, more specifically, the intent of the user. Under the aforementioned approach, however, only the textual input entered by the user is considered for the image search, while the context surrounding the copied word or phrase is not taken into account by the image search engine. Consequently, the results of such an image search may be limited and less than optimal.
Techniques for image search using contextual information related to a user query are described. One technique first ranks images retrieved from a search according to a user query that includes textual data and then re-ranks the images according to contextual information related to the textual data. In other techniques, the retrieved images are first ranked according to a user query that includes image data and then re-ranked according to contextual information related to the image data.
This summary is provided to introduce concepts relating to contextual image search. These techniques are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same reference numbers in different figures indicate similar or identical items.
This disclosure describes techniques for image search using contextual information related to a user query. When a user views a document on a computing device, the user may select a word, phrase, image or video frame that is part of the document and submit the selected word, phrase, image or video frame as the user query to a client software application on the computing device for an image search. The client software application may automatically capture contextual information associated with the selected word, phrase, image or video frame and submit both the user query and the contextual information to a contextual image search engine. The contextual information may include one or more texts, images or video frames surrounding the selected word, phrase, image or video frame. Accordingly, the image search is based not only on the user query but is also augmented by the contextual information related to the user query.
Images are retrieved from the image search based on a match between the user query and the retrieved images. The retrieved images are pre-ranked according to the similarity between the user query and at least one attribute of each of these images. Afterwards, the retrieved images are re-ranked according to the similarity between the contextual information and at least one attribute of each of these images. Finally, the retrieved images are presented to the user in the re-ranked order.
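As a rough illustration of this two-stage flow, the sketch below (in Python) pre-ranks a set of retrieved candidates by a query-similarity score and then re-ranks them by a context-similarity score. The scoring callables are assumptions standing in for whatever similarity measures a particular implementation uses; this is a minimal sketch, not the engine's actual code.

```python
from typing import Any, Callable, Iterable, List

def contextual_image_search(
    candidates: Iterable[Any],
    query_score: Callable[[Any], float],    # similarity of a candidate image to the user query
    context_score: Callable[[Any], float],  # similarity of a candidate image to the contextual information
) -> List[Any]:
    """Pre-rank retrieved candidates by the user query, then re-rank them by context."""
    # Pre-ranking: order the retrieved images by their similarity to the user query.
    pre_ranked = sorted(candidates, key=query_score, reverse=True)
    # Re-ranking: reorder the pre-ranked images by their similarity to the contextual
    # information; ties keep their pre-ranked order because the sort is stable.
    return sorted(pre_ranked, key=context_score, reverse=True)
```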
The contextual image search engine may be implemented in the form of computer programs, instructions, code, logic or computer hardware that executes a contextual image search algorithm. Although the contextual image search engine may reside on a server that is communicatively coupled to the user's computing device, the contextual image search engine may alternatively reside, partially or entirely, on the computing device. In the case that the contextual image search engine resides on the computing device, the client software application may be a part of the contextual image search engine. Moreover, in addition to searching one or more databases on the Internet or a local network, the image search may also be conducted on a local database in the computing device itself such as, for example, the local drive of a personal computer.
While aspects of the described techniques relating to contextual image search can be implemented in any number of different computing systems, environments, and/or configurations, embodiments are described in the context of the following exemplary system architecture(s).
When viewing the document 110, the user may desire to look up images related to textual data, such as a word or phrase, or image data, such as an image or a frame of a video clip, contained in the document 110. To do so, the user selects and submits at least one word, phrase, image, or video frame as the user query 120 to a contextual image search engine, which then retrieves still images or videos based on the submitted user query 120. In one embodiment, the selected textual or image data is highlighted by the user. Alternatively, other known methods of selecting textual or image data from a document may be employed. The submission of the selected textual or image data as the user query 120 to the contextual image search engine may be rendered by a client software application that resides on the computing device. In the interest of brevity, details of selecting textual or image data from the document 110 and submitting the selected textual or image data as the user query 120 to the contextual image search engine will not be described herein.
With textual or image data selected from the document 110 and identified as the user query 120, the client software application performs context extraction 160 to extract, or capture, contextual information 170 from the document 110. In general, contextual information 170 refers to additional data from the document 110 that is different from, yet related to, the user query 120, whether the user query 120 includes textual data (denoted as $q_T$) or image data (denoted as $q_I$). Contextual information 170 of the user query 120 may contain at least one of three types of elements, namely a textual element 170a, an image element 170b and a video element 170c.
The textual element 170a, denoted as $(t_c, w_T)$, is a dense representation that can be obtained by analyzing the document 110. The textual element 170a is represented in a vector space model by the vector $t_c$, and the corresponding term weights are denoted by $w_T$. In this model, the terms extracted into the contextual information 170 are associated with weights that represent the importance of each term.
The image element 170b is obtained by analyzing the document 110, and may include one or more images and/or texts surrounding the images. The image element 170b is denoted as $(I_c, T_I, w_I)$, where $I_c$ and $T_I$ are matrices with each column corresponding to a respective one of the images, and where $w_I$ is the vector of weights of the images. In one embodiment, features such as color moments and shape features are extracted to represent the images. Each image is associated with a weight that represents its importance according to the distance between the respective image and the user query 120.
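As one concrete, non-limiting illustration of such visual features, the following sketch computes per-channel color moments (mean, standard deviation and a skewness term) with NumPy. The function name and the exact set of moments are assumptions made for the example; a real feature extractor may also include shape or texture descriptors.

```python
import numpy as np

def color_moments(image: np.ndarray) -> np.ndarray:
    """Per-channel color moments (mean, standard deviation, skewness term) of an H x W x C image."""
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)  # (H*W, C)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    # Skewness term: signed cube root of the third central moment per channel.
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])  # feature vector of length 3 * C
```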
Similarly, the video element 170c is obtained by analyzing the document 110, and may include one or more videos and/or texts surrounding each of the videos. The video element 170c is denoted as $(V_c, T_V, w_V)$, where $V_c$ and $T_V$ are matrices with each column corresponding to a respective one of the videos, and where $w_V$ is the vector of weights of the videos. In one embodiment, visual features of certain key frames of each video are extracted.
In the event that the user query 120 consists of textual data, the textual element 170a of contextual information 170 is captured as described below. Textual data occurring spatially around the textual data contained in the user query 120 and the title of the document 110 are extracted as the textual element 170a, which is represented as a vector. The associated weights are set according to the spatial distance from the user query 120, and the title of the document 110 is assigned a smaller weight.
In the event that the user query 120 consists of a selected image or video frame, the textual element 170a of contextual information 170 is captured as described below. Textual data occurring spatially around the user query 120, the file name of the selected image contained in the user query 120 and the title of the document 110 are extracted as the textual element 170a, which is represented as a vector. In this case, the textual element 170a includes one or more suggested textual queries. The associated weights are set according to the spatial distance from the user query 120, the file name of the selected image is assigned a larger weight, and the title of the document 110 is assigned a smaller weight.
The image element 170b of contextual information 170 is captured in the same manner whether the user query 120 consists of textual data or image data. All of the images in the document 110 are included, and the texts surrounding these images are also extracted. The weights are set according to the distance from the user query 120. The video element 170c of contextual information 170 is captured similarly to how the image element 170b is captured. As techniques for extracting contextual information 170 are not the focus of the present disclosure, details of context extraction 160 will not be described in the interest of brevity.
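Nonetheless, for illustration, a minimal sketch of one possible weighting scheme is given below, assuming each candidate context term comes with a pre-computed spatial distance from the user query. The inverse-distance decay and the specific title and file-name weights are illustrative assumptions, not values prescribed by the technique.

```python
from typing import Dict, List, Optional, Tuple

def textual_context_weights(
    nearby_terms: List[Tuple[str, float]],         # (term, spatial distance from the user query)
    title_terms: List[str],
    file_name_terms: Optional[List[str]] = None,   # present only for an image or video-frame query
) -> Dict[str, float]:
    """Build the textual element as a term -> weight mapping (a sparse form of (t_c, w_T))."""
    weights: Dict[str, float] = {}
    for term, distance in nearby_terms:
        # Terms spatially closer to the user query receive larger weights (inverse-distance decay).
        weights[term] = max(weights.get(term, 0.0), 1.0 / (1.0 + distance))
    for term in title_terms:
        # The document title is assigned a comparatively small weight.
        weights[term] = max(weights.get(term, 0.0), 0.1)
    for term in file_name_terms or []:
        # For an image or video-frame query, the file name is assigned the largest weight.
        weights[term] = max(weights.get(term, 0.0), 2.0)
    return weights
```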
Upon receiving the user query 120, the contextual image search engine performs search and pre-ranking 130 of images based on the user query 120 to retrieve and rank images that have at least one attribute matching the user query 120. During the image search, the contextual image search engine examines a plurality of images or image files stored in one or more databases to retrieve images with at least one attribute that matches the user query 120. For example, when the user query 120 includes textual data, the retrieved images have associated texts, such as the respective file names, matching the textual data of the user query 120. The initial result of the search by the contextual image search engine is a first set of images from the plurality of images examined by the contextual image search engine. An image file refers to a file that contains one image, and may also contain textual information describing, or otherwise associated with, the image in the file.
In pre-ranking the retrieved images when the user query 120 consists of textual data, the textual data of the user query 120 is used to rank the retrieved images to provide an ordered, or pre-ranked, set of images 140, denoted as $\{I_1, I_2, \ldots, I_n\}$, with rank values $\{r_1, r_2, \ldots, r_n\}$. Techniques for ranking the retrieved images are well known in the art and will not be described in detail in the interest of brevity.
With the pre-ranked set of images 140, the contextual image search engine performs re-ranking 180 of the pre-ranked set of images 140 based on contextual information 170 to provide a re-ranked set of images 150. The re-ranked set of images 150 is displayed on the computing device as the search result for viewing by the user.
In re-ranking the pre-ranked set of images 140, one or more of the textual element 170a, the image element 170b and the video element 170c of contextual information 170 may be used. More specifically, a rank $\check{r}_i$ is computed for each image $I_i$, where $\check{r}_i$ is a combination of a rank based on the textual element 170a, a rank based on the image element 170b and a rank based on the video element 170c.
To obtain the rank based on the textual element 170a, the weighted similarity between the texts in the textual element 170a and the texts associated with each image of the pre-ranked set of images 140 is computed. A sparse word similarity matrix $W$, with each entry representing the similarity between the corresponding pair of words, is used for this purpose. Mathematically, the rank based on the textual element 170a is expressed as follows:

$\check{r}_{t_i} = t_c^{\mathsf{T}} W\, t_i,$

where $t_c$ is the weighted term vector of the textual element 170a and $t_i$ is the textual data associated with image $I_i$.
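For illustration, a similarity of this form can be computed with a sparse word similarity matrix as sketched below with SciPy; the toy vocabulary, the matrix entries and the function name are assumptions made only for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def text_rank(t_c: np.ndarray, W: csr_matrix, t_i: np.ndarray) -> float:
    """Compute t_c^T W t_i for term vectors over a shared vocabulary."""
    return float(t_c @ (W @ t_i))

# Toy usage over a three-word vocabulary; W[j, k] is the similarity of word j to word k.
W = csr_matrix(np.array([[1.0, 0.3, 0.0],
                         [0.3, 1.0, 0.2],
                         [0.0, 0.2, 1.0]]))
t_c = np.array([0.8, 0.0, 0.5])  # weighted terms of the textual element 170a
t_i = np.array([1.0, 0.4, 0.0])  # terms associated with image I_i
print(text_rank(t_c, W, t_i))
```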
To obtain the rank based on the image element 170b, the weighted aggregation of the ranks of all the images in the image element 170b is computed. The rank contribution for each image in the image element 170b consists of two components: one from the surrounding texts and the other from the visual feature of the respective image. The rank contribution from the text of image $I_k$ is similar to the rank based on the textual element 170a, and is mathematically expressed as follows:

$\check{r}^{I_t}_{ki} = t_{I_k}^{\mathsf{T}} W\, t_i,$

where $t_{I_k}$ is the textual data associated with image $I_k$ in the image element 170b, and $t_i$ is the textual data associated with image $I_i$.
The rank contribution from the visual information is obtained as follows:

$\check{r}^{I_v}_{ki} = \left(f_{I_k} - f_i\right)^{\mathsf{T}}\left(f_{I_k} - f_i\right),$

where $f_{I_k}$ is the visual feature of image $I_k$ in the image element 170b and $f_i$ is the visual feature of image $I_i$.
Then, the rank based on the image element 170b is expressed as the weighted aggregation of these contributions over the images in the image element 170b:

$\check{r}_{I_i} = \sum_k w_{I_k}\left(\check{r}^{I_t}_{ki} + \check{r}^{I_v}_{ki}\right),$

where $w_{I_k}$ is the weight of image $I_k$ in the image element 170b.
The rank based on the video element 170c can be obtained similarly to the rank based on the image element 170b. The rank contribution for each video in the video element 170c consists of two components: one from the surrounding texts and the other from the visual features of its key frames. The rank contribution from the text can be mathematically expressed as follows:

$\check{r}^{V_t}_{ki} = t_{V_k}^{\mathsf{T}} W\, t_i,$

where $t_{V_k}$ is the textual data associated with video $V_k$ in the video element 170c, and $t_i$ is the textual data associated with image $I_i$.
The rank contribution $\check{r}^{V_v}_{ki}$ from the visual information of video $V_k$ is obtained analogously to the image case, by aggregating the visual distances $\left(f_{V_{jk}} - f_i\right)^{\mathsf{T}}\left(f_{V_{jk}} - f_i\right)$ over the key frames of video $V_k$, where $f_{V_{jk}}$ is the visual feature of the $j$th key frame of video $V_k$.
Then, the rank based on the video element 170c is expressed as the weighted aggregation of these contributions over the videos in the video element 170c:

$\check{r}_{V_i} = \sum_k w_{V_k}\left(\check{r}^{V_t}_{ki} + \check{r}^{V_v}_{ki}\right),$

where $w_{V_k}$ is the weight of video $V_k$ in the video element 170c.
The final rank of an image is obtained by combining the above ranks, and is used to order the pre-ranked set of images 140 into the re-ranked set of images 150. The final rank can be mathematically expressed as follows:

$\check{r}_i = \beta\, r_i + (1 - \beta)\left(\check{r}_{t_i} + \check{r}_{I_i} + \check{r}_{V_i}\right),$

where $r_i$ is the pre-rank value of image $I_i$ and $\beta$ is a weighting parameter that balances the pre-ranking score against the context-based scores.
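The combination above can be sketched in code as follows, assuming the per-element context scores have already been computed as described and that larger rank values indicate better matches; the data structure and the default value of β are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ContextScores:
    text: float    # rank based on the textual element 170a
    image: float   # rank based on the image element 170b
    video: float   # rank based on the video element 170c

def final_rank(pre_rank: float, scores: ContextScores, beta: float = 0.5) -> float:
    """Combine the pre-rank value r_i with the context-based ranks."""
    return beta * pre_rank + (1.0 - beta) * (scores.text + scores.image + scores.video)

def re_rank(pre_ranks: List[float], context_scores: List[ContextScores], beta: float = 0.5) -> List[int]:
    """Return the image indices ordered by the combined rank, highest first."""
    combined = [final_rank(r, s, beta) for r, s in zip(pre_ranks, context_scores)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```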
In at least one configuration, computing device 200 typically includes at least one processing unit 202 and system memory 204. Depending on the exact configuration and type of computing device, system memory 204 may be volatile (such as random-access memory, or RAM), non-volatile (such as read-only memory, or ROM, flash memory, etc.) or some combination thereof. System memory 204 may include an operating system 206, one or more program modules 208, and may include program data 210. The computing device 200 is of a very basic configuration demarcated by a dashed line 214. Again, a terminal may have fewer components but may interact with a computing device that may have such a basic configuration.
The program module 208 includes a contextual image search module 212. The contextual image search module 212 retrieves images based on a match between the user query 120 and the retrieved images. The contextual image search module 212 may carry out one or more processes as described herein with reference to the accompanying figures.
In one embodiment, the contextual image search module 212 pre-ranks the retrieved images to provide the pre-ranked set of images 140 according to similarity between the user query 120 and at least one attribute of each of these images. The contextual image search module 212 then re-ranks the pre-ranked set of images 140 to provide the re-ranked set of images 150 according to similarity between the contextual information 170 and at least one attribute of each image of the pre-ranked set of images 140. Finally, the re-ranked set of images 150 is presented to the user in the re-ranked order, for example, by being displayed on the output device 222 of the computing device 200 or on another computing device 226.
In another embodiment, the contextual image search module 212 receives a user query entered by a user. The user query includes textual data, such as one or more words, or image data, such as an image, and is selected from a collection of data, such as data displayed on a web page on a computing device. The contextual image search module 212 also receives another set of data from the collection of data as contextual information that is related to the user query but different from the user query. The contextual image search module 212 identifies a first subset of data files from data files stored in one or more databases, where the first subset of data files are ranked in a first order. That is, the data files of the identified first subset are ranked in an order according to similarity between information contained in the user query and at least one attribute of some or all of the data files of the data files searched. In one embodiment, the data files are image files each containing an image. For example, where the user query is an image displayed on the web page, each of the identified data files of the first subset may contain an image that has some attribute similar to the respective attribute of the image of the user query. In another embodiment, the data files are video files each containing a video clip that includes a plurality of video frames. Accordingly, each of the identified data files of the first subset may contain a video frame that has some attribute similar to the respective attribute of the image of the user query. The contextual image search module 212 then identifies a second subset of data files from the first subset, where the data files of the second subset are ranked in a second order according to similarity between the contextual information and at least one attribute of some or all of the data files of the first subset. The number of data files in the second subset may be less than or equal to the number of data files in the first subset. Thereafter, images representative of the data files of the second subset are provided to an output device 222, or another display device not part of the computing device 200, to be displayed in the second order.
Computing device 200 may have additional features or functionality. For example, computing device 200 may also include additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape. Such additional storage is illustrated in the accompanying figures.
Computing device 200 may also contain communication connections 224 that allow the computing device 200 to communicate with other computing devices 226, such as over a network which may include one or more wired networks as well as wireless networks. Communication connections 224 are some examples of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, etc.
It is appreciated that the illustrated computing device 200 is only one example of a suitable device and is not intended to suggest any limitation as to the scope of use or functionality of the various embodiments described. Other well-known computing devices, systems, environments and/or configurations that may be suitable for use with the embodiments include, but are not limited to, personal computers (PCs), server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and/or the like.
Context extraction 360 is performed to obtain contextual information 370 from the document 310. Contextual information 370 is related to and different from the textual data contained in the user query 320, and may include a textual element 370a, an image element 370b, a video element 370c or a combination thereof. For example, the textual element 370a may include the text displayed spatially around the user query 320 and the title of the displayed document 310, the image element 370b may include other images displayed in the document 310, and the video element 370c may include one or more frames from a video clip included in the document 310. With contextual information 370, the first subset of images 340 are ranked in a re-ranked order according to similarity between contextual information 370 and at least one attribute of the images of the first subset to provide a second subset of images 350. When displayed to the user, the images of the second subset of images 350 are displayed in the re-ranked order.
In one embodiment, the actions of searching, pre-ranking and re-ranking of images as depicted in the architecture 300 are performed by a computing device like the computing device 200 described above.
A suggested textual query 420, which includes textual data 422 from the document 410, is used to perform a text-based image search 425. In one embodiment, the suggested textual query 420 is obtained by dividing the text surrounding the user query 415 into a number of keywords that serve as the textual data 422. Context extraction 460, on the other hand, provides contextual information 470 that includes a textual element 470a, an image element 470b and a video element 470c. Contextual information 470 is related to and different from the image data contained in the user query 415. The textual data 422 contained in the suggested textual query 420 may be part of the textual element 470a of contextual information 470. Depending on the number of words and/or phrases in the textual data 422, in one embodiment, the text-based image search 425 yields a number of sets of images 428a-428c, where each set of images corresponds to a respective one of the words and/or phrases in the textual data 422.
The sets of images 428a-428c are pre-ranked using the user query 415, which is an image query containing image data, to provide a first subset of images 440. The images 440 of the first subset are ranked in the pre-ranked order according to similarity between the user query 415 and at least one attribute, such as color moment or visual feature, of each image of the first subset of images 440. With contextual information 470, the first subset of images 440 are ranked in a re-ranked order according to similarity between contextual information 470 and at least one attribute of the images of the first subset to provide a second subset of images 450. When displayed to the user, the second subset of images 450 is displayed in the re-ranked order.
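A condensed sketch of this pre-ranking step is given below: the result sets retrieved for the suggested keywords are merged and then ordered by visual distance to the query image. The use of a squared Euclidean distance over precomputed feature vectors, and the helper names, are assumptions made for the example.

```python
import numpy as np
from typing import Dict, Iterable, List

def pre_rank_by_image(
    keyword_result_sets: Iterable[Iterable[str]],  # one retrieved image-id set per suggested keyword
    features: Dict[str, np.ndarray],               # visual feature vector for each candidate image id
    query_feature: np.ndarray,                     # visual feature vector of the image in the user query
) -> List[str]:
    """Merge the keyword-based result sets and order them by visual closeness to the query image."""
    merged = {img for results in keyword_result_sets for img in results}

    def distance(img: str) -> float:
        diff = features[img] - query_feature
        return float(diff @ diff)  # squared Euclidean distance; smaller means more similar

    return sorted(merged, key=distance)
```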
In one embodiment, the actions of searching, pre-ranking and re-ranking of images as depicted in the architecture 400 are performed by a computing device like the computing device 200 described above.
Context extraction 560 is performed to obtain contextual information 570 from the document 510. Contextual information 570 is related to and different from the image data contained in the user query 520, and may include a textual element 570a, an image element 570b, a video element 570c or a combination thereof. For example, the textual element 570a may include the text displayed spatially around the user query 520 and the title of the displayed document 510, the image element 570b may include other images displayed in the document 510, and the video element 570c may include one or more frames from a video clip included in the document 510. With contextual information 570, the first subset of images 540 are ranked in a re-ranked order according to similarity between contextual information 570 and at least one attribute of the images of the first subset to provide a second subset of images 550. When displayed to the user, the images of the second subset 550 are displayed in the re-ranked order.
In one embodiment, the actions of searching, pre-ranking and re-ranking of images as depicted in the architecture 500 are performed by a computing device like the computing device 200 described above.
In one embodiment, when the user query includes textual data, such as one or more words, displayed by the computing device, the contextual information includes the text displayed spatially around the user query and the title of the displayed document.
In one embodiment, when the user query includes an image displayed by the computing device, the contextual information includes at least one of a color moment or a shape feature of at least one displayed image other than the user query. In an alternative embodiment, when the user query includes an image or a frame of a video displayed by the computing device, the contextual information includes at least one visual feature of at least one frame of the video displayed by the computing device.
In one embodiment, when receiving at least one other subset of data from the collection of data as contextual information that is related to and different from the user query, the process 800 identifies at least one instance of textual data displayed in a spatial vicinity of the user query, a title of a document that contains data identified as the user query, or a combination thereof as the contextual information when the user query includes an instance of textual data displayed by the computing device. For example, the contextual information may be represented as a vector, each of the identified at least one instance of textual data may be assigned a respective weight according to a respective distance between the user query and the respective instance of textual data, and the identified title of the document may be assigned a weight smaller than the respective weight of each of the identified at least one instance of textual data.
In one embodiment, when receiving at least one other subset of data from the collection of data as contextual information that is related to and different from the user query, the process 800 identifies at least one instance of textual data displayed in a spatial vicinity of the user query, an image file name related to the user query, a title of a document that contains data identified as the user query, at least one displayed image other than the user query, at least one instance of textual data in a spatial vicinity of the at least one displayed image other than the user query, at least one frame of a video clip, or a combination thereof as the contextual information when the user query includes an image displayed by the computing device. For example, the contextual information may be represented as a vector. Each of the identified at least one instance of textual data, each of the at least one displayed image other than the user query, each of the identified at least one instance of textual data in a spatial vicinity of the at least one displayed image other than the user query, and each of the at least one frame of the video clip may be assigned a respective weight according to its respective spatial distance from the user query. The identified title of the document may be assigned a weight smaller than the respective weight of each of the identified at least one instance of textual data. In addition, the identified image file name of the user query may be assigned a weight larger than the respective weight of each instance of textual data as well as the respective weight of each of the at least one displayed image other than the user query.
In one embodiment, when identifying a first subset of data files, the process 800 ranks the first subset of data files in the first order according to similarity between textual data of the user query and textual data of individual data files of the plurality of data files that is related to an image contained in the respective data file.
In another embodiment, when identifying a first subset of data files from a plurality of data files, the data files of the first subset ranked in a first order according to similarity between information contained in the user query and at least one attribute of individual data files of the plurality of data files, the process 800 performs a number of activities. First, at least one instance of textual data related to the user query is identified when the user query includes an image. Next, a respective subset of data files are identified from the plurality of data files for each of the at least one instance of textual data related to the user query based on similarity between the respective instance of textual data related to the user query and textual data of each data file of the respective subset of data files that is related to an image contained in the respective data file. Moreover, data files are selected from each respective subset of data files that are identified for each of the at least one instance of textual data related to the user query to form the first subset of data files. The data files in the first subset of data files are arranged in the first order ranked according to similarity between the image of the user query and at least one image of each data file of the first subset of data files.
In yet another embodiment, when identifying a second subset of data files from the first subset of data files, the process 800 ranks each data file of the first subset of data files by comparing at least one of (1) one or more attributes of each data file of the first subset with a textual element of the contextual information, (2) one or more visual features of an image element and one or more texts surrounding the image element of the contextual information, (3) one or more visual features of a video element of the contextual information, or (4) one or more texts surrounding the video element of the contextual information.
In still another embodiment, when identifying a second subset of data files from the first subset of data files, the process 800 computes a final ranking score for the respective image of each data file of the second subset of data files. A respective first ranking score is computed according to similarity between a textual element of the contextual information and at least one instance of textual data related to the respective image associated with each data file of the second subset of data files. A respective second ranking score is also computed according to similarity between a visual feature and texts surrounding the visual feature of an image element of the contextual information and a respective visual feature of and textual data related to the respective image associated with each data file of the second subset of data files. A respective third ranking score is further computed according to similarity between a visual feature and texts surrounding the visual feature of a video element of the contextual information and a respective visual feature of and textual data related to the respective image associated with each data file of the second subset of data files. Finally, the respective first, second, and third ranking scores are combined, such as summed together for example, to provide the respective final ranking score for the respective image of each data file of the second subset of data files.
In one embodiment, when ranking a plurality of image files to provide a first list of image files in a first order, the process 900 identifies at least one instance of textual data displayed in a spatial vicinity of the user query when the user query includes a displayed image. The plurality of image files are ranked using each of the at least one instance of textual data displayed in a spatial vicinity of the user query to provide at least one pre-ranked list of image files. Further, each of the at least one pre-ranked list of image files is ranked using the displayed image of the user query to provide the first list of image files in the first order.
In one embodiment, when ranking the first list of image files to provide a second list of image files in a second order, the process 900 computes a respective final ranking score for each image file of the first list of image files. First, a respective first ranking score is computed according to similarity between a textual element of the contextual information and at least one instance of textual data related to each image file of the first list of image files. Next, a respective second ranking score is computed according to similarity between a visual feature and texts surrounding the visual feature of an image element of the contextual information and a respective visual feature of and textual data related to each image file of the first list of image files. Furthermore, a respective third ranking score is computed according to similarity between a visual feature and texts surrounding the visual feature of a video element of the contextual information and a respective visual feature of and textual data related to each image file of the first list of image files. Finally, the respective first, second, and third ranking scores are combined to provide the respective final ranking score for each image file of the first list of image files.
In one embodiment, the process 900 receives the user query, which includes a subset of data of the collection of displayed data. The process 900 also extracts at least one other subset of data from the collection of displayed data as the contextual information.
In one embodiment, the process 900 extracts at least one instance of textual data displayed in a spatial vicinity of the user query, a title of a document containing the user query, or a combination thereof as the contextual information when the user query includes an instance of textual data from the collection of displayed data. For example, the contextual information may be represented as a vector. Each of the extracted at least one instance of textual data may be assigned a respective weight according to a respective distance between the user query and the respective instance of textual data. Further, the extracted title of the document may be assigned a weight smaller than the respective weight of each of the extracted at least one instance of textual data.
In one embodiment, the process 900 extracts at least one instance of textual data displayed in a spatial vicinity of the user query, an image file name of the user query, a title of a document containing the user query, at least one displayed image other than the user query, at least one instance of textual data in a spatial vicinity of the at least one displayed image other than the user query, at least one frame of a video clip, or a combination thereof as the contextual information when the user query includes a displayed image from the collection of displayed data. For example, the contextual information may be represented as a vector. Each of the identified at least one instance of textual data, each of the at least one displayed image other than the user query, each of the identified at least one instance of textual data in a spatial vicinity of the at least one displayed image other than the user query, and each of the at least one frame of the video clip may be assigned a respective weight according to its respective spatial distance from the user query. The identified title of the document may be assigned a weight smaller than the respective weight of each of the identified at least one instance of textual data. Additionally, the identified image file name of the user query may be assigned a weight larger than the respective weight of each instance of textual data and the respective weight of each of the at least one displayed image other than the user query.
The above-described techniques pertain to search of images using contextual information related to a user query. Although the techniques have been described in language specific to structural features and/or methodological acts, it is to be understood that the appended claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing such techniques.