The amount of information and content available on the Internet and/or stored on user devices continues to grow exponentially. Given the vast amount of information, search engines have been developed to facilitate searching. In particular, users may search for information and documents by entering search queries comprising one or more terms that may be of interest to the user. After receiving a search query from a user, a search engine identifies documents, web pages, and/or other content that are relevant based on the terms, and search results may be returned in response to the search query. Typically, the search results are provided on a search engine results page (“SERP”).
Users are often searching for information about a particular entity. Entities are instances of abstract concepts and objects, including people, places, things, events, locations, businesses, movies, and the like. Depending on the search query a user inputs or selects, the SERP may not include information about the particular entity the user is searching for, or the information may be difficult to find among the many search results returned.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Embodiments of the present invention relate to determining relevance of entities to search queries using a triangulation approach. The triangulation approach determines the relevance of an entity to a search query as a function of the relevance of search result documents to the search query and relevance of the entity to the search result documents. When a search query is received, search result documents may be identified, and relevance of each search result document to the search query may be determined. Additionally, entities discussed in the search result documents and the relevance of each entity to each search document may also be identified. The relevance of each entity to the search query may be determined based on the relevance of the search result documents to the search query and the relevance of each entity to the search result documents. Entity relevance to the search query may be used when providing a search result experience in response to the received search query.
The present invention is described in detail below with reference to the attached drawing figures, wherein:
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
Embodiments of the present invention are directed to determining relevance of entities to search queries using a triangulation approach. Search engine result pages often contain heterogeneous search results from numerous different document sources. For a given search query and set of search results, the search results may be closely related to a single dominant entity or a set of entities, such as a person, place, song, etc. Embodiments of the present invention determine the dominance of one or more entities to a search query using a triangulation technique, which combines the relevance of an entity to each document and the relevance of each document to the search query. Triangulating the dominant entities in this fashion allows for creating a summarization of the search results that is centered on the most dominant entity or entities for a search query. This summarization may, among other things, provide relevant information about the dominant entity or entities and may reinforce with the user how the search engine interpreted the user's search query.
Accordingly, in one aspect, an embodiment of the present invention is directed to a method for identifying relevance of an entity to a search query. The method includes receiving the search query and identifying a plurality of documents based on the search query. The method also includes determining a relevance of each document to the search query and determining a relevance of the entity to each document. The method further includes determining a relevance of the entity to the search query as a function of the relevance of each document to the search query and the relevance of the entity to each document.
In another embodiment, an aspect is directed to one or more computer storage media comprising computer-useable instructions that, when used by one or more computing devices, cause the one or more computing devices to perform a method. The method includes receiving a search query and identifying a plurality of documents based on the search query. The method also includes, for each document, determining a relevance of the document to the search query, and accessing entity information indexed for the document in a search engine index, the entity information identifying a relevance of each of one or more entities to the document. The method further includes determining a relevance for each of a plurality of entities to the search query, the relevance for each entity to the search query being determined based at least in part on the relevance of the entity to each document and the relevance of each document to the search query. The method still further includes identifying a first entity as a dominant entity based on the relevance for each of the plurality of entities to the search query, and providing a search results page generated based at least in part on identifying the first entity as the dominant entity.
A further embodiment of the present invention is directed to a computerized system that includes one or more processors and one or more computer storage media. The system further includes a document understanding component, a document relevance component, an entity/query relevance component, and a user interface component. The document understanding component is configured to identify one or more entities discussed in each of a plurality of documents and determine a relevance of each entity to each document. The document relevance component is configured to identify a set of relevant documents based on a search query and determine a relevance of each relevant document to the search query. The entity/query relevance component is configured to identify a relevance of one or more entities to the search query based on the relevance of each relevant document to the search query and the relevance of each of the one or more entities to each relevant document. The user interface component is configured to provide a search results page generated at least in part based on the relevance of the one or more entities to the search query.
Having briefly described an overview of embodiments of the present invention, an exemplary operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring initially to
The invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. The invention may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, etc. The invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With reference to
Computing device 100 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 100 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 100. Computer storage media does not comprise signals per se. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.
I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc. The I/O components 120 may provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs may be transmitted to an appropriate network element for further processing. A NUI may implement any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition associated with displays on the computing device 100. The computing device 100 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these, for gesture detection and recognition. Additionally, the computing device 100 may be equipped with accelerometers or gyroscopes that enable detection of motion. The output of the accelerometers or gyroscopes may be provided to the display of the computing device 100 to render immersive augmented reality or virtual reality.
As discussed above, embodiments of the present invention are generally directed to determining the relevance of entities to a search query using a triangulation approach.
In some embodiments, the document analysis performed to identify entities within documents and the relevance of those entities to the documents may be done offline and each document may be “stamped” with the entities mentioned in the document, and each of these “stamps” can include an estimate of the relevance of the entity to the document. In other words, entity information may be indexed by a search engine for documents to indicate the entities mentioned by each document and the relevance of the entities to the documents.
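By way of illustration only, the "stamping" described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the indexing format of any particular search engine: the names `EntityStamp`, `IndexedDocument`, and `stamp_document` are hypothetical, and normalizing the raw scores so that they sum to one is an assumption made so that each stamp can be read as an estimate of P(Entity|Document).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityStamp:
    """One 'stamp': an entity mentioned in a document plus its relevance."""
    entity_id: str
    relevance: float  # assumed to approximate P(Entity | Document)


@dataclass
class IndexedDocument:
    """Hypothetical index record for a document and its entity stamps."""
    url: str
    stamps: List[EntityStamp] = field(default_factory=list)


def stamp_document(url: str, raw_scores: Dict[str, float]) -> IndexedDocument:
    """Attach entity stamps to a document, most relevant entity first.

    Raw scores are normalized to sum to 1 (an assumption of this sketch).
    """
    total = sum(raw_scores.values())
    if total <= 0:
        return IndexedDocument(url)
    stamps = [EntityStamp(e, s / total) for e, s in raw_scores.items()]
    stamps.sort(key=lambda st: st.relevance, reverse=True)
    return IndexedDocument(url, stamps)


# Illustrative values only; the URL and scores are made up.
doc = stamp_document(
    "https://example.com/profile",
    {"Joe Biden": 3.0, "Barack Obama": 6.0, "Michelle Obama": 1.0},
)
```

Because the stamps are computed offline and stored with the document, query-time processing only needs to read them back from the index.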
The triangulation technique may also rely on an estimate of the relevance 204 of documents to a given search query. In terms of conditional probabilities, this is an estimate of P(Document|Query), which is the probability of the document given the search query.
During query time, N search result documents may be returned for a search query received at a search engine. Entities discussed in the search result documents can be identified (e.g., by retrieving information indexed for the documents), and the relevance 206 of each entity to the search query may be determined through a triangulation technique that combines the above-discussed two relevance estimates (i.e., the relevance 202 of entities to documents and the relevance 204 of documents to the search query). This may include an estimate P(Entity|Query), which is the probability of the entity given the query, as represented in the formula below.
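The formula itself is not reproduced in this text. One form consistent with the surrounding description, obtained by summing over the N search result documents and then applying the assumption that P(Entity|Query, Document)=P(Entity|Document), would be the following. The index i and the notation Document_i are introduced here for illustration.

```latex
\begin{align*}
P(\text{Entity} \mid \text{Query})
  &= \sum_{i=1}^{N} P(\text{Entity} \mid \text{Query}, \text{Document}_i)\,
     P(\text{Document}_i \mid \text{Query}) \\
  &\approx \sum_{i=1}^{N} P(\text{Entity} \mid \text{Document}_i)\,
     P(\text{Document}_i \mid \text{Query})
\end{align*}
```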
Note that the above formula may assume that P(Entity|Query, Document)=P(Entity|Document), which is a safe assumption since the relevance of an entity to the document is not dramatically different for any given search query.
In practice, there may be many different techniques employed to derive estimates of the relevance of an entity to a document (i.e., P(Entity|Document)) and the relevance of a document to a search query (i.e., P(Document|Query)). Any and all combinations of these estimates can be leveraged to create many different estimates of the relevance of entities to the search query, and each one of these estimates can be combined using, for instance, supervised machine learning.
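As one hedged illustration of combining several entity/query relevance estimates, the sketch below applies a logistic model to a list of estimates. The description does not specify a model; the logistic form, the function name `combine_estimates`, and the weights and bias (assumed to have been learned offline from labeled data) are all assumptions of this sketch.

```python
import math
from typing import Sequence


def combine_estimates(estimates: Sequence[float],
                      weights: Sequence[float],
                      bias: float = 0.0) -> float:
    """Combine several relevance estimates into a single score in (0, 1).

    The weights and bias are assumed to come from a supervised training
    step that is not shown here.
    """
    if len(estimates) != len(weights):
        raise ValueError("estimates and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, estimates))
    return 1.0 / (1.0 + math.exp(-z))


# Two hypothetical estimates of P(Entity|Query) from different techniques.
combined = combine_estimates([0.55, 0.70], weights=[2.0, 1.5], bias=-1.0)
```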
If the relevance of an entity to a given search query is high enough, the entity may be identified as a dominant entity, and a search results experience may be provided based on the dominant entity. For instance, a search results page may be provided that includes, with other search results, a dominant entity summary area that displays images, facts, and/or other information that gives an overview of the dominant entity. In other instances in which a dominant entity is not identified (e.g., no entity has a sufficiently high relevance), an entity disambiguation search results experience may be provided. For instance, a search results page may be provided that identifies a number of entities and allows the user to select an entity to disambiguate the search.
Referring now to
It should be understood that any number of user computing devices 310 and/or search engines 320 may be employed in the computing system 300 within the scope of embodiments of the present invention. Each may comprise a single device/interface or multiple devices/interfaces cooperating in a distributed environment. For instance, the search engine 320 may comprise multiple devices and/or modules arranged in a distributed environment that collectively provide the functionality of the search engine 320 described herein. Additionally, other components or modules not shown also may be included within the computing system 300.
In some embodiments, one or more of the illustrated components/modules may be implemented as stand-alone applications. In other embodiments, one or more of the illustrated components/modules may be implemented via a user computing device 310, the search engine 320, or as an Internet-based service. It will be understood by those of ordinary skill in the art that the components/modules illustrated in
It should be understood that this and other arrangements described herein are set forth only as examples. Other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.
The user computing device 310 may include any type of computing device, such as the computing device 100 described with reference to
The search engine 320 generally operates to index information regarding documents served by content servers, such as the content server 340, in a search engine index 330 to facilitate providing search results identifying documents on content servers. In some cases, the search engine 320 may alternatively or additionally operate to index information stored on a user computing device 310 to facilitate a user searching for information on the user computing device 310. As used herein, the term “document” may refer to any type of electronic content, such as a web page, image, or video, for which information may be indexed in the search engine index 330.
When the search engine 320 receives search queries from user computing devices 310, the search engine 320 queries the search engine index 330 to identify search results based on the users' search queries and returns those search results to the user devices. In accordance with embodiments of the present invention, the search engine 320 is also configured to, among other things, determine relevance of entities to search queries. Further, the search engine 320 may provide search results generated based at least in part on the entity relevance determination. This may include, for instance, providing search result pages that provide entity summary information and/or entity disambiguation options based on entity relevance determinations.
As illustrated, in various embodiments, the search engine 320 includes a user interface component 322, a document understanding component 324, a document relevance component 326, and an entity/query relevance component 328. The illustrated search engine 320 also has access to a search engine index 330. As noted above, the search engine index 330 stores information about documents to facilitate providing search results. In accordance with embodiments, the information stored for documents may include entity information, including identification of entities discussed within the documents and the relevance of the entities to the documents. It will be understood and appreciated by those of ordinary skill in the art that the information stored by the search engine index 330 may be configurable and may include any information relevant to search queries/terms/histories, entity identifications, entities, and metadata associated with the entities. The content and volume of such information are not intended to limit the scope of embodiments of the present invention in any way. Further, though illustrated as a single component, the search engine index 330 may, in fact, be a plurality of storage devices, for instance a database cluster, portions of which may reside in association with the user computing device 310, another external computing device (not shown), and/or any combination thereof.
The document understanding component 324 is configured to analyze documents (e.g., documents crawled on content servers, such as content server 340) to identify entities discussed or otherwise referenced in the documents. Additionally, the document understanding component 324 may operate to determine the relevance of a given entity referenced in a document to the document. Any number of different approaches could be used to identify an entity within a document and determine the relevance of the entity to the document. By way of example only and not limitation, relevance determination may employ multinomial naïve Bayes or latent Dirichlet allocation techniques. In some embodiments, a single approach may be used for entity identification and/or relevance determination. In other embodiments, multiple approaches may be used in combination to derive the entity relevance. The document understanding component 324 may identify one or more entities referenced within a given document and may determine a relevance for each of those entities to the document. For instance, a web page primarily discussing Barack Obama may mention other people, such as Joe Biden and Michelle Obama. The document understanding component 324 may identify each of these entities discussed on the web page and also determine a relevance of each entity to the web page. Because the web page is primarily discussing Barack Obama, the relevance determination would be greatest for Barack Obama and lower for the other people discussed on the web page.
While document understanding could be performed at run time after a search query has been received, in some embodiments, the document understanding component 324 may operate as an offline component to analyze documents and index information in the search engine index 330. In particular, information may be stored in the search engine index 330 in association with indications of documents to identify entities relevant to each document and the corresponding relevance of each entity to each document. The search engine index 330 may be continuously and/or periodically refreshed with information as the search engine 320 analyzes new documents and/or re-analyzes previously indexed documents.
When a search query is received from a user computing device 310, for instance, via the user interface component 322, the document relevance component 326 operates to determine the relevance of search result documents to the received search query. In particular, the search engine index 330 is queried to identify relevant search result documents. The relevance of each of those documents to the search query may be determined based on any of a variety of different search algorithms/approaches. In some cases, a single search algorithm/approach may be employed, while in other instances, multiple search algorithms/approaches may be used in combination to determine the relevance of each document to the search query. By way of example and not limitation, the search approach may employ various statistical techniques and/or machine learning techniques to generate relevance estimates based on various signals. The relevance estimate for a given document may be an estimate of, for instance, a probability a user is going to select the document and/or what relevance a panel of human judges would give to the document given the search query.
The entity/query relevance component 328 identifies entities referenced by the search result documents for the received search query (based on the document understanding component 324 and/or information indexed in the search engine index 330). Additionally, the entity/query relevance component 328 determines a relevance of each entity to the search query. Generally, for a given entity, the relevance of the entity to the search query may be determined as a function of the relevance of the entity to each search result document (as determined by the document understanding component 324 and/or indexed in the search engine index 330) and the relevance of each search result document to the search query (as determined by the document relevance component 326).
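The combination described above can be sketched as a sum, over the search result documents, of the product of the two relevance estimates. This is a minimal sketch under stated assumptions: both inputs are taken to be probability-like values in [0, 1], and the function name, dictionary layout, document identifiers, and numeric values are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def entity_query_relevance(
    doc_query_rel: Dict[str, float],
    entity_doc_rel: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Estimate P(Entity|Query) as sum over docs of P(Entity|Doc) * P(Doc|Query).

    doc_query_rel maps document id -> relevance of the document to the query.
    entity_doc_rel maps document id -> {entity -> relevance of entity to doc}.
    """
    scores: Dict[str, float] = defaultdict(float)
    for doc_id, p_doc in doc_query_rel.items():
        # Documents with no indexed entity information contribute nothing.
        for entity, p_entity in entity_doc_rel.get(doc_id, {}).items():
            scores[entity] += p_entity * p_doc
    return dict(scores)


# Illustrative values for an ambiguous query such as "jaguar".
doc_query_rel = {"d1": 0.5, "d2": 0.3, "d3": 0.2}
entity_doc_rel = {
    "d1": {"jaguar_animal": 0.9, "jaguar_cars": 0.1},
    "d2": {"jaguar_cars": 1.0},
    "d3": {"jaguar_animal": 0.5, "jaguar_team": 0.5},
}
scores = entity_query_relevance(doc_query_rel, entity_doc_rel)
```

With these made-up inputs, the animal receives the largest share of the relevance, followed by the car manufacturer and then the team.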
The entity/query relevance information determined by the entity/query relevance component 328 may be employed in the process of selecting search result information in response to a search query, which may be returned to a user computing device 310 via the user interface component 322. In some embodiments, a single entity may be identified as a dominant entity based on the entity/query relevance information. An entity may be identified as a dominant entity in a number of different manners. In some cases, an entity with the highest relevance to the search query is identified as the dominant entity. In other cases, an entity is determined to be the dominant entity only if the entity has the highest relevance to the search query and the entity's relevance to the search query exceeds a relevance threshold (predetermined or dynamic). In further cases, an entity may be determined to be the dominant entity only if the entity's relevance to the search query is significantly greater than the relevance for all other entities. Any and all combinations and variations thereof are contemplated to be within the scope of embodiments of the present invention.
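The selection rules above (highest relevance, a relevance threshold, and a margin over the other entities) can be combined in one small function. This is a sketch only: the function name and the default values for `min_relevance` and `min_margin` are illustrative assumptions, not values taken from the description.

```python
from typing import Dict, Optional


def pick_dominant_entity(scores: Dict[str, float],
                         min_relevance: float = 0.5,
                         min_margin: float = 0.2) -> Optional[str]:
    """Return the dominant entity, or None if no entity qualifies.

    An entity qualifies if it has the highest relevance, meets the
    relevance threshold, and leads the runner-up by at least min_margin.
    Setting both parameters to 0 reduces this to 'highest relevance wins'.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_entity, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score < min_relevance:
        return None
    if top_score - runner_up < min_margin:
        return None
    return top_entity


clear_winner = pick_dominant_entity({"a": 0.7, "b": 0.2, "c": 0.1})
ambiguous = pick_dominant_entity({"a": 0.4, "b": 0.35, "c": 0.25})
```

A `None` result corresponds to the case in which no dominant entity is identified and a different search result experience may be chosen.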
Identification of a dominant entity may be used to generate search result information provided in response to the search query in a variety of different ways. For instance, entity summary information may be provided in addition to a search result listing on a search results page. An example of this is illustrated in
The identification of the dominant entity could also be used to affect the search results provided. For instance, the ordering of search results returned could be based in part on the relevance of the dominant entity to each search result document. This could include providing increased ranking to search result documents for which the dominant entity has a higher relevance.
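One hedged way to picture such a ranking adjustment is to scale each document's base relevance by the dominant entity's relevance to that document. The multiplicative form, the `boost` parameter, and the function name below are assumptions of this sketch; the description does not prescribe a particular scoring rule.

```python
from typing import Dict, List, Tuple


def rerank_by_dominant_entity(
    results: List[Tuple[str, float]],
    entity_doc_rel: Dict[str, Dict[str, float]],
    dominant: str,
    boost: float = 0.5,
) -> List[Tuple[str, float]]:
    """Reorder (doc_id, base_relevance) pairs, favoring documents for which
    the dominant entity has a higher relevance."""
    def adjusted(result: Tuple[str, float]) -> float:
        doc_id, base = result
        entity_rel = entity_doc_rel.get(doc_id, {}).get(dominant, 0.0)
        return base * (1.0 + boost * entity_rel)

    return sorted(results, key=adjusted, reverse=True)


# Illustrative values only.
results = [("d1", 0.5), ("d2", 0.45), ("d3", 0.2)]
entity_rel = {"d1": {"X": 0.1}, "d2": {"X": 1.0}}
reranked = rerank_by_dominant_entity(results, entity_rel, dominant="X")
```

Here document d2 moves ahead of d1 because the dominant entity is far more relevant to d2, even though d1 has the slightly higher base relevance.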
In other embodiments, instead of identifying a dominant entity, multiple entities may be selected. This may occur in situations in which a dominant entity may not be present based on the entity/query relevance information, such as when the search query is ambiguous. For instance, a search query “jaguar” may be ambiguous as the user could be searching for information regarding the animal, the car manufacturer, the NFL football team, or some other entity. In such situations, multiple entities may have a relevance to the search query that exceeds some threshold or no entities may have a relevance to the search query that exceeds the threshold.
When multiple entities are selected, a number of search result experiences could be provided. In some instances, summary information may be provided for each of the selected entities in conjunction with a list of search results. This may depend on the number of entities selected and the screen space available for presenting the summary information. In some instances, a disambiguation experience may be provided. For instance, search result listings may be aggregated into different entity groups based on entity relevance for each search result document. Additionally or alternatively, user-selectable options may be provided that allow the user to make a disambiguation choice, selecting one of the identified entities for which the user is seeking information. A search result experience could be provided based on the user's selection, such as a search results page with summary information for the selected entity and/or search results selected and/or ordered based on the selected entity.
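The aggregation of search result listings into entity groups might be sketched as follows. This is an illustration under stated assumptions: each document is assigned to the selected entity with the highest relevance to that document, ties go to the entity listed first, and documents that mention none of the selected entities are returned separately. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def group_results_by_entity(
    result_ids: List[str],
    entity_doc_rel: Dict[str, Dict[str, float]],
    selected_entities: List[str],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group result documents under the selected entity most relevant to each.

    Returns (groups, ungrouped); result order is preserved within each group.
    """
    groups: Dict[str, List[str]] = {e: [] for e in selected_entities}
    ungrouped: List[str] = []
    for doc_id in result_ids:
        rel = entity_doc_rel.get(doc_id, {})
        best = max(selected_entities, key=lambda e: rel.get(e, 0.0), default=None)
        if best is not None and rel.get(best, 0.0) > 0.0:
            groups[best].append(doc_id)
        else:
            ungrouped.append(doc_id)
    return groups, ungrouped


# Illustrative values for the ambiguous query "jaguar".
entity_rel = {
    "d1": {"jaguar_animal": 0.9, "jaguar_cars": 0.1},
    "d2": {"jaguar_cars": 1.0},
    "d3": {"jaguar_animal": 0.5, "jaguar_team": 0.5},
    "d4": {},
}
groups, ungrouped = group_results_by_entity(
    ["d1", "d2", "d3", "d4"], entity_rel, ["jaguar_animal", "jaguar_cars"]
)
```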
With reference now to
The relevance of a particular entity to identified documents is determined at block 508. In some embodiments, the relevance of entities to documents may be determined in a background or offline process, and information regarding the entity relevance may be stored in a search engine index, such as the search engine index 330 of
The relevance of the particular entity to the search query is determined at block 510 as a function of the relevance of the documents to the search query and the relevance of the particular entity to the documents. The relevance of the entity to the search query may be used in returning search results in response to the search query. For instance, the relevance of the entity to the search query may be used to identify the entity as a dominant entity and a search result experience returned based on the entity being identified as a dominant entity. In other embodiments, the entity may be selected with one or more other entities based on relevance of the entities to the search query, and a disambiguation search result experience may be provided based on those entities.
Turning now to
Indexed entity information is retrieved at block 608 for each identified document and/or each document with relevance to the search query above a certain threshold (or other subset of identified documents). In particular, the documents may have been processed previously to identify entities discussed in the documents and to calculate the relevance of each entity discussed in each document to the document in which it is discussed. As such, the search engine index may identify, for each document, each entity discussed in the document and the relevance of each entity to the document.
The relevance of each entity to the search query is determined at block 610 as a function of the relevance of the documents to the search query and entity information accessed at block 608. The entity information used to determine the relevance of each entity to the search query includes a relevance of each entity to the documents. A dominant entity is determined at block 612 based on each entity's relevance to the search query. A dominant entity may be identified in a number of different ways within the scope of embodiments of the present invention. For example, an entity with the greatest relevance to the search query may be identified as the dominant entity. In some cases, the entity must have a relevance to the search query that exceeds a relevance threshold to be considered the dominant entity.
A search results page generated at least in part based on the dominant entity is provided at block 614. In some embodiments, an entity summary area may be included on the search results page to provide general information about the dominant entity. The entity summary area may be provided in addition to search results selected based on the search query. In some embodiments, the search result selection and/or ranking (i.e., ordering) may be based at least in part on the dominant entity. For instance, search results for documents for which the dominant entity has a higher relevance may be given greater ranking so the search results appear higher in the search result listing.
As can be understood, embodiments of the present invention provide a triangulation approach for estimating the relevance of entities to a given search query as a function of the relevance of search result documents to the search query and relevance of the entities to the search result documents. The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and inherent to the system and method. It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations. This is contemplated by and is within the scope of the claims.