In formulating requests for information, for instance, in formulating search queries for searches of networked resources such as searches conducted using the Internet, entities are often referred to ambiguously, and a request for information about one entity often results in information pertaining to multiple entities having similar or identical entity names. As users are generally looking for information about only one of the multiple entities, much of the information returned as a result of the information request is not relevant to the user.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Embodiments of the present invention relate to systems, methods, and computer-readable storage media for disambiguating entity names by identifying query terms associated with certain entities (such as people, places, or products, among other things) based on, for instance, user selection of Uniform Resource Locators (URLs). Queries are analyzed based on user selection of a particular URL, a quantity of user selections associated with the particular URL, and a total number of user selections of other URLs, in response to execution of the query. Once a particular query is associated with a particular URL and, accordingly, with a particular entity, upon receipt of the particular query, information (e.g., search results, images to supplement search results, advertising, or the like) that is associated with the appropriate entity may be returned providing more relevant information to the user.
The present invention is illustrated by way of example and not limited in the accompanying figures in which:
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document in conjunction with other present or future technologies. Although the terms “step” and/or “block” may be used, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed. Various aspects of the technology described herein are generally directed to systems, methods, and computer-readable storage media for, among other things, identifying queries that correspond to certain items or entities. In embodiments, items or entities can include objects such as people, places, characters, and products, such as goods or services, etc., as more fully described below.
Embodiments of the present invention associate search queries or search query terms with particular entities. Multiple entities (that is, entity identifiers) and multiple website addresses (Uniform Resource Locators or URLs) are received by the system. At least a portion of the received website addresses are associated with a particular entity. To identify a particular search term associated with a particular website address (and thus with a particular entity), the system logs search terms and selections made by users, and associates particular search terms with particular entities based on the user selections. In an embodiment, a quantity of user selections of particular website addresses is logged. An identity of a user (or client computing device) making a user selection may also be logged such that a maximum quantity of user selections made by the same user or client computing device may be logged, if desired. In embodiments, information is selected for display based on a search term and its association with a particular website address and, thus, a particular entity.
As more fully described below, embodiments include computer-readable storage media storing instructions that cause one or more devices to select a disambiguated name for an entity. A server, indexer, or crawler-type component receives web pages associated with entities and a set of queries associated with the web pages. The entities may be proper nouns, people, places, characters, titles, slogans, or products, or the like. Such entity identifiers are not intended to limit the scope of embodiments of the present invention, however. Any and all such variations, and any combination thereof, are contemplated to be within the scope of embodiments hereof. In an exemplary embodiment, a first search query is identified as associated with a particular URL (and thus a particular entity) based on user selections of web pages after the first query has been executed. In embodiments, user selections may be weighted.
In embodiments, the first query may be ranked as the highest query associated with an entity, and at least a portion of the query may be stored as a disambiguated name for the entity. The first query can be ranked higher, in embodiments, based on a first quantity of user selections of a particular web page compared to a second quantity of user selections of one or more other web pages associated with the first query. In embodiments, the first query may be used to retrieve an image for display. The image can supplement or accompany search results based on another query, such as a similar or related query, in order to provide an image associated with a particular entity.
In another embodiment, a method for identifying one or more search queries includes receiving a plurality of queries, including a first query, and receiving a plurality of URL selections associated with at least the first query. A subset of URL selections is determined for the first query, and a quantity of user selections that correspond to a first URL selection is determined. A ratio is determined of (1) the quantity of user selections corresponding to the first URL selection to (2) the total quantity of user selections associated with the query (the total quantity available in memory or within the relevant server logs, etc.). Either quantity of user selections may be filtered for noise and/or to filter user selections originating from the same user or client computing system. In embodiments, a score is determined for each query with respect to the URL “Ui.” The score may be determined by multiplying each ratio by the quantity of user selections corresponding to the first URL selected.
For a second query, a second subset of URL selections is determined, which also includes the first URL selection (mentioned above). The quantity of URL selections corresponding to the first URL selection and the second query is determined. A second ratio is determined, which is the quantity of user selections compared to a total quantity of URL selections associated with the second query. A score is determined for the second query based on multiplying the second ratio by the quantity of URL selections corresponding to the first URL selection and the second query (determined above). The first and second queries may then be ranked relative to one another based on their respective scores. In response to a request for information about an entity or related to an entity, the first query can be executed. The request for information can be a request for an advertisement, such as a link, image, or product placement. A request can be made by a user or automatically by code or other instructions, based on an available advertising space in embodiments.
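The ratio, score, and ranking steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the claimed implementation: the function names, the dictionary layout, and the click counts are all hypothetical and chosen for the example.

```python
from collections import defaultdict


def dedication_scores(click_counts, url):
    """Return {query: (ratio, score)} for one URL.

    click_counts maps (query, url) -> quantity of user selections.
    The layout is an assumption made for this sketch.
    """
    # Total quantity of selections per query, across all URLs.
    totals = defaultdict(int)
    for (query, _), count in click_counts.items():
        totals[query] += count

    results = {}
    for (query, clicked_url), count in click_counts.items():
        if clicked_url != url or totals[query] == 0:
            continue
        ratio = count / totals[query]           # selections of this URL / all selections
        results[query] = (ratio, ratio * count)  # score = ratio x quantity of selections
    return results


def rank_queries(click_counts, url):
    """Order queries by descending score with respect to one URL."""
    scores = dedication_scores(click_counts, url)
    return sorted(scores, key=lambda q: scores[q][1], reverse=True)


# Hypothetical counts, invented for illustration only.
clicks = {
    ("q1", "url_a"): 80,
    ("q1", "url_b"): 20,
    ("q2", "url_a"): 30,
    ("q2", "url_c"): 10,
}
ranking = rank_queries(clicks, "url_a")
```

With these invented counts, the first query has a ratio of 0.8 and a score of 64, the second query has a ratio of 0.75 and a score of 22.5, and the first query is therefore ranked above the second.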
Accordingly, in one embodiment, a system is provided for associating search terms with entities. The system includes an entity-receiving component that receives a plurality of entities; an address-receiving component that receives a plurality of addresses, each of the plurality of addresses being associated with one of the plurality of entities; a logging component that logs one or more submitted search terms and one or more user selections; and an associating component that associates a first search term of the plurality of search terms with a first entity of the plurality of entities based on the one or more user selections.
In another embodiment, the present invention is directed to one or more computer-readable storage media storing computer-useable instructions that, when used by one or more computing devices, cause the one or more computing devices to perform a method for selecting a disambiguated name for an entity. The method includes receiving a plurality of web pages, each of at least a portion of the plurality of web pages being associated with a respective entity of a plurality of entities; receiving a plurality of search queries, each of at least a portion of the plurality of search queries being associated with a respective one of the plurality of web pages; determining that a first search query of the plurality of search queries is associated with a first entity based on one or more user selections of an associated web page of the plurality of web pages in response to execution of the first search query; ranking the first search query as the highest ranked search query associated with the first entity; and storing said first search query as the disambiguated name for the first entity.
In yet another embodiment, the present invention is directed to one or more computer-readable storage media storing computer-useable instructions that, when used by one or more computing devices, cause the one or more computing devices to perform a method for identifying one or more search queries. The method includes receiving a plurality of queries including a first query; receiving a plurality of URL selections, each of the plurality of URL selections being associated with at least one query of the plurality of queries; for the first query, determining a first subset of URL selections; for a first URL selection of the first subset of URL selections, determining a first quantity of URL selections that correspond to the first URL selection and to the first query; and determining a first ratio of the first quantity of URL selections to a total quantity of URL selections associated with the first query.
Having briefly described an overview of embodiments of the present invention, an exemplary operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring to the figures in general and initially to
Embodiments of the invention may be described in the general context of computer code or machine-useable instructions, including computer-useable or computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. Embodiments of the invention may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, and the like. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With continued reference to
The computing device 110 typically includes a variety of computer-readable media. Computer-readable media may be any available media that is accessible by the computing device 110 and includes both volatile and nonvolatile media, removable and non-removable media. Computer-readable media comprises computer storage media and communication media; computer storage media excluding signals per se. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 110.
Communication media, on the other hand, embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
The memory 114 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, and the like. The computing device 110 includes one or more processors that read data from various entities such as the memory 114 or the I/O components 120. The computing device 110 can be in communication with exemplary client devices 122 and 124 through any type of wired or wireless connection 126, including the Internet or an intranet.
The I/O ports 118 allow the computing device 110 to be logically coupled to other devices including the I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, and the like. Aspects of the subject matter described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a mobile device. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. Aspects of the subject matter described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Furthermore, although the term “search engine” may be used herein, it will be recognized that this term may also encompass a server, a Web browser, a set of one or more processes distributed on one or more computers, one or more stand-alone storage devices, a set of one or more other computing or storage devices, a combination of one or more of the above, and the like.
The client devices 122 and 124 include interface displays 128 and 130, respectively. Exemplary interface displays include screens, speakers, printing components, and the like. The interface displays 128 and 130 may be remote from client devices 122 and 124. In an embodiment, computing device 110 has access to stored information, including source or entity URL information 132, query information 134, and click count information 136. The entity URL information 132, query information 134, and the click count information 136 can be stored at the computing device 110, or made available based on a connection to, or request from, the computing device 110. The information can be remote, from a third-party, or anonymous, and it can be obtained at any time.
In an embodiment, the query information 134 and click count information 136 are obtained or requested from one or more remote databases 138, 140. As illustrated, the computing device 110 includes an entity-receiving component 142, an address-receiving component 144, a logging component 146, an associating component 148 and an information selection component 150. One or more components described herein can be located on one or more computing devices, such as computing device 110, which can be distributed and/or available through remote connections.
As previously mentioned, embodiments of the present invention relate to systems, methods, and computer-readable storage media for, among other things, determining a query that corresponds to an entity. As discussed above, an item or entity can include objects such as people, places, characters, and products, such as goods or services. In one example, for the entities Will Smith (the actor) and Will Smith (the football player), preferred or effective queries are determined. For example, the highest ranked query for Will Smith (the actor) may be “Will Smith,” while the highest ranked query for Will Smith (the football player) may be “Will Smith defensive end.” These queries can be ranked based on the number of times that users selected certain URLs, which are associated with certain entities, after submitting queries for one of the Will Smith entities.
In embodiments, certain URLs are known to be associated with certain entities, based on prior analysis, crawling, tagging information, or other stored information. These URLs can be considered entity URLs 132, or “source URLs,” with higher-quality content about the entity or a higher-confidence match with the correct entity (e.g., the URL for the actor Will Smith's official web page). These entity URLs 132 can be manually selected or selected based on search terms, which may later be refined by user feedback or other input. Server logs or other stored user-behavior information is analyzed for users' query information 134 and “click count” information 136 in embodiments of the present invention.
As shown below in Table 1, an analysis is performed for each “source URL” 132 represented by “Ui,” whereby all of the query information 134 associated with “Ui” in the server logs (or other memory associated with executed searches and accessible by computing device 110) is analyzed. Each query that was executed and resulted in a user selection of the URL “Ui” is analyzed to determine how many user selections occurred, from Query 1 (“Q1”) through any quantity of queries. The quantity of user selections of the URL “Ui” after each query is shown in column three (“Count”), and it is based on click count information 136. This count can be filtered to eliminate noise or multiple user selections from the same device (such as client devices 122 and 124), or to eliminate multiple selections from the same account or household. Additionally, the count can include weighted user selections based on a user's own account or history, users who speak the same language or live in a certain area, users with demographic commonalities, or other user or user-device activity. The click count does not necessarily involve literal “clicks” by a mouse; the click count indicates the quantity of user selections of certain web pages, links, or addresses by any method, including tapping, holding, voice commands, etc. The associations shown in Table 1 can be determined by computing device 110 for multiple URLs, based on query information 134 and click count information 136.
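The per-URL analysis described above amounts to grouping logged selections by query for one source URL. The following Python sketch shows one way this could be done; the log format (a list of query and clicked-URL pairs), the function name, and the example URLs are assumptions made for illustration and do not come from the specification.

```python
from collections import Counter


def queries_for_url(click_log, source_url):
    """Count, per query, how often users selected source_url after that query.

    click_log is assumed to be an iterable of (query, clicked_url) pairs.
    """
    counts = Counter()
    for query, clicked_url in click_log:
        if clicked_url == source_url:
            counts[query] += 1
    return counts


# Hypothetical log entries, invented for illustration only.
log = [
    ("query one", "https://example.com/entity"),
    ("query one", "https://example.com/entity"),
    ("query two", "https://example.com/entity"),
    ("query one", "https://example.org/other"),
]
table_1 = queries_for_url(log, "https://example.com/entity")
```

Each key of the resulting counter corresponds to a query row for the source URL, and each value corresponds to the “Count” column; selections of other URLs are ignored at this stage.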
Referring now to
As shown in Table 2 below, for each query (e.g., “Q1”), the corresponding URL and click count information 136 is determined and shown in columns two and three. In the last column, a dedication ratio is determined for the query with respect to each URL. The dedication ratio can indicate how closely a query is associated with a URL, including a source or entity URL. The dedication ratio is determined according to a formula, where the dedication ratio is equal to the click count for one URL divided by the sum of click counts for all URLs associated with the same query. For example, the dedication ratio Ri,1 is based on dividing C1 (the click count for the first URL) by the sum of all click counts for Q1 (for any URL). The associations shown in Table 2 can be determined by a computing device 110 for query “Qi” and multiple URLs. The number of URLs, table entries, or fields shown in Tables 1, 2, and 3 is scalable to any number. The associations in Table 2 can be determined for multiple queries based on query information 134 and click count information 136.
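The dedication ratio described in words above can be written compactly as follows. The index n (the number of URLs selected after query Qi) and the summation index k are notational assumptions introduced here; the symbols R and C follow the text.

```latex
% Dedication ratio of query Q_i with respect to URL U_j,
% where C_k is the click count for URL U_k after query Q_i.
R_{i,j} = \frac{C_j}{\sum_{k=1}^{n} C_k}, \qquad 1 \le j \le n
```

Because every click count is non-negative, each ratio lies between 0 and 1, and the ratios for a single query sum to 1 whenever at least one selection was logged for that query.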
With reference to
As shown in Table 3 below, a selected or clicked-on URL, such as “Ui,” is associated with each query that was executed and resulted in a click on, or a user selection of, “Ui.” The URL “Ui” is also associated with a click count, as shown in column three of Table 3, and a dedication ratio (column four) and a dedication score (column five), discussed below. The data associations stored in Table 3 can be determined and repeated by a computing device for multiple queries (an unlimited number from “Q1” through “Qm”) and multiple URLs. As more fully described with respect to
With reference to
Turning now to
In embodiments, the logging component 146 is further configured to log the quantity of the selections made by the users (as indicated at block 518), and to log a quantity of user selections for each of the one or more addresses (as indicated at block 520). In embodiments, the logging component 146 is also configured to log a quantity of user selections for each of the addresses based on considering a limited number of user selections for each of the client computing devices 122 and 124, as indicated at block 522. Selections from a client computing device can be filtered to limit the quantity of clicks considered per user or per user computer, or to limit non-unique or repeat visits. The clicks can be removed or filtered at the time of counting or data collection, or at the time that the quantity of clicks is considered (in other words, the clicks can be collected and filtered at a later time).
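Limiting the number of selections considered per client device can be sketched as a simple capped count. The example below is a hypothetical illustration only: the record layout (device identifier, query, URL), the function name, and the default cap of one selection per device are assumptions, not details taken from the specification.

```python
from collections import Counter, defaultdict


def capped_click_counts(click_log, max_per_device=1):
    """Count selections per (query, url), considering at most
    max_per_device selections from any one device for a given pair."""
    seen_per_device = defaultdict(int)
    counts = Counter()
    for device_id, query, url in click_log:
        key = (device_id, query, url)
        if seen_per_device[key] < max_per_device:
            seen_per_device[key] += 1
            counts[(query, url)] += 1
        # Selections beyond the cap are ignored (filtered at counting time).
    return counts


# Hypothetical log: device "d1" repeats the same selection three times.
log = [
    ("d1", "q", "u"),
    ("d1", "q", "u"),
    ("d1", "q", "u"),
    ("d2", "q", "u"),
]
capped = capped_click_counts(log)                     # cap of 1 per device
relaxed = capped_click_counts(log, max_per_device=2)  # cap of 2 per device
```

Here the filter runs while counting. Because the function only reads a stored log, the same filter could equally be applied later, when the counts are actually needed.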
The exemplary system includes an associating component (shown at 148 of
With reference to
A first query is determined from a set of queries associated with the first entity, based on selections of a first web page after an execution of a query, as indicated at block 616. The selections of certain web pages after execution of a query can be stored in server logs, derived from server or search query logs, or obtained from other databases, such as databases (e.g., databases 138, 140 of
As indicated at block 618, the first and highest ranked query according to an embodiment is determined, based on comparing the quantity of selections of a first web page to a quantity of selections of the other web pages combined with the first quantity of selections (in other words, comparing the quantity of selections of a first web page to the quantity of all selections of web pages for a particular query). The first query is ranked as the highest ranked query associated with the first entity in an embodiment, as indicated at block 620. The ranking can be based on selections of one or more certain web pages or website addresses after executing queries. The ranking can also be based on other factors or considerations, alone or in combination, such as click count information 136, and dedication ratios and/or scores based on click count information 136. As indicated at block 622, the first query is stored as the disambiguated name for the first entity.
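Selecting and storing the highest ranked query for each entity can be sketched as a lookup over precomputed scores. The sketch below assumes that each entity has a single known source URL and that a score has already been computed for each query and URL pair; the function name, identifiers, URLs, and score values are all hypothetical.

```python
def select_disambiguated_names(entity_urls, scores):
    """Pick the highest-scoring query for each entity's source URL.

    entity_urls: {entity_id: source_url}   (assumed one URL per entity)
    scores:      {(query, url): score}     (assumed precomputed)
    Returns {entity_id: query} for entities with at least one scored query.
    """
    names = {}
    for entity_id, url in entity_urls.items():
        candidates = {q: s for (q, u), s in scores.items() if u == url}
        if candidates:
            names[entity_id] = max(candidates, key=candidates.get)
    return names


# Hypothetical identifiers, URLs, and scores, invented for illustration only.
entity_urls = {
    "will_smith_actor": "https://example.com/actor",
    "will_smith_football": "https://example.com/football",
}
scores = {
    ("Will Smith", "https://example.com/actor"): 64.0,
    ("Will Smith movies", "https://example.com/actor"): 20.0,
    ("Will Smith", "https://example.com/football"): 1.0,
    ("Will Smith defensive end", "https://example.com/football"): 40.0,
}
names = select_disambiguated_names(entity_urls, scores)
```

With these invented scores, the stored name for the actor entity is “Will Smith” and the stored name for the football-player entity is “Will Smith defensive end.”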
In embodiments, the first query can be used to retrieve an image for display, as indicated at block 624 (for instance, utilizing information selection component 150 of
In this example, textual or multimedia search results are supplemented by an image, where the disambiguated query is used to request the image. As an example, see search results 912 and image 914 in
The exemplary method 700 in
As described above, for the entity Will Smith (the actor) and the entity Will Smith (football player), the most preferred or most effective query for each of these entities can be determined. By analyzing query information 134 and click count information 136, it can be determined which query is the most likely to lead to the entity Will Smith (football player). For users that clicked on URLs known to be associated with Will Smith (football player), the quantity of user selections can be analyzed, and the underlying queries submitted by the users can be analyzed. By calculating the dedication ratio and the dedication score as described above, it can be determined that the query “Will Smith defensive end” is the most preferred query for obtaining information about the entity Will Smith (football player) in an embodiment of the present invention.
Several search terms or queries for entities can be ambiguous or yield search results associated with more than one entity, even among different types of entities. For example, the query “George Washington” can be ambiguous with respect to the first president of the United States and the university with the same name. In another example, the query “Hotel California” can be ambiguous with respect to the song by that title and the movie with the same name. In some cases, only one possible interpretation may be associated with an entity that is a proper noun. Embodiments of the present invention can be used to determine the most preferred or most highly-ranked query for the entity that is a proper noun (or, alternatively, for a non-proper noun entity). For example, the search term “tide” could be associated with the natural phenomenon of the ocean tides or the laundry detergent, Tide®.
In embodiments, the query information 134 and the click count information 136 can continually be updated based on new information, in order to provide dynamic dedication ratios and scores. In embodiments, any clicks that are associated with an overriding of, or a disagreement with, the most preferred query for an entity can be used as feedback to update ratios and scores (and can be weighted with respect to one user or client device, with respect to users in a certain area or that fulfill certain other criteria, or with respect to all users). Embodiments of the present invention can designate areas or users as affected by language-based nuances or preferences, which can affect the scores or the weighting of scores when determining preferred queries. In one example, clicks by certain users are weighted based on demographic information, such as commonalities with a current user, such as being in the same age group or of the same gender.
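Weighting selections by demographic commonality with a current user can be sketched as a weighted sum in place of a plain count. In the sketch below, the record fields, the function name, and the weight of 2.0 for users in the same age group are assumptions chosen purely for illustration; the specification does not prescribe particular weights.

```python
def weighted_click_count(clicks, current_user, same_group_weight=2.0):
    """Sum selections, weighting those made by users in the current
    user's age group more heavily than other selections (weight 1.0)."""
    group = current_user.get("age_group")
    total = 0.0
    for click in clicks:
        shares_group = group is not None and click.get("age_group") == group
        total += same_group_weight if shares_group else 1.0
    return total


# Hypothetical selections, invented for illustration only.
clicks = [
    {"age_group": "25-34"},
    {"age_group": "25-34"},
    {"age_group": "55-64"},
]
weighted = weighted_click_count(clicks, {"age_group": "25-34"})
```

The weighted total can then be used wherever a raw click count would otherwise feed a ratio or score, so that updated weights flow through to the ratios and scores.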
Queries or search terms may also be weighted or otherwise affected by additional criteria during use of embodiments of the invention. For example, queries can be weighted by length, uniqueness, number of languages used, reading level, or the presence or strength of additional terms. In embodiments, a query or search term can be one word in length or consist of more than one word, including phrases, distinct terms, and/or numerical or non-alpha-based characters.
Embodiments of the present invention include determinations by computing device 110 regarding preferred or effective query information. Effective queries can be the most likely to lead to a link relating to the correct entity, or a photo or image relating to the entity, and the queries can be used to identify advertising opportunities for product or service entities (or location or title-based entities, such as cities to visit and books or movies to purchase). In embodiments, the preferred or optimized query information can be obtained without the need to crawl content on web pages, saving server time and energy.
An optimal query can be used to generate images for further selection by a user who is searching for an entity, or to generate a photo or image for display next to a search result. For example, during a search for Will Smith and any football associated term (such as a search for “Will Smith football”), the preferred query “Will Smith defensive end” could be used to request a link to content, an image of Will Smith, or an advertisement related to Will Smith (football player). The queries can be used to create a disambiguation page or index, or to cluster relevant results close to each other or in an organized manner.
In an exemplary embodiment, a search has been executed by a user, which returned search results 912. The top or prominent search result 916 can be based on a disambiguated query. In an embodiment, image 914 is based on the disambiguated query, while the remaining search results 912 are based on the ambiguous or original query. The entity URL 810 in
As can be understood, embodiments of the present invention provide systems and methods for disambiguating entity names. The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.
It will be understood by those of ordinary skill in the art that the order of steps shown in the exemplary methods of