The present invention relates generally to the field of a search engine in a computer network system, in particular to a system and method of customizing rankings of search results in response to search queries submitted by members of one or more user groups.
Search engines are powerful tools for locating and retrieving documents from the Internet (or an intranet). Traditionally, the search results produced by a search engine are independent of the user who issued the search query. For example, the search engine generates the same search result for the search query “apple” irrespective of whether the search query is from users interested in Apple® computers or the fruit Malus domestica. Clearly, the search results returned for the search query “apple” are likely to include some results of little interest to these respective groups of users.
In view of the aforementioned, it would be desirable to have a search engine that can customize its search results so as to highlight information items in the search results that are most likely to be of interest to the users who submit the search queries. Further, it would be desirable for such a system to operate without explicit input from a user with regard to the user's personal preferences and interests, and for the system to protect the privacy interests of its users.
In some embodiments, a computer-implemented method associates a plurality of groups with a user. Each group may have at least one profile. The method also includes receiving a search query from the user and identifying information items associated with the search query. The method computes adjusted scores for the information items based on the groups' profiles, and ranks the information items accordingly before providing the ranked information items to the user.
In some embodiments, a computer-implemented method associates a group having a plurality of profiles with a user. The method also includes receiving a search query from the user and identifying information items associated with the search query. The method computes adjusted scores for the information items based on the group's profiles, and ranks the information items accordingly before providing the ranked information items to the user.
Some embodiments may be implemented on either the client side or the server side of a client-server network environment.
The aforementioned features and advantages of the invention as well as additional features and advantages thereof will be more clearly understood hereinafter as a result of a detailed description of preferred embodiments when taken in conjunction with the drawings.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
In some embodiments, the information server 106 contains a subset or superset of the elements illustrated in
A website 102 is typically a collection of webpages associated with a domain name on the Internet. Each website (or webpage) has a uniform resource locator (URL) that uniquely identifies the location of the website (or webpage) on the Internet. Any visitor can visit the website by entering its URL in a browser window. A website can be hosted by a web server exclusively owned by the owner of the domain name or by an Internet service provider wherein its web server manages multiple websites associated with different domain names. For illustrative purposes, the website 102 includes webpage 116, which may have an associated search box. From the search box, a visitor to the webpage can search the website 102 or the entire Internet for relevant information by entering a search query into the search box. Depending on the context, the term “website” as used in this document refers to a logical location (e.g., an Internet or intranet location) identified by a URL, or it refers to a web server hosting the website represented by the URL. For example, some “websites” are distributed over multiple Internet or network locations, but have a shared web server hosting those locations, and in many situations it is logical to consider those network locations to all be part of “a website.”
A client 103 can be any of a number of devices (e.g., a computer, an internet kiosk, a personal digital assistant, a cell phone, a gaming device, a desktop computer, or a laptop computer) and can include a client application 132, a client assistant 134, and/or client memory 136. The client application 132 can be a software application that permits a user to interact with the client 103 and/or network resources to perform one or more tasks. For example, the client application 132 can be a browser (e.g., the computer program available under the trademark Firefox®) or other type of application that permits a user to search for, browse, and/or use resources (e.g., webpages and web services) at the website 102 from the client 103 and/or accessible via the communication network 104. The client assistant 134 can be a software application that performs one or more tasks related to monitoring or assisting a user's activities with respect to the client application 132 and/or other applications. For instance, the client assistant 134 assists a user at the client 103 with browsing for resources (e.g., files) hosted by the website 102; processes information (e.g., search results) received from the information server 106; and monitors the user's activities on the search results. In some embodiments the client assistant 134 is part of the client application 132, available as a plug-in or extension to the client application 132 (provided, for example, from various online sources), while in other embodiments the client assistant 134 is a stand-alone program separate from the client application 132. In some embodiments the client assistant 134 is embedded in one or more webpages or other documents downloaded from one or more servers, such as the information server 106. Client memory 136 can store information such as webpages, documents received from the information server 106, system information, and/or information about a user.
The communication network 104 can be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet. It is sufficient that the communication network 104 provide communication capability between the websites 102, the clients 103 and the information server 106. In some embodiments, the communication network 104 uses the Hypertext Transfer Protocol (HTTP) to transport information using the Transmission Control Protocol/Internet Protocol (TCP/IP). HTTP permits client computers to access various resources available via the communication network 104. The various embodiments of the invention, however, are not limited to the use of any particular protocol. The term “resource” as used throughout this specification refers to any piece of information or service that is accessible via a URL and can be, for example, a webpage, a document, a database, an image, a computational object, a search engine, or other online information service.
In order to receive group-dependent search results, a user from client 103 (for example) may send to a website 102 a request for a webpage. The website responds by identifying the requested webpage and returning it to the requesting client 103. The webpage may include a document of interest to the user (e.g., a newspaper article). The webpage may also include a search box (e.g., at or near the top of the webpage). While or after browsing the content of the webpage, the user may be interested in getting more information. To do so, the user can enter a search query and submit the search query to the website 102, a search engine, or the like. The search query may include one or more query terms.
As noted above, many websites do not have a dedicated search engine. Their search requests are actually handled by a third-party search engine. In some embodiments, upon receipt of the search query, the website 102 generates and sends a search request to the information server 106. In some other embodiments, the client 103 generates and sends the search request directly to the information server 106 without routing the request through the website 102. In other embodiments, the user may choose to use the website of a third party search engine directly. In any case, the search request may include the search query and unique identifiers of one or more of the following entities: the website 102 being viewed, a website previously viewed by the user, the requesting user, and the requesting client 103. The identifier for a website may be a URL of a particular web page, or a prefix portion of the URL that identifies the website or a portion of the website. The search engine 122 or other portion of the information server 106 may determine the appropriate portion of the URL to use for determining a group associated with the user.
Within the information server 106, the front end server 120 is configured to handle a variety of requests from the websites 102 and the clients 103 via their respective connections with the communication network 104. As shown in
The front end server 120 passes the search request onto the search engine 122. The search engine 122 then communicates with the content database 124 to select a plurality of information items (e.g., documents) in response to the search request. The search engine 122 assigns a generic ranking score to each information item based on the item's page rank, the text associated with the item, and the search query. For ease of discussion, information items will often be referred to as “documents;” but it is to be understood that information items need not be documents, and may include other types or forms of information.
The search engine 122 is also connected to the document profile database 123. The document profile database 123 may store a document profile for each indexed document in the content database 124. Both the document profile database 123 and the content database 124 are connected to the document profiler 125. For each document in the content database 124, the document profiler generates a document profile by analyzing the content of the document and its link structure. The generation of document profiles is independent of the operation of the search engine 122. In some embodiments, the document profiler 125 is invoked to generate a document profile whenever the information server 106 identifies a new document or a new version of an existing document on the Internet. In other embodiments, the document profiler 125 is invoked periodically to generate document profiles for all new files identified during a predetermined time period. In some embodiments, instead of being two separate entities, the document profile database 123 and the content database 124 are merged together so that a document and its associated profile can be located by a single database query.
There is a connection from the search engine 122 to the search result ranker 126. Through this connection, the search engine 122 sends the identified documents and their associated document profiles to the search result ranker 126. The search result ranker 126 has a connection to the group profile database 128. Like the document profile database 123, the group profile database 128 stores a large number of group profiles including group profiles of one or more groups associated with a requesting user. For example, the search result ranker 126 may use a group profile associated with users of a website 102 (e.g., a website currently or recently visited by the user) to convert the generic ranking score of each identified document into a group-dependent ranking score. The documents are then re-ordered in accordance with their respective group-dependent ranking scores. Next, the search result ranker 126 creates a search result in accordance with the updated order of the documents. The search result includes multiple document links, at least one for each document. The search result, or a portion of the search result (e.g., information identifying the top 10, 15 or 20 information items or documents), is returned to the requesting client 103 and displayed to the user through the client application 132. The user, after browsing the search result, may click one or more document links in the search result to download and view one or more documents identified by the search result.
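The conversion from generic to group-dependent ranking scores can be illustrated with a minimal sketch. The specification does not state a formula at this point, so the multiplicative boost below, the representation of profiles as category-to-weight mappings, and the names similarity, rerank and boost are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a profile maps a category name to a weight.
Profile = Dict[str, float]


def similarity(doc_profile: Profile, group_profile: Profile) -> float:
    """Dot product over the categories that the two profiles share."""
    return sum(w * group_profile.get(cat, 0.0) for cat, w in doc_profile.items())


def rerank(
    results: List[Tuple[str, float, Profile]],
    group_profile: Profile,
    boost: float = 1.0,
) -> List[Tuple[str, float]]:
    """Return (doc_id, group_score) pairs, highest score first.

    Assumed formula: group_score = generic_score * (1 + boost * similarity).
    """
    rescored = [
        (doc_id, generic * (1.0 + boost * similarity(doc_profile, group_profile)))
        for doc_id, generic, doc_profile in results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Two results for the query "apple", each with a generic score and a document profile.
results = [
    ("apple-orchards", 0.9, {"food": 1.0}),
    ("apple-laptops", 0.8, {"electronics": 1.0}),
]
electronics_group = {"electronics": 0.5}
ranked = rerank(results, electronics_group)
```

With the electronics group profile, the laptop page moves ahead of the orchard page even though its generic score is lower; with an empty group profile the generic order is preserved.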
While the above description divided tasks among the search engine 122, search result ranker 126 and front end server 120 in a particular way, this particular division of tasks is exemplary, and other divisions may be used in other embodiments. For instance, the group profile (associated with the user from whom a search query is received) may be transmitted with the search query to the search engine 122, and the search engine 122 may use that information to compute group specific document scores for ranking the search results. In effect, this would merge the search result ranker 126 into the search engine 122. Alternately, an identifier of the group profile may be transmitted with the search query to the search engine 122. If the search engine 122 has a copy of the group profile or has access to the group profile, it can then use that information to compute group specific document scores. In yet other embodiments, other divisions of tasks may be used.
An important aspect of the process of serving group-dependent search results is the generation and maintenance of the group profiles stored in the group profile database 128. A group profile should reflect the interests of the users of the associated group, and in many embodiments the group profile will be unique to its associated group. For example, users of a consumer electronics website should have a group profile that boosts webpages related to electronic products while users of an on-line grocery store should have a group profile that promotes webpages related to food. Users of both the consumer electronics website and the on-line grocery store may be associated with both groups.
In most embodiments, a group profile is not static, because a static group profile is unlikely to result in the information server 106 serving the most relevant search results to users of the associated group. Instead, a group profile is updated from time to time (e.g., periodically) so as to re-align the group profile with the current interests of the users associated with the group. While some group profiles may remain virtually static for long periods of time (e.g., for groups associated with a small, static population of users, and/or users having interests focused on a very narrow range of topics), many group profiles will vary over time as the associated users change and as the users' interests vary over time. In an exemplary embodiment, a respective group profile may be based on dated information, with older information receiving lower weightings than newer information when constructing a vector or other representation of the group profile. For example, the information for each successively older time period may be down-weighted by a predefined scaling factor, so that information from a period that is more than N (e.g., a value between 5 and 20) periods old has less than half the impact on the group profile as information from the current period.
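One way to satisfy the "less than half the impact after N periods" condition is to pick a per-period scaling factor s with s^N = 0.5, so that any period older than N receives a weight below 0.5. The sketch below assumes that choice, and assumes each period's information is a mapping from category to count; the function names are hypothetical.

```python
from typing import Dict, List


def scaling_factor(n_periods: int) -> float:
    """Per-period factor s chosen so that s ** n_periods == 0.5."""
    return 0.5 ** (1.0 / n_periods)


def decayed_profile(
    period_vectors: List[Dict[str, float]], n_periods: int = 10
) -> Dict[str, float]:
    """Combine per-period category counts; index 0 is the current period."""
    s = scaling_factor(n_periods)
    profile: Dict[str, float] = {}
    for age, vector in enumerate(period_vectors):
        weight = s ** age  # older periods contribute less
        for category, count in vector.items():
            profile[category] = profile.get(category, 0.0) + weight * count
    return profile


# With n_periods=1 the factor is exactly 0.5, so last period's counts are halved.
example = decayed_profile([{"electronics": 4.0}, {"food": 4.0}], n_periods=1)
```

In this example the current period contributes its full count for "electronics", while the equally large "food" count from one period earlier contributes half as much.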
In some embodiments, the profile for a particular group may include weights depending on the number of “clicks” or visits by the users to a particular website, webpage, or set of websites during a particular window in time. For example, if users associated with a particular location group (e.g., Alexandria, Va.) frequently visit the website of a local park (e.g., Mason Neck State Park), the weight for the local park website within the local group profile would be increased. In some embodiments, the weights in a group profile may be associated with one or more half-lives or other time modulations to allow the information server to provide the most relevant and timely search results. For example, “clicks” on the Mason Neck State Park website may be more frequent on Fridays and Saturdays in the months of April through October than at other times, so the weight for the Mason Neck State Park website for users associated with the Alexandria, Va. group might be higher for a search done on a Friday in July than for a search done on a Tuesday in December. In another example, users associated with a consumer electronics group may “click” on information about Shakira at a particularly high frequency at a particular time, but the frequency of “clicks” may decay over a year, a week, a day, a minute, or other time period. In some embodiments, the time decay rate of “clicks” is stored in the form of half-lives over several different periods of time. A group profile can also be based on characteristics other than user clicks. Other characteristics of user behavior that can be used in a group profile include one or more of the following: the length of time that a user interacts with the website, the proportion of the website viewed by the user, actions (in addition to clicks) taken by a user while visiting the website (e.g., printing, bookmarking, cutting and pasting, annotating), and a user's activity subsequent to the interaction with the website.
There are similarities between a group profile and a user profile, as described above. Both profiles can be used to finely tune the search results generated by the search engine. Both need information about at least one user's search history in order to capture the user's dynamic search interest. But there are also significant differences between the two types of profiles. A typical user profile is generated by analyzing an individual user's search history. This user profile is only used to modulate search results responsive to search queries submitted by the same user. For the same search query, two different users may receive different search results from the same search engine if they have different user profiles. In contrast, a group profile is usually generated by analyzing the search history of multiple users so as to characterize the multiple users' interests. The size of a group may range, for example, from a predefined minimum value to approximately one million, or as many as the number of users for whom characteristics may be reasonably stored. The minimum size of a group may be selected or determined so as to preserve the privacy interests of the group members. A group profile can be used to modulate search results responsive to search queries submitted by any user from the same group, including new users of the group who made no prior “contribution” to the group profile. Therefore, the same user submitting the same search query from two different locations (e.g., websites, computers, IP addresses, or geolocations) may receive different search results if the two locations have different group profiles.
The group profile also has an important advantage over the user profile in terms of protecting a user's privacy. Thus, in some embodiments, a group profile is based on information associated with at least a sufficient number of users to protect the privacy of those users. In some embodiments a group profile contains at least approximately 256 users to protect individual users' privacy, while in other embodiments the minimum size of a respective group is a value between ten and 512, and in another embodiment the minimum size is between ten and 10,000. A user profile is associated with an individual user. To create the user profile, the individual user, either explicitly or implicitly (e.g., by monitoring or logging search queries and other online activities of the user), needs to complete a survey of his or her personal preferences. This survey indicates what information items may be of interest to the user. Further, the user must have an account at a website or a search engine system and the user must log into his or her account to invoke the user profile to personalize the search results. In contrast, the creation and usage of the group profile does not require any personal information from any user. A group profile is associated with a group of users, not an individual user. Any individual user's activity is attributed to all the users in the corresponding group or groups. A user does not need to log into his or her account at the website in order to use the group profile. As long as the user may be associated with a group, the information server automatically “personalizes” the corresponding search result in accordance with the group profile.
As shown in
In some embodiments, when the front end server 120 receives a search request, it submits a copy of the search query to the search engine 122 in order to solicit a search result. In addition, the front end server 120 sends another copy of the search query to the search history database 127. The search history database 127 then generates a record, the record including at least the search query and one or more group identifiers or other information from which one or more group identifiers can be derived.
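A record of this kind can be pictured as a small data structure. The sketch below is an assumption-laden illustration: the SearchRecord type, its field names, and the rule of using the host name of a URL as a website group identifier are hypothetical, and the specification leaves the actual derivation open.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlsplit


@dataclass
class SearchRecord:
    """Hypothetical search history record: a query plus its group identifiers."""
    query: str
    group_ids: List[str] = field(default_factory=list)


def group_id_from_url(url: str) -> str:
    """Assumed rule: use the lower-cased host name as the website group identifier."""
    return urlsplit(url).netloc.lower()


# A query submitted from a page on www.example.com is recorded under that website group.
record = SearchRecord(
    query="apple",
    group_ids=[group_id_from_url("https://www.example.com/news/a.html")],
)
```

Storing only the query and group identifiers, rather than a user identifier, is consistent with the stated goal of not requiring personal information from any user.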
The search result ranker 126 prepares search results responsive to the search query. The search result (i.e., information representing at least a portion of the search results) is sent back to the requesting client through the front end server 120. A copy of the search result, or a portion of the search result, is also stored in the search history database 127 together with the search query record. The client assistant 134 at the requesting client monitors the requesting user's activities on the search result, e.g., recording the user's selection(s) of the document links in the search result and/or the mouse hovering time on different document links. In some embodiments, the client assistant 134 or the group profiler 129 determines the document “dwell time” for a document selected by the user, by determining the amount of time between user selection of the corresponding document link and the user exiting from the document. In some embodiments, the client assistant 134 includes executable instructions, stored in the webpage(s) containing the search result, for monitoring the user's actions with respect to the search results and transmitting information about the monitored user actions back to the information server 106. In some embodiments, the search results are served to the requesting users with an embedded client assistant 134 that sends information about the user activities on the search results to the group profiler 129. The information server 106, in turn, stores information about these user activities in the search history database 127 for subsequent use. The search history database 127 may allocate amounts of storage space for different groups. As a result, the volume of search history associated with a group does not exhaust its designated space or waste too much space before the next scheduled profile updating.
For example, the group profiler 129 records the moment that a user submits a search query (t0), the moment that the user clicks the first document link in the corresponding search result (t1), and the moment that the user clicks the second document link in the search result (t2), etc. The differences between two consecutive moments (e.g., t1−t0 or t2−t1) are reasonable approximations of the amount of time spent viewing the search result or the document whose link was selected by the user. In some embodiments the group profiler 129 has no information about the user's dwell time for the last document in the search result that the user selects for viewing. In some other embodiments (e.g., where at least some users “opt in” to a version of the client assistant that collects additional information about the users' online activities), the group profiler 129 also receives click and timestamp information for user actions after the user finishes viewing documents from a search result. Continuing the above example, the group profiler 129 further records the moment that the user submits a second query (t3), the moment the user selects a document from the second search results (t4), and so on. Furthermore, the group profiler 129 may record the moment (t5) when the user either closes the browser window that was being used to view search results and documents listed in the search results or navigates away from the webpage or document with the search results. This additional information enables the group profiler 129 to determine the user dwell time for all search result documents (i.e., documents listed in search results) viewed by a user, which in turn enables the group profiler 129 to generate a more accurate group profile for a group.
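The timestamp arithmetic in this example can be sketched as follows. The function name, the (document, time) pair representation, and the use of seconds are assumptions; the optional exit_time stands for any later event the profiler happens to know about, such as the next query (t3) or the moment the window is closed (t5).

```python
from typing import Dict, List, Optional, Tuple


def dwell_times(
    query_time: float,
    clicks: List[Tuple[str, float]],
    exit_time: Optional[float] = None,
) -> Tuple[Optional[float], Dict[str, float]]:
    """Approximate viewing times from consecutive timestamps.

    clicks holds (doc_id, click_time) pairs in chronological order.
    Returns the time spent on the result page before the first click
    and the approximate dwell time per clicked document.
    """
    first_click_delay = clicks[0][1] - query_time if clicks else None
    per_document: Dict[str, float] = {}
    for i, (doc_id, clicked_at) in enumerate(clicks):
        if i + 1 < len(clicks):
            next_event = clicks[i + 1][1]
        elif exit_time is not None:
            next_event = exit_time
        else:
            continue  # no later event: dwell time for the last document is unknown
        per_document[doc_id] = per_document.get(doc_id, 0.0) + (next_event - clicked_at)
    return first_click_delay, per_document


t0, t1, t2 = 0.0, 5.0, 35.0
delay, times = dwell_times(t0, [("doc-a", t1), ("doc-b", t2)])
delay_full, times_full = dwell_times(t0, [("doc-a", t1), ("doc-b", t2)], exit_time=50.0)
```

Without a later event, only the first document receives an estimate (t2 minus t1); supplying an exit time also yields an estimate for the last document, which corresponds to the additional information described for the opt-in case.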
Based on characteristics associated with one or more users, the group profiler 129 generates a group profile.
The group profiler 129 may identify online activities of users. Identified user activities may include user clicks on document links in search results. In another example, identified user activities may include mouse hovering time on the document links. Generally speaking, a user clicks a document link if the user is interested in the document's content. Similarly, the fact that the mouse moves onto a particular document link and stays there for a substantial amount of time indicates that this document is relevant to the user's interest. In some embodiments, information about the mouse hovering time may be unavailable.
From the user activities on different search results, the group profiler 129 can identify documents selected by the users. In some embodiments, the group profiler 129 visits the content database 124 to retrieve the profiles of the corresponding documents. As noted above, each identified document may have a profile (e.g., a document profile) that was previously generated. If any of the identified documents do not yet have profiles, those documents can be ignored, or the group profiler may call upon the document profiler 125 to produce document profiles for those documents.
In these ways, the group profiler 129 may also identify other users having one or more of the same characteristics (215). User characteristics may be correlated by topic, time or the like to form groups. The group profiler 129 then associates the user with other users having the same characteristics in at least one group (220). One or more group profiles are then generated (230). The group profiler 129 may generate a group profile based on the user characteristics, the retrieved document profiles, or the like. Some of the group profiles may be website profiles, the generation and use of which are described in patent application Ser. No. 11/394,620. A group profile may be validated, for example, by comparing the group to an average of other groups of the same type, by averaging the differences between groups, or other validation procedures. The group profile may include one or more of the following: a weighted listing or vector of categories (sometimes called a category-profile), key terms from search queries and/or user visited documents (sometimes called a term profile), and information about links to user visited documents (sometimes called a link profile). The group profile is stored in the group profile database 128. The search result ranker 126 can retrieve the group profile to re-order the ranks of the documents within a search result.
In some other embodiments, operation 230 may include a clustering operation in which user characteristics are clustered using statistical analysis to determine suitable groupings. The clustering may be based on, for example, the fact that a user clicks a link. Alternatively, the group profiler may directly match a document's URL against a known set of URLs associated with a particular category. In either case, the group profiler 129 does not need to access the documents' contents in order to generate the group profile.
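The direct URL-matching alternative can be illustrated with a short sketch. The prefix table, its example entries, and the longest-prefix rule are assumptions made for illustration; the specification only states that a URL is matched against a known set of URLs associated with a category.

```python
from typing import Dict, Optional

# Hypothetical table of URL prefixes whose category is already known.
KNOWN_PREFIXES: Dict[str, str] = {
    "shop.example.com/electronics": "electronics",
    "shop.example.com/grocery": "food",
}


def category_for_url(url: str, table: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the category of the longest matching prefix, or None if none matches."""
    if table is None:
        table = KNOWN_PREFIXES
    stripped = url.split("://", 1)[-1]  # drop the scheme, if any
    best = None
    for prefix in table:
        if stripped.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return table[best] if best is not None else None


grocery_category = category_for_url("https://shop.example.com/grocery/apples")
unknown_category = category_for_url("https://other.example.org/page")
```

Because only the URL string is inspected, this kind of lookup does not require fetching or parsing the document itself.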
In yet other embodiments, operation 230 may be augmented by a process that maps the user characteristics to a set of categories. For example, the categorization of queries can be based on the terms in the queries themselves, or by accessing the profiles of the top N search results (e.g., the top 5, 10, 15 or 20 search results) produced by those queries, merging those document profiles to produce a query profile for each query, and merging the query profiles, weighted in accordance with their frequency of submission to generate a group profile. As discussed below with reference to
As noted above, a group profile may be updated from time to time in order to keep track of the current interests of the users associated with the group. In some embodiments, a group profile is updated at a predetermined time interval (e.g., every week or every day). In some other embodiments, a group profile is updated whenever the number of new search queries by members of the group reaches a threshold value since a last (i.e., most recent) update. Whenever it is time to update the group profile, the group profiler 129 repeats the aforementioned process to update the group profile.
In some embodiments, different groups generate substantially different magnitudes of traffic and therefore should be treated differently in terms of profile updating. For instance, a group associated with a popular domain name may generate heavy traffic, on the order of tens of thousands of clicks per day, while a smaller group may have a much lower click rate.
Some groups may generate so much traffic that the group definition should be refined. For example, a group based on grouping the users' IP addresses in a range could include users of a proxy, which would be unlikely to have any correlation with the users' interests. In such a case, the outlier IP address (the suspected proxy IP address in this example) could be excluded from the group. There are two additional issues with significant traffic during a short time period. First, the group's profile may be biased by this traffic peak. Special care may be required to make sure that the group profile has an appropriate balance between the short-term and long-term interests of the users, such as by excluding or down-weighting the associated elements of the group profile. Second, the search history database 127 may not have the space to store all the search history. One approach to solve this issue is to intentionally ignore some of the search queries, search results and user activities. This may be accomplished by sampling the search queries, search results and/or user activities so as to produce an unbiased sample of the search history. While the extent of the sampling may vary from one embodiment to another, experiments suggest that a search history encompassing several months of user activities will have sufficient data to generate a reliable group profile, for most groups, so long as (A) the sampling is done in a manner that avoids significant biases, and (B) it includes user activity data corresponding to a few weeks of representative search history.
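The specification does not say how the unbiased sample is drawn. One well-known option for a stream of unknown length and a fixed storage budget is reservoir sampling, sketched below as an illustration rather than as the method actually used; the function name and the parameter k (the storage budget in records) are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream."""
    if rng is None:
        rng = random.Random()
    sample: List[T] = []
    for seen, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(seen)  # uniform index in [0, seen)
            if j < k:
                sample[j] = item  # replace with probability k / seen
    return sample


# Retain 10 of 1000 history records; every record is equally likely to be kept.
kept = reservoir_sample(range(1000), 10, random.Random(0))
```

Each record ends up in the retained sample with equal probability, so a traffic peak is thinned in proportion to its size and the storage bound is never exceeded.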
Alternatively, the space shortage issue can be solved by generating a series of incremental group profiles for different portions of the search history and merging the incremental group profiles into the group profile. As shown, for example, in
A group profile may be used for anonymously “personalizing” search results responsive to search queries submitted by a user associated with a group. An underlying assumption in the present specification is that a user's search queries are, more or less, related to at least one of the groups associated with the user. If not carefully filtered out, the search history associated with popular, but irrelevant, terms may seriously “contaminate” the group profile and twist the search results in an unexpected direction. Another source of contamination of the group profile is query terms that have very low popularity. Special treatment may be necessary to make sure that user activities with respect to very low popularity query terms do not significantly bias the search results.
In some embodiments the group profiler 129 (
There are multiple factors determining the contribution of a search query (or a corresponding search result) in the middle category 420 to the group profile. For example, the popularity of the search query and the amount of user activities on the search result affect the contribution of the search query and the search result to the group profile. Time is another important factor. In some embodiments, recent search history plays a more prominent role than less recent search history in the formation of the group profile. One skilled in the art can easily apply similar principles to other aspects of the search history associated with the group.
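One way to give recent history a more prominent role is to decay each event's contribution with its age. The following is only an illustrative sketch: the exponential form, the 30-day half-life, and the names `decayed_weight` and `profile_contribution` are assumptions, not details taken from the specification.

```python
def decayed_weight(base_weight, age_days, half_life_days=30.0):
    """Halve the contribution of an event for every `half_life_days` of age."""
    return base_weight * 0.5 ** (age_days / half_life_days)

def profile_contribution(events, half_life_days=30.0):
    """Sum decayed contributions per query term.

    `events` is an iterable of (term, activity_weight, age_days) tuples,
    where activity_weight stands for the amount of user activity observed.
    """
    totals = {}
    for term, activity_weight, age_days in events:
        totals[term] = totals.get(term, 0.0) + decayed_weight(
            activity_weight, age_days, half_life_days
        )
    return totals

contrib = profile_contribution([
    ("golf clubs", 1.0, 0),    # activity observed today
    ("golf clubs", 1.0, 30),   # one half-life ago
    ("tax forms", 1.0, 60),    # two half-lives ago
])
```

In this toy input, "golf clubs" accumulates 1.0 + 0.5 and "tax forms" contributes 0.25, so equally active but older history counts for less.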
In some embodiments, group profile generation is divided into multiple sub-processes. Each sub-process produces a specific type of group profile characterizing the interests of the website users from a particular perspective. Four examples of the types of group profiles that may be produced by sub-processes of the group profiler 129 (
In some embodiments, the website_i group 530 may have only a subset of the group profiles 531, 533, 535, and 537. For example, the website_i group 530 may include a single term-based group profile 533. In some embodiments, the group 530 includes a plurality of group profiles. In some embodiments, at least one of the plurality of group profiles is a combination of two or more of the aforementioned group profiles 531, 533, 535, and 537. In some other embodiments, the category-based, term-based and/or link-based group profiles are further processed to generate a cluster-based group profile. In yet other embodiments, the cluster-based group profile appears in the form of multiple cluster-based sub-profiles characterizing different aspects of the group. In other embodiments, the cluster-based group profile 537 is generated independently of the category-based, term-based, and link-based group profiles using statistical methods.
The category-based group profile 531 may be constructed, for instance, by mapping search history items (e.g., search queries, content terms, and/or user-selected documents) to categories, and then aggregating the resulting sets of the categories and weighting the categories. The categories may be weighted based on their frequency of occurrence in the search history items. In addition, the categories may be weighted based on the relevance of the search history items to the categories. The search history items accumulated over a period of time may be treated as a group for mapping into weighted categories. Other suitable ways of mapping the search history into weighted categories may also be used. In addition, category-based group profiles may also be based on, or take into account, information about the language(s) used by the websites visited by a group of users, the reading level of such websites, and other characteristics of the websites that may be used for re-scoring search engine search results.
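The mapping and aggregation described above can be sketched as follows. This is a simplified illustration: the `classify` callback, the toy category map, and the choice to normalize weights so that they sum to one are all assumptions made for the example.

```python
from collections import defaultdict

def build_category_profile(history_items, classify):
    """Aggregate weighted categories over a batch of search-history items.

    `classify(item)` returns {category: relevance} with relevance in [0, 1].
    Each occurrence adds its relevance, so the resulting weights reflect both
    frequency of occurrence and relevance. Weights are normalized to sum to 1.
    """
    weights = defaultdict(float)
    for item in history_items:
        for category, relevance in classify(item).items():
            weights[category] += relevance
    total = sum(weights.values())
    if total == 0:
        return {}
    return {category: w / total for category, w in weights.items()}

# Hypothetical toy classifier standing in for a real category mapper.
TOY_MAP = {
    "stock quotes": {"Business": 1.0},
    "protein folding": {"Science": 0.9},
    "biotech ipo": {"Business": 0.5, "Science": 0.5},
}
profile = build_category_profile(
    ["stock quotes", "protein folding", "biotech ipo", "protein folding"],
    lambda item: TOY_MAP.get(item, {}),
)
```

Here the repeated "protein folding" query raises the Science weight above the Business weight, showing how frequency and relevance jointly shape the profile.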
The categories shown in
Different profiles for the same time period can be generated in different ways, to reflect different aspects of the users in the group (e.g., short term, medium term, or long term interests). This may be accomplished by putting different emphasis or different weighting on different portions of the data used to generate the group profiles. Some group profiles may be generated for special time periods, such as holidays and events (e.g., Christmas, Olympics, etc.) during which the behavior of users may change significantly. The data for such special time periods may also be removed, or down-weighted, when generating “regular” profiles for a group of users.
In some embodiments, the search history items are automatically classified in different clusters. Clusters may be more dynamic than categories, since categories are typically pre-generated. Search history items associated with different groups are classified against the same set of categories. In contrast, there may not be a predefined set of clusters for a particular group. The search history may fall into a dynamically generated set of clusters. Therefore, clusters may be better tailored to characterize the interests and preferences of the group's users or provide additional information about a group to improve the customization of search results. For convenience, many of the discussions of profiles in this document use categories as an example. But it will be clear to one skilled in the art that the underlying algorithms are also applicable to clusters with little or no adjustment.
A category-based group profile, based upon the category map 600, is a topic-oriented implementation of a group profile. The items in a category-based profile can also be organized in other ways. In one embodiment, the interests of the website users can be categorized based on the formats of the documents identified by the website users, such as HTML, plain text, PDF, Microsoft Word, etc. Different formats may have different weights. In another embodiment, the interests of the website users can be categorized according to the types of the identified documents, e.g., an organization's homepage, a person's homepage, a research paper, or a news group posting, each type having an associated weight. Documents can also be categorized by document origin, for instance the country associated with each document's host. In yet another embodiment, two or more of the above-identified category-based profiles may co-exist, with each one reflecting a respective aspect of the interests of the website users.
Each group table 672 may store a group confidence value record 674 for the group, indicating the weight (also called the confidence value) of the group in the re-ranking process. In some embodiments, all groups begin with a default weight or group confidence value of one. Then modifications may be made to the group confidence value to reflect the appropriate weight of the group as associated with the user. For example, if a user is loosely associated with a group, the group confidence value may be lowered. Similarly, if a group is incoherent or has a low coherence, the group confidence value may be lowered. (Here coherence is used in the technical sense to refer to the degree to which members of a group are similar to other members of the group and dissimilar to members of other groups.) Group confidence values may also vary by group type. For example, location type groups may generally have higher (or perhaps, lower) confidence values than website type groups. Group confidence values may also vary by any other factor that affects the preferred weight of a group in computing the adjusted score for documents in the search result, including by query type (i.e., varying the group confidence value when computing the adjusted score in response to an image query as opposed to a map query, etc.). For example, a high traffic internet address group may reflect the users' proxy server, which is unlikely to have any significant relationship to users' interests and should therefore have a low confidence value relative to groups that more consistently reflect their users' interests. In some cases, when the group confidence value approaches or reaches zero, the user is no longer associated with the group.
Each group table 672 may also include profile confidence value records for every group profile of each group, for example 676-1 and 676-2 for Group 1, 676-3 and 676-4 for Group 2, and 676-2X-1 and 676-2X for Group X. Each profile confidence value record 676 may include a profile confidence value (e.g., Profile 1 Conf. Value) indicating the weight of each profile as associated with the group in the re-ranking process for a particular user; a unique identification (ID) of the profile associated with each group (e.g. Grp1-Prf1 ID); and a pointer to a group profile table, such as category-based group profile table 650, term-based group profile table 700, link-based group profile table 800, or the like. In some embodiments, all profiles begin with a default profile confidence value of one. Then modifications may be made to the profile confidence value to reflect the appropriate weight of the profile as associated with the user and the group as described with respect to the group confidence value above. For example, group 1, based on website i, may have a short term profile, profile 1, and a long term profile, profile 2. The profile confidence value for the short term profile 1 may be raised for a news query when website i is www.cnn.com based on the assumption that current news is more interesting to users associated with www.cnn.com. In some cases, when the profile confidence value approaches or reaches zero, the profile is no longer associated with the user or the group.
In some embodiments, an information server stores information identifying a limited number of groups per user. In addition, one or more groups may be associated with a user on the fly, for instance based on the user's IP address or the website from which the user is submitting a search query. Any groups not associated with a user are implicitly assigned a default confidence value of zero. In some embodiments, group memberships (i.e., groups associated with a user) are updated from time to time based on “recent history” (new data concerning recent online behavior of the user). This may include increasing the confidence of groups for which there is evidence of continued membership, and decreasing the confidence of groups for which there is a lack of evidence of continued membership or evidence of decreased activity. A group may be removed from the set of groups associated with a user when the confidence value of the group falls below a predefined threshold value (e.g., 0.2 or any other appropriate value). Furthermore, since some group membership information may be determined on the fly, confidence values for those groups may be based on other information, such as a website's coherence value (see definition of coherence value, above).
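The membership update described above might look like the following sketch. The additive step of 0.1, the clamping to the range 0 to 1, and the group identifiers are hypothetical; only the example threshold of 0.2 comes from the text.

```python
def update_memberships(confidences, evidence, step=0.1, threshold=0.2):
    """Return updated {group_id: confidence} after one batch of recent history.

    `evidence` maps group_id -> True when recent activity supports continued
    membership. Groups without supporting evidence lose confidence.
    """
    updated = {}
    for group_id, conf in confidences.items():
        if evidence.get(group_id, False):
            conf = min(1.0, conf + step)
        else:
            conf = max(0.0, conf - step)
        # Groups whose confidence falls below the threshold are removed.
        if conf >= threshold:
            updated[group_id] = conf
    return updated

memberships = update_memberships(
    {"site:example-golf": 0.9, "ip-range:A": 0.25},
    {"site:example-golf": True},
)
```

In this example the website group is reinforced, while the IP range group drops to 0.15 and is removed from the user's set of groups.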
Besides term-based and category-based profiles, another type of group profile is referred to as a link-based profile. As discussed above, the page rank of a document is based on the link structure that connects the document to other documents on the Internet. A document having more links pointing to it is often assigned a higher page rank and is therefore deemed more popular by the search engine. Link information of documents selected by users can be used to infer the interests of the users. In one embodiment, a list of preferred URLs is identified for users by analyzing the click rate of these URLs. Each preferred URL may be further weighted according to the mouse hovering time by the users at the URL. In another embodiment, a list of preferred web hosts is identified for the users by analyzing the users' visit rate at different web hosts. When two or more preferred URLs are related to the same web host, the weights of the two or more URLs may be combined as the weight of the web host.
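As an illustration of combining URL weights into host weights, the sketch below sums the weights of all preferred URLs that share a host. Summation is only one possible way of combining the weights, and the URLs and weights shown are invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def host_weights(url_weights):
    """Combine per-URL weights into per-host weights by summation."""
    hosts = defaultdict(float)
    for url, weight in url_weights.items():
        host = urlsplit(url).hostname
        if host:  # skip strings without a parsable host
            hosts[host] += weight
    return dict(hosts)

hosts = host_weights({
    "https://finance.example.com/markets": 0.4,
    "https://finance.example.com/quotes": 0.3,
    "https://news.example.org/today": 0.2,
})
```

The two `finance.example.com` URLs collapse into a single host entry with a combined weight of about 0.7.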
A preferred list of URLs and/or hosts includes URLs and/or hosts that have been directly identified by the users. The preferred list of URLs and/or hosts may further extend to URLs and/or hosts indirectly identified by using methods such as collaborative filtering or bibliometric analysis, which are known to one of ordinary skill in the art. In one embodiment, the indirectly identified URLs and/or hosts include URLs or hosts that have links to/from the directly identified URLs and/or hosts. These indirectly identified URLs and/or hosts are weighted by the distance between them and the directly identified URLs or hosts. For example, when a directly identified URL or host has a weight of 1, URLs or hosts that are one link away may have a weight of 0.5, URLs or hosts that are two links away may have a weight of 0.25, etc. This procedure can be further refined by reducing the weight of links that are not related to the topic of the original URL or host, e.g., links to copyright pages or web browser software that can be used to view the documents associated with the user-selected URL or host. Irrelevant links can be identified based on their context or their distribution. For example, copyright links often use specific terms (e.g., “copyright” and “All rights reserved” are commonly used terms in the anchor text of a copyright link); and links to a website from many unrelated websites may suggest that this website is not topically related (e.g., links to the Internet Explorer® website are often included in unrelated websites). The indirect links can also be classified according to a set of topics or categories and links with very different topics or categories may be excluded or be assigned a low weight.
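The distance-based weighting in the example above (1, 0.5, 0.25, and so on) amounts to multiplying the seed weight by 0.5 for each link traversed. A minimal sketch follows, assuming a small in-memory link graph; the function name `indirect_weights`, the `max_distance` cutoff, and the rule of keeping the largest weight when a URL is reachable from several seeds are assumptions. To cover links both to and from a URL, the graph would need to list neighbors in both directions.

```python
from collections import deque

def indirect_weights(links, seeds, decay=0.5, max_distance=2):
    """Breadth-first walk outward from directly identified URLs.

    `links` maps a URL to its neighboring URLs; `seeds` maps directly
    identified URLs to their weights. A URL at link distance d from a seed
    receives seed_weight * decay**d; the largest such weight is kept.
    """
    weights = dict(seeds)
    for seed, seed_weight in seeds.items():
        seen = {seed}
        queue = deque([(seed, 0)])
        while queue:
            url, dist = queue.popleft()
            if dist == max_distance:
                continue  # do not expand beyond the cutoff
            for nxt in links.get(url, ()):
                if nxt in seen:
                    continue
                seen.add(nxt)
                w = seed_weight * decay ** (dist + 1)
                weights[nxt] = max(weights.get(nxt, 0.0), w)
                queue.append((nxt, dist + 1))
    return weights

graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
weights = indirect_weights(graph, {"a": 1.0})
```

With the toy graph, "b" is one link from the seed and "c" is two links away, while "d" lies beyond the cutoff and receives no weight.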
The types of group profiles discussed above are generally complementary to one another since different profiles characterize the interests of users from different vantage points. However, this does not mean that one type of group profile, e.g., the category-based profile, is incapable of playing a role that is typically played by another type of group profile. By way of example, a preferred URL or host in a link-based profile is often associated with a specific topic, e.g., finance.yahoo.com is a URL focusing on financial news. Therefore, what is achieved by a link-based profile that comprises a list of preferred URLs or hosts may also be achievable, at least in part, by a category-based profile that has a set of categories that cover the same topics covered by preferred URLs or hosts.
Each group associated with the requesting user is identified (922). In some embodiments, some or all of the requesting user's group identifier(s) are embedded in the search query by the client assistant 134 or other means. Based on the group identifier(s), the search result ranker 126 identifies the associated group profiles in the group profile database 128 (925). For each document identified by the search engine 122 the search result ranker identifies a document profile (930), based on which a generic ranking score is derived.
Next, the search result ranker 126 analyzes each identified document to determine one or more boost factors using the group and document profiles (935) and then assigns the document a group-dependent ranking score using the document's generic ranking score and the boost factors (940). The search result ranker 126 iterates the process for every identified document (942). The search result ranker 126 re-orders the list of documents according to their group-dependent ranking scores (945) to produce re-ordered search results. At least a portion of the re-ordered search results (e.g., the top N ranked items, based on the re-ordering), including links to a list of documents, are sent to the requesting client 103.
In some embodiments, the analysis of an identified document at 935 includes determining a correlation between the document's content and the group profiles. Furthermore, in some embodiments, this operation includes accessing a previously computed document profile for the document and then determining a correlation between the document profile and the group profiles. In some embodiments, determining the correlation includes one or more operations that are “dot product” computations, which determine the extent of overlap, if any, between the document profile and the group profiles. In addition, instead of determining and then applying a boost factor (as in operations 935 and 940), some documents may have their group-dependent ranking score set very high or very low in accordance with information in a group profile. For instance, for a group associated with “Apple computers,” documents from websites associated with fruit and produce may be assigned a predefined very low group-dependent ranking score.
The rightmost column of each of the three tables (1010, 1030 and 1050) stores the boost factor (i.e., a computed score) of a document when the document is evaluated using one specific type of group profile. A document's boost factor can be determined by combining the weights of the items associated with the document. For instance, a category-based or term-based boost factor for users associated with an IP address range group may be computed as follows. The users may favor documents related to science with a weight of 0.6, and disfavor documents related to business with a weight of −0.2. Thus, when a science document matches a search query, it will be boosted over a business document. In general, the document topic classification may not be exclusive. A candidate document may be classified as being a science document with probability of 0.8 and a business document with probability of 0.4. A link-based boost factor may be computed based on the relative weights allocated to the preferred URLs or hosts in the link-based profile. In one embodiment, the term-based profile rank can be determined using known techniques, such as “term frequency-inverse document frequency” (TF-IDF). The “term frequency” of a term is a function of the number of times the term appears in a document. The “inverse document frequency” of a term is an inverse function of the number of documents in which the term appears within a collection of documents. For example, very common terms like “word” occur in many documents and consequently are assigned a relatively low inverse document frequency, while less common terms like “photograph” and “microprocessor” are each assigned a relatively high inverse document frequency.
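The science and business figures above can be combined in a short worked example. One plausible reading of "combining the weights" is a weighted sum (a dot product) of the group's category weights and the document's category probabilities; that reading is an assumption, as is the function name `profile_correlation`. How the resulting value is then mapped to a boost factor is not specified here.

```python
def profile_correlation(group_profile, document_profile):
    """Dot product over the categories the two profiles share."""
    return sum(
        weight * document_profile[category]
        for category, weight in group_profile.items()
        if category in document_profile
    )

# Category weights of the group and category probabilities of the document.
group_profile = {"science": 0.6, "business": -0.2}
document_profile = {"science": 0.8, "business": 0.4}

# 0.6 * 0.8 + (-0.2) * 0.4 = 0.48 - 0.08 = 0.40 (up to floating-point rounding)
score = profile_correlation(group_profile, document_profile)
```

The positive result indicates that, on balance, this document matches the group's interests even though it is partly classified under a disfavored category.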
In some embodiments, when a search engine generates a search result in response to a search query, a candidate document D that satisfies the search query is assigned a query score, QueryScore, in accordance with the search query. This query score is then modulated by document D's page rank, PageRank, to generate a generic ranking score, GenericScore, that is expressed as
GenericScore=QueryScore*PageRank.
This generic ranking score may not appropriately reflect document D's relevance to a particular group of users if the users' interest is measurably different from that of a random user of the search engine. The relevance of document D to the users can be characterized by a set of boost factors, based on the correlation between document D's content and the group's term-based profile, herein called the TermBoostFactor, the correlation between one or more categories associated with document D and the group's category-based profile, herein called the CategoryBoostFactor, and the correlation between the URL and/or host of document D and the group's link-based profile, herein called the LinkBoostFactor. Therefore, document D may be assigned a group-dependent ranking score that is a function of both the document's generic ranking score and the various group profile-based boost factors. In one embodiment, this group-dependent ranking score can be expressed as:
GroupScore=GenericScore*(TermBoostFactor*CategoryBoostFactor*LinkBoostFactor),
where a BoostFactor of 1.0 does not modify the generic score, and a value above or below 1.0 increases or reduces the generic score, respectively. The relative importance of each type of profile or boost factor is implemented by controlling the range of values that are allowed for a given type of boost factor. For example, a boost factor having a range of 0.1 to 10 has more importance in determining the GroupScore than a boost factor having a range of 0.5 to 2.0.
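A short sketch of the multiplicative form, with each boost factor limited to an allowed range, is given below. The specific ranges assigned to each factor type and the helper names `clamp` and `group_score` are hypothetical; the point is only that a wider allowed range gives a factor more influence on the result.

```python
def clamp(value, low, high):
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(high, value))

# Hypothetical allowed ranges per boost-factor type.
BOOST_RANGES = {"term": (0.5, 2.0), "category": (0.1, 10.0), "link": (0.5, 2.0)}

def group_score(generic_score, boosts):
    """GroupScore = GenericScore * product of range-limited boost factors.

    A missing factor defaults to 1.0, which leaves the score unchanged.
    """
    score = generic_score
    for kind, (low, high) in BOOST_RANGES.items():
        score *= clamp(boosts.get(kind, 1.0), low, high)
    return score

# The link boost of 3.0 is limited to 2.0 by its allowed range.
score = group_score(10.0, {"term": 1.5, "category": 4.0, "link": 3.0})
```

Here the result is 10.0 × 1.5 × 4.0 × 2.0 = 120.0, and the category factor dominates because its range is the widest.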
In other implementations, a linear combination of boost factors is used:
GroupScore=GenericScore*(Wterm*TermBoostFactor+Wcategory*CategoryBoostFactor+Wlink*LinkBoostFactor)
or
GroupScore=GenericScore*f(Wterm*TermBoostFactor+Wcategory*CategoryBoostFactor+Wlink*LinkBoostFactor),
where the weights (Wterm, Wcategory, Wlink) are assigned so that the value in parentheses in the above equations is equal to about 1.0 if the document is to be neither promoted nor demoted in rank for the group, above 1.0 if the document should be promoted, and below 1.0 if the document should be demoted. The greater the deviation from 1.0, the stronger the promotion or demotion of the document. The function f( ) in the last of the equations shown above can be a non-linear function (e.g., f(x)=x^n where n is any suitable real value) that emphasizes or de-emphasizes the deviation from 1.0 of the value in the parentheses. In some embodiments the function f( ) is a transform function used to normalize the linear combination of boost factors to a range that is suitable for a combined boost factor. For example, the argument (input value) of the function f( ) may range from −1 to 1 (or any other suitable input range), while the value produced by the function f( ) ranges from 0.2 to 2.0 (or any other suitable output range). The f( ) portion of the above equation (i.e., the value produced by applying the function) corresponds to BoostFactor in the equation in the following paragraph.
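The normalizing variant of f( ) can be illustrated with a simple transform. The sketch below assumes a piecewise-linear mapping from the input range −1 to 1 onto the output range 0.2 to 2.0, with an input of 0 mapped to the neutral value 1.0; the text fixes only the example ranges, so the shape of the mapping, the weights, and the function names are assumptions.

```python
def normalize_boost(x, out_low=0.2, out_high=2.0):
    """Map a combined value in [-1, 1] to a boost in [out_low, out_high].

    Piecewise linear so that 0 maps to 1.0 (no promotion or demotion).
    """
    x = max(-1.0, min(1.0, x))
    if x >= 0:
        return 1.0 + x * (out_high - 1.0)
    return 1.0 + x * (1.0 - out_low)

def group_score_linear(generic_score, boosts, weights):
    """GroupScore = GenericScore * f(weighted sum of boost factors)."""
    combined = sum(weights[kind] * boosts[kind] for kind in weights)
    return generic_score * normalize_boost(combined)

# Hypothetical weights and per-profile correlation values in [-1, 1].
weights = {"term": 0.5, "category": 0.3, "link": 0.2}
boosts = {"term": 0.4, "category": -0.2, "link": 0.0}
score = group_score_linear(10.0, boosts, weights)
```

In this example the weighted sum is 0.14, which the transform turns into a combined boost of 1.14, a mild promotion of the document.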
In another embodiment, in which the group has a single profile, the group-dependent ranking score can be expressed as:
GroupScore=GenericScore*BoostFactor
where the “BoostFactor” is based on the correlation between document D's content and the group's profile.
Once search results have been identified, the relevance of each information item to the user is determined based on the groups associated with the user. For example, the search result ranker 126 computes an adjusted score or group score for each document based on the document profile and the groups' profiles (1340). The adjusted score for each document k is a function of the GenericScore (as defined above), GroupProfiles(i, j), GroupConfs(i), ProfileConfs(j), and DocumentProfile(k) for all groups i and profiles j where i, j, and k are whole numbers. For example, group profiles GroupProfiles(i, j) as depicted in
AdjustedScore(k)=DocumentProfile(k)*[GenericScore*Σ_i GroupConfs(i)*Σ_j[ProfileConfs(j)*GroupProfiles(i, j)]]
where Σ_i indicates summation over all groups i and Σ_j indicates summation over all profiles j.
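The adjusted-score expression can be sketched in code under a few stated assumptions: document and group profiles are represented as sparse vectors (dictionaries keyed by category), the product between a document profile and a group profile is interpreted as a dot product, and profile confidence values are stored per group. The names `dot` and `adjusted_score` and all numeric values are hypothetical.

```python
def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(value * b[key] for key, value in a.items() if key in b)

def adjusted_score(generic_score, document_profile, groups):
    """Adjusted score for one document.

    `groups` is a list of (group_conf, profiles) pairs; `profiles` is a list
    of (profile_conf, group_profile) pairs belonging to that group.
    """
    total = 0.0
    for group_conf, profiles in groups:
        inner = sum(
            profile_conf * dot(document_profile, group_profile)
            for profile_conf, group_profile in profiles
        )
        total += group_conf * inner
    return generic_score * total

doc = {"science": 0.8, "business": 0.4}
groups = [
    # Group 1: confidence 1.0, one profile.
    (1.0, [(1.0, {"science": 0.6, "business": -0.2})]),
    # Group 2: confidence 0.5, two profiles.
    (0.5, [(1.0, {"science": 0.2}), (0.5, {"business": 0.4})]),
]
score = adjusted_score(10.0, doc, groups)
```

Group 1 contributes 0.40 and group 2 contributes 0.5 × (0.16 + 0.08) = 0.12, so the adjusted score is 10.0 × 0.52 = 5.2. Documents would then be sorted by this value.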
Once the adjusted score for each information item in the search result is computed, the information items are ranked accordingly (1350), and then information identifying at least a portion of the ranked search results (e.g., the top N ranked items, where N is a suitable integer) is provided to the user (1360). For example, the search engine and search ranker may produce over one thousand ranked search results, but the user may be sent a smaller number of top ranked results, where the number is between ten and 260, or between ten and 130.
Referring to
In some embodiments, the information server 106 may not have access to all the search history associated with a website. For example, there may be an agreement between a website 102 and the information server 106 with respect to the search queries submitted from the website 102. According to the agreement, when a user visiting the website 102 submits a search query to the information server 106, the information server 106 is required to send the corresponding search result to the website 102 rather than the requesting user at a client 103. The website 102 may modify the search result, e.g., attaching advertisements or other information to the search result, and then serve the modified search result to the requesting user at the client 103.
In this scenario, the information server 106 may have no information identifying the requesting user and the client 103, and may also be unable to monitor the user's activities on the search result. For example, the information server 106 may not receive any information identifying the document links in the search result that have been clicked by the user. Similarly, the information server 106 may not receive any information identifying the document links over which the user moves his or her mouse pointer and the corresponding mouse hovering time. In other words, the information server 106 has very limited or no exposure to the activities of the website users on the search results. Therefore, the information server 106 has to rely on the user activities on search results from other venues to generate the group profile.
In some embodiments, by examining the search queries submitted from different websites, the information server 106 may identify another website similar to the website in question. Two websites are deemed similar if a predefined number or percentage of search queries submitted from the two websites is identical. It is also reasonable to infer that users of the two similar websites may have similar interests and therefore the user activities associated with one website are a reasonable proxy of the user activities associated with the other one. If the information server 106 can access the user activities associated with one of the two websites (e.g., there is no agreement to deliver the search results to the website), the information server 106 can use the same user activities to create the group profile for the other website.
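The similarity test described above could be sketched as follows. The normalization of queries (lower-casing and collapsing whitespace), the 50% cutoff, and the decision to measure the shared fraction against the smaller of the two query sets are assumptions chosen for illustration.

```python
def normalize(query):
    """Lower-case a query and collapse runs of whitespace."""
    return " ".join(query.lower().split())

def are_similar(queries_a, queries_b, min_shared_fraction=0.5):
    """Deem two websites similar when enough of their distinct queries coincide.

    The shared fraction is measured against the smaller query set.
    """
    a = {normalize(q) for q in queries_a}
    b = {normalize(q) for q in queries_b}
    if not a or not b:
        return False
    shared = len(a & b)
    return shared / min(len(a), len(b)) >= min_shared_fraction

similar = are_similar(
    ["Golf Clubs", "golf courses"],
    ["golf clubs", "tee times", "golf  courses"],
)
```

When the test succeeds, the monitored user activities of the accessible website could stand in for those of the website whose activities cannot be observed.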
When there is no other website similar to the website in question, the information server 106 may utilize monitored user activities associated with search queries submitted directly to the search engine (e.g., search queries submitted using a toolbar search box or a webpage associated with the information server 106) as the proxy of a particular website. For instance, the search query “golf courses in mountain view” may be submitted both to a golf-focused website, and to a general purpose search engine. Profile information developed from clicks on the search results of this search query is used to generate a group profile by combining or aggregating statistical information related to the queries received from each respective website.
Placed content may be displayed to users of search services, email services, and a variety of other services provided via the Internet or other wide area networks. For example, when search results are returned to a user in response to a search query, certain placed content is often returned as well. Placed content is usually in the form of advertising, but could be any type of content related to the search query or to a document being sent to the user. Generally, placed content is any type of content where content providers compete or pay for placement. The techniques discussed above for selecting and ranking information items can also be used for selecting and/or ranking placed content to be presented to users. In particular, in some embodiments, group profiles are used to select advertisements or other placed content to be presented to users along with search results. For example, different advertisements may have different sets of key terms. A correlation of the key terms of each advertisement in a set of advertisements with a term-based group profile (or a category-based profile, or both) associated with a group of users produces a boost factor for the advertisement. This boost factor may be used to promote or demote the particular advertisement in response to a search query submitted by a user associated with the group. For example, when the information server 106 receives a search query “world cup 2006” from a member of a group that is positively weighted in the soccer category, it may promote those advertisements covering soccer gear, ticket sales for the 2006 FIFA World Cup Germany, hotel reservations in the German cities hosting the soccer games, etc.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.
This application is a Continuation in Part of U.S. Patent Application Ser. No. 11/394,620, filed Mar. 30, 2006, entitled “Website Flavored Search,” which is hereby incorporated by reference. This application is related to U.S. patent application Ser. No. 10/890,854, filed Jul. 13, 2004, entitled “Personalization of Placed Content Ordering in Search Results,” which is hereby incorporated by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11394620 | Mar 2006 | US |
Child | 11675057 | Feb 2007 | US |