1. Field of the Invention
The present invention relates to data mining algorithms for detecting associations between search criteria and item categories or attributes. The results of the analysis may, for example, be used to select item categories or groupings to suggest to a user based on search criteria supplied by the user.
2. Description of the Related Art
Web sites that provide access to databases of items commonly include a hierarchical browse structure or “browse tree” in which the items are arranged within a hierarchy of item categories. The lowest level categories contain the items themselves, while categories at higher levels contain other categories. The items arranged within the browse tree may include, for example, products that are available to purchase or rent, files that are available for download, other web sites, movies, auctions, classified ads, businesses, or any combination thereof.
Some web sites direct users to specific categories of their browse trees based on search queries submitted by users. For example, if a user submits the search query “laptop computer,” the search results page may include a link to an associated browse tree category such as “portable computers” or “laptop and notebook computers.” To implement this feature, an operator of the web site typically generates a look-up table that maps specific search strings to the item categories believed to be the most closely associated with such search strings. The task of manually generating these mappings, however, tends to be very tedious and time consuming, especially if the browse tree is very large (e.g., many hundreds or thousands of categories and many thousands or millions of items). In addition, because the mappings are typically based on the web site operator's perception of which categories are the most closely related to specific search strings, the mappings tend to be inaccurate.
The present invention provides a system and associated methods for automatically detecting associations between specific sets of search criteria, such as search strings, and specific item categories or attributes. The invention may be embodied within a web site or other database access system that provides access to a database in which items are arranged or arrange-able within item categories, such as but not limited to browse categories of a hierarchical browse structure. The items may, for example, include web sites and pages, physical products, downloadable content, and other types of items that can be represented within a database and organized into categories. The detected associations are preferably used to suggest specific item categories to users on search results pages.
In a preferred embodiment, actions of users of the system are monitored over time to generate user activity data reflective of searches, item selection actions, and possibly other types of user actions. A correlation analysis component collectively analyzes the user activity data to automatically identify associations between specific search criteria and specific item categories or attributes. For example, the correlation analysis component may treat a particular search string and a particular item category as related if a relatively large percentage of the users who submitted the search string also selected an item falling within the particular item category. Any one or more different types of item selection actions (item viewing events, purchases, downloads, etc.) may be taken into consideration in performing the analysis. In addition, the analysis may take into consideration whether a user's selection of an item was likely the result of a particular search performed by the user.
Neither this summary nor the following detailed description purports to define the invention. The invention is defined by the claims.
A specific embodiment of the invention will now be described with reference to the drawings. This embodiment is intended to illustrate, and not limit, the present invention. The scope of the invention is defined by the claims.
I. System Overview
The items included or represented in the database 35 may, for example, include physical products that can be purchased or rented, digital products (e.g., journal articles, news articles, music files, video files, software products, etc.) that can be purchased and/or downloaded by users, web sites represented in an index or directory, subscriptions, and other types of items that can be stored or represented in a database. Many millions of different items and many hundreds or thousands of different item categories may be represented within the item database 35. Although a single item database 35 is shown, the database 35 may be implemented as a collection of distinct databases, each of which may store information about different types or categories of items.
The item categories preferably include or consist of browse categories used to facilitate navigation of an electronic catalog of items. For example, as depicted in
As depicted by the query server 38 in
When a user submits a search query, the web server 32 passes the search query to the query server 38, which generates and returns a list of the items that are responsive to the search query. As is conventional, the query server 38 may use a keyword index (not shown) to search the item database 35 for responsive items. In addition to obtaining the list of responsive items, the web server 32 accesses a mapping table 40 that maps specific sets of search criteria, such as specific search terms and/or search phrases, to the item categories most closely related to such search criteria. If a matching table entry is found, the web server 32 displays some or all of the related item categories on the search results page together with the responsive items (see
In the preferred embodiment, when a user selects an item on a search results page or a browse node page (i.e., a category page of the browse tree 36), the web server 32 returns an item detail page (not shown) for the selected item. The item detail page includes detailed information about the item, such as a picture and description of the item, a price, and/or user reviews of the item. The item detail page may also include links for performing such selection actions as adding the item to a personal shopping cart or wish list, purchasing the item, downloading the item, and/or submitting a rating or review of the item. The web server 32 preferably generates the various pages of the web site, including the item detail pages, search results pages, and browse node pages, using templates stored in a database of web page templates 39.
II. Automated Detection of Associations between Search Criteria and Item Categories
An important aspect of the system 30 is that the search criteria/item category associations reflected in the mapping table 40 are detected automatically by collectively analyzing user activity data reflective of search query submissions and item selection actions performed by a population of users, which may include many thousands or millions of users. This is accomplished in part by maintaining a database 42 or other repository of user activity data reflective of search query submissions and item selection actions performed by users of the system.
To detect correlations between specific search criteria and item categories, a correlation analysis component 44 periodically analyzes sets or segments of this user activity data to search for correlations. For example, the correlation analysis component 44 may treat the search string “Java” and the item category “books>computer languages” as being related if a large percentage of the users who searched for “Java” within a given time period also selected an item falling within the books>computer languages category within this same time period. The analysis may also take into consideration the categories explicitly selected by users during navigation of the browse tree. For example, the correlation analysis may detect that a large percentage of the users who searched for “socks” also selected the brand-based category “apparel>Foot Locker,” and treat the two as related as a result. The correlation analysis component 44 may be implemented as a program that is executed periodically by an off-line computer system.
The use of an automated computer process to detect the search criteria/item category associations provides a number of important benefits. One such benefit is that mappings for many thousands of different sets of search criteria can be generated with very little or no human intervention. For example, mappings may be generated for each of the 5K (5×1024) or 10K most commonly entered search strings. Another benefit is that the mappings tend to be very accurate, as they reflect the actual browsing patterns of a large number of users. An additional benefit is that the mappings can evolve automatically over time as new items and item categories are added to the database 35, and as search and browsing patterns of users change.
As depicted in
The event data recorded for an item selection action may, for example, include the ID of the selected item, an ID of the user or user session, and an event time stamp. Other types of item-selection event data that may be recorded, and used to detect the associations, may include the following: the type of selection action performed (e.g., selection of item for viewing, selection of item to download, shopping cart add, purchase, submission of review or rating, etc.), and the type of page from which the item selection was made (e.g., search results page, browse node page, etc.). The type or types of item selection actions that are recorded within the user activity database 42 and used to detect the associations may vary depending upon the nature of the web site (e.g., web search engine site, retail sales site, digital library, music download site, product reviews site, etc.). If multiple different types of item selection actions are recorded, the correlation analysis component 44 may optionally accord different weights to different types of selection actions. In addition to item selection events, other types of events, such as category selection events, may be recorded within the user activity database 42 and used to detect the associations.
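The optional weighting of different selection action types can be illustrated with a minimal Python sketch. The action names, the weight values, the event tuple layout, and the function `weighted_category_activity` are assumptions made for illustration only; the description does not prescribe any particular weights or data format.

```python
from collections import defaultdict

# Hypothetical weights per selection-action type; the values are illustrative only.
ACTION_WEIGHTS = {"view": 1.0, "cart_add": 2.0, "purchase": 3.0}


def weighted_category_activity(events, item_to_category, weights=ACTION_WEIGHTS):
    """Sum action weights for each (user, category) pair.

    events: iterable of (user_id, item_id, action_type) tuples (assumed layout).
    item_to_category: dict mapping an item ID to a single category name
        (a simplification; an item may belong to several categories).
    Events with an unknown item or an unknown action type contribute nothing.
    """
    totals = defaultdict(float)
    for user_id, item_id, action in events:
        category = item_to_category.get(item_id)
        if category is None:
            continue
        totals[(user_id, category)] += weights.get(action, 0.0)
    return dict(totals)


events = [
    ("u1", "item1", "view"),
    ("u1", "item1", "purchase"),
    ("u2", "item2", "view"),
]
item_to_category = {"item1": "books", "item2": "music"}
activity = weighted_category_activity(events, item_to_category)
```

In this sketch a purchase counts three times as much as a view, so a downstream analysis could use the summed weights in place of simple per-user counts.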
The event histories may be stored within the user activity database 42 in any of a variety of possible formats. For example, the web server 32 may simply maintain a chronological access log that describes some or all of the client requests it receives. A most recent set of entries in this access log may periodically be retrieved by the correlation analysis component 44 and parsed for analysis. Alternatively, the event data may be written to a database system that supports the ability to retrieve event data by user, event type, event date and time, and/or other criteria; one example of such a system is described in U.S. patent application Ser. No. 10/612,395, filed Jul. 2, 2003, the disclosure of which is hereby incorporated by reference. Further, different databases and data formats may be used to store information about different types of events (e.g., search query submissions versus item selection actions).
For purposes of analysis, the user activity data (event histories) stored in the database 42 may be divided into segments, each of which corresponds to a particular interval of time such as one day or one hour. The correlation analysis component 44 may analyze each such segment of activity data separately from the others. The results of these separate analyses may be combined to generate the mappings reflected in the mapping table 40, optionally discounting or disregarding the results of less recent segments of activity data. For example, correlation results files for the last X days (e.g., two weeks) of user activity data may be combined to generate a current set of mappings, and this set of mappings may be used until the next segment of user activity data is processed to generate new mappings. An example of an algorithm that may be used to analyze the user activity data is depicted in
Each entry in the mapping table 40 maps a specific set of search criteria, such as a specific search term or search phrase, to a list of the N item categories that are the most closely related to that set of search criteria, where N is a selected number such as ten, twenty or fifty. (A “set” of search criteria, as used herein, can consist of a single element of search criteria, such as a single search term.) For each category in this list, the table may also include a “correlation score” that indicates a degree to which the category is associated with the corresponding set of search criteria. In the illustrated example, the scores can range from 0 to 1, with a score of “0” indicating a minimal degree of correlation and a score of “1” indicating a maximum degree of correlation. The first sample table entry shown in
The mapping table 40 may, for example, include a separate entry for each of the M (e.g., 5K or 10K) search strings that were used the most frequently over a selected period of time. Search strings that are highly similar, such as those that are identical when capitalization, noise words (“a,” “the,” “an,” etc.), and punctuation variations are ignored, may be treated as the same search string for purposes of generating the table 40. The mapping table 40 may be implemented using any type of data structure, or combination of data structures, that permits efficient look-up of categories. One example of a type of data structure that may be used is a hash table.
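As one possible illustration of the normalization and hash-table look-up just described, the following Python sketch collapses capitalization, punctuation, and noise words before using the result as a dictionary key. The function names, the noise-word list, and the sample scores are hypothetical and are not taken from the described system.

```python
import string

# Assumed noise-word list; a real deployment would likely use a longer one.
NOISE_WORDS = {"a", "an", "the"}


def normalize_search_string(query):
    """Lowercase the query, strip ASCII punctuation, and drop noise words."""
    no_punct = query.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in NOISE_WORDS)


# A Python dict is a hash table; each entry maps a normalized search string
# to a list of (category, correlation score) pairs. Scores are made up.
mapping_table = {
    "laptop computer": [
        ("laptop and notebook computers", 0.82),
        ("portable computers", 0.64),
    ],
}


def lookup_categories(query, table=mapping_table):
    """Return the related categories for a query, or an empty list if none."""
    return table.get(normalize_search_string(query), [])
```

With this sketch, `lookup_categories("The Laptop, Computer!")` and `lookup_categories("laptop computer")` resolve to the same table entry.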
Although the mapping table 40 depicted in
It should be noted that the item categories included in the mappings need not consist of browse categories that are ordinarily used to browse the catalog of items, but rather may include specific item attributes that may be used to form a grouping of items. For instance, a particular search string may be mapped to a particular product brand (one example of a product attribute), even though the web site's browse interface does not support browsing of the catalog by brand. Thus, for example, when a user searches for “PDA,” the user may be given an option to view all products from “Palm” and “Mindspring,” even if the system's browse tree does not include links for either of these brands. Accordingly, any group of items that share a common attribute (e.g., author=Clark) may be treated as an item category for purposes of implementing the invention. In this regard, a category may be represented within the mapping table 40 as a particular attribute (e.g., brand=Sony) or attribute set (e.g., type=video and rating=G), rather than by a category name or ID.
In block 64, the retrieved selection event data is used to generate a temporary table 64A that maps users to the item categories “accessed” by such users. For purposes of generating this table, a selection of an item that falls within a given category may be treated as an access to that category. The type or types of item selection actions taken into consideration in determining whether a user “accessed” a given category is a matter of design choice, and may vary depending on the type of items involved. For instance, for a category of merchandise items, the category may be treated as accessed if the user purchased, added to a shopping cart, added to a wish list, or even viewed an item falling within that category. For a category of web sites listed in a web site directory, the category may be treated as accessed if, for example, the user selected a link within the directory to access a web site within that category. For a category of news or journal articles, the category may be treated as accessed if, for example, the user viewed or downloaded the full text of an article within that category. For browse categories, a category may also optionally be treated as accessed if the user selected the category itself during navigation of a browse tree to view a corresponding category page; in this regard, a browse category may, in some embodiments, be treated as accessed only if the user actually selected the browse category itself.
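The construction of the user-to-category table in block 64 can be sketched as follows in Python. The event tuple layout, the set of action types treated as an "access", and the name `build_user_category_table` are assumptions chosen for illustration; as noted above, the qualifying action types are a matter of design choice.

```python
from collections import defaultdict

# Assumed design choice: these action types count as "accessing" a category.
ACCESS_ACTIONS = {"view", "cart_add", "wish_list_add", "purchase"}


def build_user_category_table(selection_events, item_categories,
                              access_actions=ACCESS_ACTIONS):
    """Map each user to the set of categories that user accessed.

    selection_events: iterable of (user_id, item_id, action_type) tuples.
    item_categories: dict mapping an item ID to the categories it falls within.
    """
    table = defaultdict(set)
    for user_id, item_id, action in selection_events:
        if action not in access_actions:
            continue  # e.g., a review submission is ignored in this sketch
        for category in item_categories.get(item_id, ()):
            table[user_id].add(category)
    return dict(table)


selection_events = [
    ("u1", "java_book", "view"),
    ("u1", "socks", "review"),
    ("u2", "java_book", "purchase"),
]
item_categories = {
    "java_book": ["books", "books>computer languages"],
    "socks": ["apparel"],
}
user_categories = build_user_category_table(selection_events, item_categories)
```

Here an item contributes every category it falls within, so viewing one book marks both "books" and "books>computer languages" as accessed.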
In block 66, the temporary search string table 62A is used to identify search strings that are “popular.” A given search string may be treated as popular if, for example, it was submitted by more than a selected threshold of users (e.g., ten) over the relevant time interval. In block 68, the temporary tables 62A, 64A are used to count, for each (popular search string, item category) pair, the number of users in common (i.e., the number that both submitted the string and accessed the category during the relevant time period). The results of this task are depicted by the preliminary mapping table 68A in
In block 70, a correlation score is calculated for each (popular string, item category) pair. The equation shown below may be used for this purpose, in which “CS” stands for “correlation score:”
CS(string, category)=C/SQRT(A·B)
where:
The correlation score is a measure of the degree to which the particular search string and item category are related. Any of a variety of other equations or algorithms may be used to calculate the correlation scores. The following are examples:
Cosine Method:
CS(string, category)=C/SQRT(A·B)
where:
Relative Risk Method:
CS=(A/B)/(C/D)
where:
Odds Ratio Method:
CS=(A/C)/(E/F)
where:
Probability Lift Method:
alpha=32*log(frequency-of-use rank of B)−84
CS=C/B−(alpha)*A/D
where:
Weighted method: The above mentioned scores can be combined in a variety of ways to produce a weighted average of multiple scores. For example:
ΣWiCSi
where W is a weighting function for each correlation score, CS is the correlation score itself, and ΣWi=1. For example, we could combine the Cosine and Probability Lift methods as follows:
CS=w*(Cosine Method)+(1−w)*(Probability Lift Method)
where w is a weighting factor such as 0.20.
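The counting step and the cosine method can be sketched together in Python. This sketch assumes that A is the number of users who submitted the search string, B is the number of users who accessed the category, and C is the number of users in common; that reading is inferred from the counting step of block 68 and is labeled here as an assumption. The function names and the sample data are hypothetical.

```python
import math


def invert(user_to_values):
    """Turn {user: {value, ...}} into {value: {user, ...}}."""
    value_to_users = {}
    for user, values in user_to_values.items():
        for value in values:
            value_to_users.setdefault(value, set()).add(user)
    return value_to_users


def cosine_scores(user_strings, user_categories, popularity_threshold=10):
    """Score each (popular string, category) pair as C / sqrt(A * B).

    Assumed meanings: A = users who submitted the string, B = users who
    accessed the category, C = users who did both.
    """
    string_users = invert(user_strings)
    category_users = invert(user_categories)
    scores = {}
    for s, s_users in string_users.items():
        if len(s_users) <= popularity_threshold:
            continue  # string is not "popular"; skip it
        for c, c_users in category_users.items():
            common = len(s_users & c_users)
            if common:
                scores[(s, c)] = common / math.sqrt(len(s_users) * len(c_users))
    return scores


def combine_scores(scores, weights):
    """Weighted sum of several correlation scores; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * cs for w, cs in zip(weights, scores))


user_strings = {"u1": {"java"}, "u2": {"java"}, "u3": {"java"}, "u4": {"socks"}}
user_categories = {"u1": {"books"}, "u2": {"books"},
                   "u3": {"apparel"}, "u4": {"apparel"}}
scores = cosine_scores(user_strings, user_categories, popularity_threshold=2)
```

In the sample data, "java" was submitted by three users, two of whom accessed "books", giving a score of 2/SQRT(3·2); "socks" falls below the popularity threshold and receives no scores.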
In block 72, for each popular string, the list of categories (CAT_A, CAT_B, CAT_C . . . ) is sorted from highest to lowest correlation score, or equivalently, from highest to lowest degree of association with the particular search string. In addition, each such list of categories is truncated to a fixed maximum length (e.g. ten categories), so that only those categories most closely related to the particular search string are retained in each list. The result of block 72 is a set of string-to-category mappings of the form shown in
As will be apparent from the foregoing description of
Another variation is to limit the analysis to the detection of associations between specific search terms (keywords) and item categories. With this approach, each entry in the mapping table 40 corresponds uniquely to a specific search term. If a user submits a search query containing two or more search terms, the mapping table entries (category sets) for each of these search terms may be used in combination to identify item categories to suggest to the user, such as by taking the intersection of these category sets.
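A minimal Python sketch of this keyword-level variation follows. The table contents and the function name `suggest_for_terms` are hypothetical, and skipping terms that have no table entry is an assumption of the sketch rather than something the description specifies.

```python
def suggest_for_terms(query, term_table):
    """Intersect the category sets of the query's individual search terms.

    term_table: dict mapping a single search term to its related categories.
    Terms with no table entry are skipped (an assumption of this sketch).
    """
    terms = query.lower().split()
    category_sets = [set(term_table[t]) for t in terms if t in term_table]
    if not category_sets:
        return set()
    return set.intersection(*category_sets)


term_table = {
    "hiking": ["outdoors", "travel", "books"],
    "california": ["travel", "books", "wine"],
}
suggested = suggest_for_terms("California hiking", term_table)
```

Other combination rules, such as taking the union and ranking by summed score, would fit the same table layout.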
Other types of relatedness metrics may also be taken into consideration when generating the mapping table 40. For instance, the correlation data generated by analyzing the user activity data may be combined with the results of an automated content-based analysis in which the search strings are compared to item records or descriptions in the database 35. Thus, the mappings reflected in the mapping table 40 need not be based exclusively on an analysis of user activity data.
III. Use of Mapping Table to Supplement Search Results Pages
If a match is found in block 84, the associated list of item categories is retrieved from the mapping table 40. As depicted in block 90, this list may optionally be filtered to remove certain types of categories (e.g., all but top-level categories), and/or to filter out those categories having a correlation score that falls below a desired threshold. Some or all of the categories in this list are then incorporated into the search results page (block 94), together with a list of any responsive items.
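The optional filtering of block 90 might look like the following Python sketch. The threshold value, the result cap, and the convention that a ">" in a category name marks a non-top-level category are assumptions made for illustration.

```python
def filter_categories(scored_categories, min_score=0.3,
                      top_level_only=False, max_results=5):
    """Filter and order a mapping-table entry before display.

    scored_categories: list of (category name, correlation score) pairs.
    A ">" in the name is assumed to mark a category below the top level.
    """
    kept = []
    for name, score in scored_categories:
        if score < min_score:
            continue  # correlation too weak to suggest
        if top_level_only and ">" in name:
            continue  # keep only top-level categories
        kept.append((name, score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_results]


entry = [("books>computer languages", 0.9), ("books", 0.5), ("apparel", 0.1)]
shown = filter_categories(entry)
top_level = filter_categories(entry, top_level_only=True)
```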
The second section 102 in
Yet another approach, which is not illustrated in the drawings, is to arrange the search results (matching items) by item category on the search results page, with the item categories being ordered from highest to lowest degree of association with the search string. To facilitate viewing of results from multiple categories, a limited number of matching items (e.g. 3, 4 or 5) may be displayed on the search results page within each such item category.
IV. Tracking of Category Selection Actions on Search Results Pages
One optional feature of the invention is to track the frequency with which users select specific categories displayed on the search results pages. This data may be used as an additional or alternative metric to select the related categories to display on a given search results page, and/or to select the order in which these related categories are displayed. For instance, referring to
To implement this feature, the web server 32, or a component that runs on or in conjunction with the web server 32, may store within the mapping table 40 the following information for each search string/related category pair: (a) the number of times this pair was displayed on a search results page (i.e., the number of impressions), and (b) the number of times the display of this pair resulted in user selection of the particular category (i.e., the number of clicks). The impressions and clicks values may be updated in real time as pages are served, or may be derived from an off-line analysis of user activity data. Rather than storing the actual impressions and clicks counts for each search string/related category pair, the ratio of these two values may be stored, particularly if some threshold number of impressions has been reached.
When a user conducts a search, the related categories stored in the mapping table 40 for the submitted search string may be ordered/ranked for display from highest to lowest clicks-to-impressions ratio. For example, for the search string “California Hiking Trails” shown in
This feature of the invention may also be used in embodiments in which the mapping table 40 maps more generalized sets of search criteria to related categories.
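The ordering of related categories by clicks-to-impressions ratio can be sketched in Python as follows. The category names, the counts, the impression threshold, and the choice to leave under-sampled categories in their original order at the end of the list are all assumptions of this sketch.

```python
def rank_by_click_through(stats, min_impressions=100):
    """Order categories from highest to lowest clicks-to-impressions ratio.

    stats: dict mapping a category to an (impressions, clicks) pair.
    Categories with fewer than min_impressions impressions are not ranked by
    ratio; they are appended afterwards in their original order (assumption).
    """
    ranked, unranked = [], []
    threshold = max(min_impressions, 1)  # also guards against division by zero
    for category, (impressions, clicks) in stats.items():
        if impressions >= threshold:
            ranked.append((category, clicks / impressions))
        else:
            unranked.append(category)
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return [category for category, _ in ranked] + unranked


# Hypothetical counts for one search string's related categories.
stats = {
    "travel guides": (1000, 50),
    "maps": (500, 100),
    "outdoor gear": (10, 9),
}
order = rank_by_click_through(stats)
```

In this example "maps" ranks first with a ratio of 0.2, "travel guides" follows at 0.05, and "outdoor gear" is listed last because its ten impressions fall below the threshold.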
Although this invention has been described in terms of certain preferred embodiments and applications, other embodiments and applications that are apparent to those of ordinary skill in the art, including embodiments which do not provide all of the features and advantages set forth herein, are also within the scope of this invention. Accordingly, the scope of the present invention is defined only by the appended claims, which are intended to be interpreted without reference to any explicit or implicit definitions that may be set forth in the incorporated-by-reference materials.