Marketing on the World Wide Web (the Web) is a significant business. Users often purchase products through a company's Website. Further, advertising revenue can be generated in the form of payments to the host or owner of a Website when users click on advertisements that appear on the Website. The amount of revenue earned through Website advertising and product sales may depend on a Website's ability to attract visitors and develop a loyal base of returning visitors. Often, the ability to attract a visitor to a particular Website depends on the organization of the Website and whether the user is able to effectively navigate the Website to locate relevant information or products.
Certain exemplary embodiments are described in the following detailed description and in reference to the drawings, in which:
Exemplary embodiments of the present invention provide techniques for delivering personalized Web page content that more closely represents the interests of a visitor to a Web page. As used herein, the term “exemplary” merely denotes an example that may be useful for clarification of the present invention. The examples are not intended to limit the scope, as other techniques may be used while remaining within the scope of the present claims. The techniques disclosed herein can improve a Website experience by personalizing the appearance and content of the Website, which may lead to increased traffic and, thus, revenue for the Website.
In exemplary embodiments of the present invention, cluster information is generated and used to provide a cluster type or a vocabulary of possible user interests for a user identifier (user ID) that is used to access one or more Websites. A user ID is a unique identifier used to identify a particular system used to access a Website, for example, an IP address, a user name, and the like. The cluster information may be generated by statistically processing a database of Web activity, for example, a list of search queries performed on one or more search engines from one or more different user IDs. The resulting cluster information provides groupings of Websites and groupings of words that pertain to the Websites. The groupings, referred to herein as “clusters,” may be used to characterize the content of individual Websites in terms of the interests of users that visit those Websites. Each cluster represents a unique cluster type and may be assigned a unique cluster-type descriptor.
Cluster types corresponding to the interests of a particular user are determined by accesses of a particular Website by that user's user ID. These accesses are stored in a user profile based on the prior Web activity from the user ID, such as prior search queries performed from the user ID. Upon accessing a selected Website, a determination may be made regarding which cluster types in the user profile relate to content available from the selected Website. If matching cluster types are detected, one or more of the cluster types may be sent to the Website. The Website may use the cluster types to customize the Website according to the interests indicated by accesses from the user ID.
Exemplary embodiments of the present invention enable a Website to receive relevant user interest information from a visitor while reducing the likelihood that extraneous or irrelevant user interest information of the visitor will also be received by the Website. Additionally, sending a cluster type to the Website rather than more detailed search query information may help to protect the privacy of Website visitors while still enabling the delivery of personalized Website content.
The client system 102 can have other units operatively coupled to the processor 112 through the bus 113. These units can include tangible, machine-readable storage media, such as a storage system 122 for the long term storage of operating programs and data, including the programs and data used in exemplary embodiments of the present techniques. The storage system 122 may also store a database of cluster information and a user profile generated in accordance with exemplary embodiments of the present techniques. Further, the client system 102 can have one or more other types of tangible, machine-readable storage media, such as a memory 124, for example, which may comprise read-only memory (ROM) and/or random access memory (RAM). In exemplary embodiments, the client system 102 will generally include a network interface adapter 126, for connecting the client system 102 to a network, such as a local area network (LAN 128), a wide-area network (WAN), or another network configuration. The LAN 128 can include routers, switches, modems, or any other kind of interface device used for interconnection.
Through the LAN 128, the client system 102 can connect to a business server 130. The business server 130 can have a storage array 132 for storing enterprise data, buffering communications, and storing operating programs for the business server 130. The business server 130 can have associated printers 134, scanners, copiers and the like. The business server 130 can access the Internet 110 through a connected router/firewall 136, providing the client system 102 with Internet access. Those of ordinary skill in the art will appreciate that business networks can be far more complex and can include numerous business servers 130, printers 134, routers 136, and client systems 102, among other units. Moreover, the business network discussed above should not be considered limiting as any number of other configurations may be used. For example, in other exemplary embodiments, the client system 102 can be directly connected to the Internet 110 through the network interface adapter 126, or can be connected through a router or firewall 136. Any system that allows the client system 102 to access the Internet 110 should be considered to be within the scope of the present techniques.
Through the router/firewall 136, the client system 102 can access a search engine 104 connected to the Internet 110. In exemplary embodiments of the present invention, the search engine 104 can include generic search engines, such as GOOGLE™, YAHOO®, BING™, and the like. The client system 102 can also access the Websites 106 through the Internet 110. The Websites 106 can have single Web pages, or can have multiple subpages 138. The Websites 106 can also provide search functions, for example, searching subpages 138 to locate products or publications provided by the Website 106. For example, the Websites 106 may include sites such as EBAY®, AMAZON.COM™, WIKIPEDIA™, CRAIGSLIST™, FOXNEWS.COM™, and the like. Further, one or more of the Websites 106 may be configured to receive information from a visitor to the Website, for example, from a unit located at a particular user ID, regarding interests of the user, and the Website may use the information to determine the content to deliver to the user ID.
The client system 102 may also access a database 144, which is connected to the Internet 110 and includes details of searches performed from a plurality of user IDs across a plurality of Websites. The search query data may be collected by an Internet service provider (ISP) or by the Website 106. Each search query record in the database 144 may include one or more search terms and an associated Website. The associated Website may be the Website that the user ID was accessing when the search was performed, or the associated Website may be the Website that the user ID accessed after performing the search. The database 144 may also include cluster information, which may be generated, at least in part, by an automated analysis of the search query data, as described below in reference to
The bag of words may be generated by any suitable technique. In one exemplary embodiment, a bag of words may be generated for each search term by using the original search term to perform a new search on a canonical search engine, such as YAHOO® or GOOGLE™. A specified number of the top ranked Web pages returned by the search may be accessed, and each word from each Web page may be added to the bag of words applicable for that search term. In exemplary embodiments of the present invention, the list of words from each Web page may be processed to eliminate common or unimportant words, such as “a,” “the,” “HTTP,” “Tag,” and the like. Further, frequency algorithms may be applied to select only a subset of the words if desired. Such algorithms may eliminate words that are used too few times in a site to be significant, for example, words that appear only once, twice, or a few times. In addition, techniques such as Porter stemming algorithms may be applied to eliminate common suffixes and further narrow the list.
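The filtering steps described above may be illustrated by the following minimal sketch. It assumes the text of the top ranked Web pages has already been retrieved. The stop list, the `min_count` threshold, and the function names are hypothetical, and the crude suffix stripping merely stands in for a Porter stemming algorithm; it is not one.

```python
import re
from collections import Counter

# Hypothetical stop list; the description names "a", "the", "HTTP", and "Tag" as examples.
STOP_WORDS = {"a", "the", "http", "tag", "and", "of", "to", "in"}

# Crude suffix stripping, standing in for a real Porter stemmer (assumption).
SUFFIXES = ("ing", "es", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(page_texts, min_count=2):
    """Build a bag of words from the text of the top ranked Web pages."""
    counts = Counter()
    for text in page_texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOP_WORDS:
                continue  # eliminate common or unimportant words
            counts[strip_suffix(token)] += 1
    # Eliminate words used too few times to be significant.
    return {word for word, n in counts.items() if n >= min_count}


pages = [
    "The car dealers list new cars and used cars",
    "Compare car prices and find a dealer",
]
bag = bag_of_words(pages)
```

In this sketch only the stems that occur at least twice across the pages survive, so words that appear once are dropped from the bag.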
Prior to performing the new search, the original search term may be expanded based on the Website associated with it. For example, if the original search query was performed at a Website of a book vendor, the search term used in the new search may be expanded by adding the word “book.” Similar rules can be constructed for domain-specific Websites. For example, highly targeted Websites may sell a particular category of products such as garden supplies, in which case the expansion is straightforward due to the limited number of possible terms. In other cases, a search at a Website that sells a wide array of products (for example, AMAZON.COM™) can be expanded based on the subsequent link that was clicked on from the search results page. Further, some Websites allow categorical searches and the knowledge of the category information leads to a natural way of expanding the search. Additionally, if the search query data includes the Website that was clicked on at the time of the original search, each word from that Web page may also be added to the bag of words.
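One possible form of such an expansion rule is sketched below. The site names, the `SITE_EXPANSIONS` table, and the `clicked_category` parameter are all hypothetical and are used only to illustrate the idea of preferring category information when it is available and falling back to a per-Website term otherwise.

```python
# Hypothetical mapping from a Website to the term its queries are expanded with.
SITE_EXPANSIONS = {
    "bookvendor.example": "book",
    "gardensupplies.example": "garden",
}


def expand_query(search_term, website, clicked_category=None):
    """Return the search term expanded with category or site-specific context."""
    # Prefer the category of the link clicked from the results page, if known.
    extra = clicked_category or SITE_EXPANSIONS.get(website)
    if extra and extra.lower() not in search_term.lower().split():
        return f"{search_term} {extra}"
    return search_term
```

For instance, `expand_query("dune", "bookvendor.example")` yields the expanded term "dune book", while a query that already contains the expansion word is returned unchanged.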
At block 204, cluster information is generated from the augmented search query data. The cluster information may be generated by automated analysis of the augmented search query data, for example, a statistical analysis such as clustering, co-clustering, information-theoretic co-clustering, and the like. In one exemplary embodiment of the present invention, the automated analysis includes loading the augmented search query data into a word/Website matrix and segmenting the words and Websites into clusters. The resulting cluster information may include groupings of words and Websites, referred to herein as “clusters,” that may be used to classify subject matter available on the Internet. As used herein, the term “cluster type” refers to a unique cluster that represents a particular user interest or type of Web content. Each cluster type may be associated with a group of words that characterize the cluster type as well as one or more Websites that contain subject matter relevant to the cluster type. Each cluster may also be assigned a unique cluster-type descriptor, as will be explained further below. An exemplary clustering technique may be better understood with reference to Table 1.
Table 1 is a graphical representation of an exemplary word/Website matrix that may be used to generate the clustering information. It should be recognized that this is a simplification as many applications will generally be more complex, as discussed below. As shown in Table 1, words from the search query data may be distributed along rows and Website addresses from the search query data may be distributed along columns. For each word-Website pair in the search query data, the matrix entry at the intersection of the word and Website may be set to 1. All other matrix entries may be empty or set to zero.
After filling the matrix, the words and Websites may be grouped according to the distribution of matrix entries. The words may be grouped together based on the similarity of each word's distribution of column entries. The Websites may be grouped together based on the similarity of each Website's distribution of row entries. For example, referring to Table 1, it can be seen that the rows corresponding to the words “car,” “auto,” and “automobile” have identical distributions of column entries. Thus, the words “car,” “auto,” and “automobile” may be grouped into the same cluster. Additionally, the columns corresponding to the Websites “CARS.COM™,” “AUTOS.COM™” and “EDMONDS.COM™” have very similar distributions of row entries. Thus, the Websites “CARS.COM™,” “AUTOS.COM™” and “EDMONDS.COM™” may also be grouped into the same cluster.
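The grouping described above may be sketched as follows. This is a deliberately simplified illustration that groups only rows (or columns) with identical distributions; it is not the co-clustering analysis mentioned earlier, which would also group similar but non-identical distributions. The word-Website pairs are hypothetical and are not the contents of Table 1.

```python
from collections import defaultdict

# Hypothetical word-Website pairs; these are not the contents of Table 1.
pairs = [
    ("car", "cars.example"), ("car", "autos.example"),
    ("auto", "cars.example"), ("auto", "autos.example"),
    ("automobile", "cars.example"), ("automobile", "autos.example"),
    ("novel", "books.example"), ("paperback", "books.example"),
]

# Sparse binary matrix: each key maps to the set of positions whose entry is 1.
rows = defaultdict(set)     # word -> Websites (row distribution)
columns = defaultdict(set)  # Website -> words (column distribution)
for word, site in pairs:
    rows[word].add(site)
    columns[site].add(word)


def group_identical(distributions):
    """Group keys whose distributions of matrix entries are identical."""
    groups = defaultdict(list)
    for key, entries in distributions.items():
        groups[frozenset(entries)].append(key)
    return sorted(sorted(members) for members in groups.values())


word_clusters = group_identical(rows)
site_clusters = group_identical(columns)
```

Note that the grouping uses only the pattern of matrix entries, consistent with the observation that no knowledge of word meanings or Website content is needed.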
Table 2 represents an example of cluster information that may be obtained after the automated analysis of the exemplary word-Website matrix of Table 1. Each cluster may be assigned a unique cluster-type descriptor, for example, a cluster number. Furthermore, after the clusters have been generated via the automated analysis, the cluster data may be viewed and a textual cluster-type descriptor may be assigned to each cluster based on the apparent subject matter encompassed by each cluster. For example, the third and fourth columns of Table 2 relate to cluster 2, which has been assigned the textual cluster-type descriptor “automobiles.” The exemplary cluster includes the Websites “CARS.COM™,” “AUTOS.COM™” and “EDMONDS.COM™” and the words “car,” “auto,” and “automobile,” among others.
It can be appreciated from the foregoing example that the similarity between the words and the Websites can be ascertained without knowing the meanings of the words or the content of the Websites. In other words, the process of generating the clusters does not involve human lexical interpretation.
As previously noted, the graphical representation of the word/Website matrix of Table 1 is provided merely as an aid to explaining the invention. In actual practice, the word/Website matrix will generally be more complex, for example, including several thousands of words and Website addresses stored in a machine-readable medium for electronic processing.
Furthermore, while clusters for words and Websites are aligned in the present example, this is unlikely to be the case in many situations. For example, if there are 100 word clusters and just 20 Website clusters, each Website (or Website cluster) could then be represented in terms of the 100 word clusters. This may be performed by determining the counts of how many words from each of these clusters belong to that Website. Further, some Websites (like AMAZON.COM™) might cover books, appliances, music, etc., while others (APPLIANCES.COM™) might just cover appliances. The clustering algorithm would segment searches into clusters like “books”, “appliances”, “music”, “cars”, and the like. AMAZON.COM™ would be connected to the first three clusters (but not to “cars”), but APPLIANCES.COM™ would just be connected to the appliances cluster. Accordingly, in exemplary embodiments, searches done on APPLIANCES.COM™ could be transferred to AMAZON.COM™, but only a subset of AMAZON.COM™ searches would be transferred to APPLIANCES.COM™.
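The representation of a Website in terms of word clusters may be sketched as follows. The cluster assignments, site names, and the `transferable` helper are hypothetical; the sketch assumes each word belongs to exactly one word cluster.

```python
from collections import Counter

# Hypothetical word clusters: word -> word-cluster descriptor.
word_to_cluster = {
    "novel": "books", "paperback": "books",
    "toaster": "appliances", "blender": "appliances",
    "album": "music",
    "sedan": "cars",
}

# Hypothetical words associated with each Website.
site_words = {
    "general-store.example": ["novel", "paperback", "toaster", "album"],
    "appliance-shop.example": ["toaster", "blender"],
}


def cluster_profile(words):
    """Count how many of a Website's words belong to each word cluster."""
    return Counter(word_to_cluster[w] for w in words if w in word_to_cluster)


profiles = {site: cluster_profile(words) for site, words in site_words.items()}


def transferable(search_cluster, target_site):
    """A search may carry over to a Website only if the site covers its cluster."""
    return profiles[target_site][search_cluster] > 0
```

Here a search in the appliances cluster can be transferred to the general store, whereas a search in the books cluster cannot be transferred to the appliance shop, mirroring the asymmetry described above.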
The cluster information may provide a vocabulary that may be used to characterize the interests of various users and the subject matter offered by various Websites. Thus, the clustering information may be used to match user interests with relevant Website content. Accordingly, referring also to
At block 206, cluster types may be stored in a user profile based on the prior Web activity from the user ID, for example, based on prior search queries from the user ID. In exemplary embodiments, search terms entered by the user in prior searches may be compared with the clustering information to determine which cluster types correspond with the search terms. Descriptors for these cluster types may be stored to the user profile. An exemplary method of generating a user profile is described further in relation to
At block 208, a user ID is used to access a selected Website and the client system 102 associated with the user ID provides one or more cluster types to the Website 106. Upon accessing the Website, the client system 102 may search for matches between Website content and the user's interests as indicated by the user profile. Both the Website content and the user profile may be described in terms of cluster types. The client system 102 may search the user profile for matching cluster types that are common to both the user profile and the selected Website. One or more of the matching cluster types may then be sent to the Website server 106, enabling the Website server to personalize the Website according to a user's interests. An exemplary method of locating a cluster type in the user profile and sending the cluster type to a Website is described further in relation to
At block 210, the content provided by the selected Website to the user ID of the client system 102 may be determined based on the cluster types received by the Website from the client system 102. In this way, the selected Website, including the initial Web page and subsequent subpages, may be personalized according to interests indicated by a particular user ID.
At block 304, the search terms used in the search query may be used to generate a bag of words. The bag of words may be generated according to the method described in reference to block 202 of
At block 306, the bag of words may be compared with the clustering information to determine one or more cluster types that correspond with the search performed from the user ID at block 302. The cluster types applicable to the search may be determined by correlating the words in the bag of words with the words included in the cluster information. The cluster types that have the most words in common with the bag of words may be added to the user profile. For example, each word in the bag of words may be looked for in the clustering information and a match between a word in the bag of words and a word in a specific cluster type may result in a “hit” for that cluster type. The total number of hits for each cluster type may be tallied to determine the one or more cluster types that correspond more closely with the words in the bag of words.
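The hit tally may be sketched as follows. The cluster contents and function names are hypothetical, and the sketch assumes the bag of words and each cluster's words are available as sets.

```python
from collections import Counter

# Hypothetical cluster information: cluster-type descriptor -> words in the cluster.
clusters = {
    "automobiles": {"car", "auto", "automobile", "sedan"},
    "books": {"novel", "paperback", "author"},
    "music": {"album", "song", "band"},
}


def tally_hits(bag, clusters):
    """Count, per cluster type, how many words in the bag appear in that cluster."""
    hits = Counter()
    for descriptor, words in clusters.items():
        n = len(bag & words)  # each shared word is one hit
        if n:
            hits[descriptor] = n
    return hits


def top_cluster_types(bag, clusters, n=1):
    """Return up to n cluster-type descriptors, ranked by number of hits."""
    return [descriptor for descriptor, _ in tally_hits(bag, clusters).most_common(n)]


bag = {"car", "sedan", "dealer", "author"}
hits = tally_hits(bag, clusters)
```

With this bag, the automobiles cluster receives two hits and the books cluster one, so `top_cluster_types(bag, clusters)` selects the automobiles cluster type.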
At block 308, cluster types may be saved to the user profile. Saving a cluster type to the user profile may include saving the cluster-type descriptor corresponding with the cluster type to the user profile. In exemplary embodiments of the present invention, the cluster type with the highest number of hits may be saved to the user profile. In other exemplary embodiments, two or more cluster types may be added to the user profile depending on the distribution of hits between the cluster types. For example, the cluster types may be ranked according to the total number of hits for each cluster type, and two or more of the top ranked cluster types may be entered into the user profile. In exemplary embodiments of the present invention, the method 300 is performed by the user's computer, for example the client system 102. In other exemplary embodiments, the method 300 may be performed by the Website at which the user performed the search query referenced in block 302. Accordingly, the Website may save the cluster type to the user profile by storing the cluster type in a cookie on the user's computer. In other exemplary embodiments, the method 300 may be performed at a server hosted by the ISP or a third party based on the search query referenced in block 302.
In an exemplary embodiment of the present invention, each cluster type entered into the user profile may be associated with a time factor that may be used to determine the age of each cluster type entry in the user profile. The time factor may include a time stamp indicating the date and/or time that the cluster type was added to the user profile. Alternatively, the time factor may include a time-decaying weighted vector that may be periodically adjusted to indicate an age of the cluster type entry. In some exemplary embodiments, the time-decaying weighted vector may be periodically adjusted to decay exponentially over time. The time factor may be used to attach greater relative importance to more recent searches. In this way, user interests indicated by more recent Website accesses may take priority over user interests indicated by older Website accesses in personalizing a Website for a particular user ID.
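One way an exponentially decaying weight could be computed is sketched below. The half-life parameterization and the 30-day default are assumptions made for illustration; the description does not specify a decay rate.

```python
import math


def decayed_weight(initial_weight, age_days, half_life_days=30.0):
    """Exponentially decay a weight so that it halves every half_life_days."""
    return initial_weight * math.exp(-math.log(2) * age_days / half_life_days)


# Weights for a recent entry and an older entry in a hypothetical user profile.
recent = decayed_weight(1.0, age_days=5)
old = decayed_weight(1.0, age_days=90)
```

Under these assumptions an entry loses half of its weight every 30 days, so the recent entry retains a larger weight than the older one.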
Additionally, each cluster type entered into the user profile may be ranked to indicate a magnitude of the user's interest in the content related to the cluster type. In one exemplary embodiment, each cluster type entry may be associated with a frequency indicator that indicates a number of times that the user ID was used to perform a search corresponding with the cluster type. Accordingly, if a user ID is used to perform a search corresponding with a cluster type that has been previously added to the user profile, the frequency indicator for that cluster type entry may be incremented. Methods of personalizing the content of a Webpage are further described in relation to
At block 404, the cluster information may be analyzed to identify cluster types corresponding with the selected Website. For example, the list of clusters in the cluster information may be searched to identify the one or more clusters that include the address of the selected Website. As a further illustration, if the user ID accesses AMAZON.COM™, analysis of the cluster information may identify cluster types pertaining to books, movies, video games, electronics, and any other product available on the AMAZON.COM™ Website.
At block 406, the user profile may be analyzed to identify matching cluster types that are common to both the selected Webpage and the user profile. The matching cluster types may indicate a match between the user interests and the available content that may be provided by the selected Website.
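Blocks 404 and 406 may be sketched together as a lookup followed by a set intersection. The cluster contents, site names, and user profile below are hypothetical.

```python
# Hypothetical cluster information: descriptor -> Website addresses in the cluster.
cluster_sites = {
    "books": {"general-store.example", "books.example"},
    "electronics": {"general-store.example"},
    "automobiles": {"cars.example"},
}

# Hypothetical user profile: descriptors saved from prior searches.
user_profile = {"books", "automobiles", "astronomy"}


def site_cluster_types(site, cluster_sites):
    """Block 404: cluster types whose cluster includes the selected Website."""
    return {d for d, sites in cluster_sites.items() if site in sites}


def matching_cluster_types(site, user_profile, cluster_sites):
    """Block 406: cluster types common to the selected Website and the user profile."""
    return site_cluster_types(site, cluster_sites) & user_profile
```

In this sketch only the matching descriptors leave the client, so profile entries unrelated to the selected Website (here, astronomy and automobiles when visiting the general store) are not disclosed to it.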
At block 408, the one or more matching cluster types may then be sent from the client system 102 to the Website 106. In some embodiments, sending a cluster type to a Website 106 may include sending the cluster-type descriptor corresponding with the cluster type to the Website 106. As discussed above in relation to
In some instances, several matching cluster types may be identified for a particular Website and user profile. Therefore, the client system 102 may send a subset of the matching cluster types to the Website server. Accordingly, the matching cluster types may be ranked and the subset of matching cluster types may include one or more of the top ranked matching cluster types. In some exemplary embodiments, the ranking of the matching cluster types may be based, in part, on the magnitude of the user interest as indicated, for example, by the frequency indicator. In other exemplary embodiments, ranking of the matching cluster types may be based, in part, on the age of the user interest as indicated, for example, by the time stamp or the time-decaying weighted vector associated with the cluster type in the user profile. In this way, more relevant matching cluster types may be sent to the Website server.
For example, if a user ID was used to perform a large number of searches related to fly fishing shortly in time (for example, within a day, a week, or a month) before accessing AMAZON.COM™, a matching cluster type related to fly-fishing may be given a high rank compared to other matching cluster types. Thus, the AMAZON.COM™ Website may be more likely to display books related to fly fishing. Conversely, if a user ID was used to perform a small number of searches related to astronomy several months prior to accessing AMAZON.COM™, a matching cluster type related to astronomy may be given a low rank compared to other matching cluster types. Thus, the AMAZON.COM™ Website may be less likely to display books related to astronomy. In some exemplary embodiments of the present invention, the rank associated with each cluster type may also be sent to the selected Website.
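The ranking in the foregoing example may be sketched as follows. The scoring formula, which multiplies the frequency indicator by an exponential age decay, is a hypothetical choice and only one of many ways the frequency and time factors could be combined. The field names, the 30-day half-life, and the sample values are likewise assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ProfileEntry:
    descriptor: str   # cluster-type descriptor
    frequency: int    # number of searches that corresponded with this cluster type
    age_days: float   # time since the most recent such search


def score(entry, half_life_days=30.0):
    """Hypothetical score: frequency discounted by exponential age decay."""
    return entry.frequency * math.exp(-math.log(2) * entry.age_days / half_life_days)


def rank_matches(entries, limit=1):
    """Return the descriptors of the top ranked matching cluster types."""
    ranked = sorted(entries, key=score, reverse=True)
    return [entry.descriptor for entry in ranked[:limit]]


matches = [
    ProfileEntry("astronomy", frequency=2, age_days=120),
    ProfileEntry("fly fishing", frequency=25, age_days=3),
]
to_send = rank_matches(matches)
```

With these values the many recent fly fishing searches outrank the few older astronomy searches, so only the fly fishing cluster type would be sent when `limit` is 1.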
At block 410, the selected Website may determine the content of the initial Web page based on the one or more matching cluster types received from the client system 102. For example, if the selected Website is AMAZON.COM™ and the Website receives a cluster type related to an interest in astronomy, the AMAZON.COM™ initial Web page may be personalized to display books related to astronomy. Furthermore, referring to
The process used by the Website to determine subject matter related to the cluster type may depend on the way in which the cluster type was sent to the Website. For example, if a textual cluster-type descriptor is sent to the Website, the Website may perform a keyword search using the textual descriptor. Similarly, if one or more words from the cluster are sent to the Website, the Website may perform a keyword search using the one or more words from the cluster. Subject matter located via the keyword search may then be incorporated into the initial Web page and subsequent subpages that the user ID may access. In this example, the Website may or may not have access to the cluster information. However, if a cluster ID number is sent to the Website, the Website may correlate the cluster ID number with relevant subject matter known to correspond with the cluster ID number. In this example, the Website may have access to a list of subjects that correlate with each cluster ID number. Additionally, in this example, the Website may have access to the cluster information. Thus, the Website may use the cluster ID number to search the cluster information for the actual cluster that corresponds with the cluster ID number. The Website may then obtain the words that are included in the cluster and use those words to perform a keyword search for relevant subject matter.
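The three cases above may be sketched on the Website side as follows. The `CLUSTER_WORDS` table, the cluster ID numbers, and the use of Python types to distinguish the cases are hypothetical; an actual Website might instead rely on an explicit field in the request.

```python
# Hypothetical cluster information available to the Website: cluster ID -> words.
CLUSTER_WORDS = {
    2: ["car", "auto", "automobile"],
    7: ["telescope", "planet", "star"],
}


def keywords_for(received):
    """Turn whatever the client sent into keywords for a search of the Website."""
    if isinstance(received, int):
        # Cluster ID number: look up the words of the corresponding cluster.
        return list(CLUSTER_WORDS.get(received, []))
    if isinstance(received, str):
        # Textual cluster-type descriptor: use it directly as the keywords.
        return received.split()
    # One or more words taken from the cluster.
    return list(received)
```

Only the cluster ID case requires the Website to hold the cluster information; the other two cases work with the received text alone.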
The various software components discussed herein can be stored on the tangible, machine-readable medium 500 as indicated in
Although shown as contiguous blocks, the software components can be stored in any order or configuration. For example, if the tangible, machine-readable medium 500 is a hard drive, the software components can be stored in non-contiguous, or even overlapping, sectors.