Marketing on the World Wide Web (the Web) is a significant business. Users often purchase products through a company's Website. Further, advertising revenue can be generated in the form of payments to the host or owner of a Website when a user selects an advertisement that appears on the Website. The amount of revenue earned through Website advertising and product sales may depend on the Website's ability to provide marketing material or other Web content that is targeted to specific users, based on the user's interests.
Certain exemplary embodiments are described in the following detailed description and in reference to the drawings, in which:
Exemplary embodiments of the present invention provide techniques for generating a segmentation of Web content. As used herein, the term “exemplary” merely denotes an example that may be useful for clarification of the present invention. The examples are not intended to limit the scope, as other techniques may be used while remaining within the scope of the present claims. These techniques can provide methods for characterizing a particular user identification (user ID) in terms of the Web content accessed from that user ID and characterizing a particular Website in terms of the Web content provided. The segmentation results may be used to target Web content to specific user IDs.
In exemplary embodiments of the present invention, a segmentation of user IDs and Web content is generated and used to identify user IDs that have similar interests. The segmentation information may be useful for providing targeted Web content to a user ID. For example, a user of a user ID that regularly accesses a business page on a first Website may be interested in a similar business page on a second Website, even though the user may never have accessed the page on the second Website. If numerous other user IDs have been used to access both Websites, the user IDs may be placed in a segment with the similar business pages on both the first and the second Websites. The segment information may then be used to provide a suggestion to the user to access the business page on the second Website. In other exemplary embodiments, the segment information may be used to provide specific advertising to a certain user ID.
The segments may be generated by statistically processing a database of Web activity (such as clickstream data), for example, by information-theoretic co-clustering or other machine learning techniques based on statistical or stochastic processes. As used herein, a “database” is an integrated collection of logically related data that consolidates information previously stored in separate locations into a common pool of records that provide data for an application.
In an exemplary embodiment, the clickstream data for a plurality of user IDs may be processed to generate segments that correlate user IDs with Website accesses. Furthermore, prior to segmenting the clickstream data, the clickstream data may be processed to automatically determine a level of abstraction for uniform resource locators (URLs) that provides a more useful grouping of user IDs and Web pages. It should be clear that the present invention is not limited to the analysis of URLs (i.e., hypertext transfer protocol sites). In other embodiments, information accessed under any number of other protocols (such as file transfer protocol (FTP), user datagram protocol (UDP), and the like) may be analyzed and used to provide targeted Web content. Resources accessed under these protocols may be identified using a uniform resource identifier (URI), such as a URL.
The pre-segmentation processing of the clickstream data may include generating a plurality of features corresponding to each uniform resource locator (URL) in the clickstream data and filtering out the features that are not sufficiently supported. The resulting segment information provides groupings of Web pages and groupings of user IDs that have tended to visit those Web pages. The groupings, referred to herein as “segments,” may be used to provide users with Web content that is targeted to a particular user's interests.
The client system 102 can have other units operatively coupled to the processor 112 through the bus 113. These units can include tangible, machine-readable storage media, such as a storage system 122 for the long term storage of operating programs and data, including the programs and data used in exemplary embodiments of the present techniques. The storage system 122 may also store a user profile generated in accordance with exemplary embodiments of the present techniques. Further, the client system 102 can have one or more other types of tangible, machine-readable media, such as a memory 124, for example, which may comprise read-only memory (ROM), random access memory (RAM), or hard drives in a storage system 122. In exemplary embodiments, the client system 102 will generally include a network interface adapter 126, for connecting the client system 102 to a network, such as a local area network (LAN 128), a wide-area network (WAN), or another network configuration. The LAN 128 can include routers, switches, modems, or any other kind of interface device used for interconnection.
Through the LAN 128, the client system 102 can connect to a business server 130. The business server 130 can also have machine-readable media, such as storage array 132, for storing enterprise data, buffering communications, and storing operating programs for the business server 130. The business server 130 can have associated printers 134, scanners, copiers and the like. The business server 130 can access the Internet 110 through a connected router/firewall 136, providing the client system 102 with Internet access. The business network discussed above should not be considered limiting, as any number of other configurations may be used. Any system that allows a client system 102 to access the Internet 110 should be considered to be within the scope of the present techniques.
Through the router/firewall 136, the client system 102 can access a search engine 104 connected to the Internet 110. In exemplary embodiments of the present invention, the search engine 104 can include generic search engines, such as GOOGLE™, YAHOO®, BING™, and the like. The client system 102 can also access the Websites 106 through the Internet 110. The Websites 106 can have single Web pages, or can have multiple subpages 138. Although the Websites 106 are actually virtual constructs that are hosted by Web servers, they are described herein as individual (physical) entities, as multiple Websites 106 may be hosted by a single Web server and each Website 106 may collect or provide information about particular user IDs. Further, each Website 106 will generally have a separate identification, such as a URL, and function as an individual entity.
The Websites 106 can also provide search functions, for example, searching subpages 138 to locate products or publications provided by the Website 106. For example, the Websites 106 may include sites such as EBAY®, AMAZON.COM®, WIKIPEDIA™, CRAIGSLIST™, FOXNEWS.COM™, and the like. In exemplary embodiments of the present invention, one or more of the Websites 106 may be configured to collect information about a visitor, such as using the visitor's user ID to access segment information. The Website 106 may use the segment information to determine targeted content to deliver to the user ID.
The client system 102 and Websites 106 may also access a database 144, which may be connected to an Internet service provider (ISP) 146 on the Internet 110. The database 144 may be accessible to the client system 102 and one or more of the Websites 106 and may store clickstream data, as described below in reference to
The segment information may determine groups of users that tend to visit the same Web pages and groups of Web pages that tend to be visited by the same users. The segment information, therefore, enables users and Web pages to be grouped according to similar visitation patterns. The segmentation of Web content may then be used by the Websites 106 to determine the content of a Web page based on the visitation patterns of the user. For example, the segment information may be used to deliver targeted Web page advertising.
The method is generally referred to by the reference number 200 and may begin at block 202, wherein a database of clickstream data for a plurality of user IDs is obtained. The clickstream data may include a recording of the Web browsing activity from a large number of user IDs. For example, the clickstream data may include user IDs in the form of encoded IP addresses that correspond to individual client systems 102 (
The URLs contained in the clickstream data may include various levels of abstraction. A URL with a high level of abstraction is one that may represent a broad range of subject matter, for example, a domain name of a Website such as “http://www.google.com.” A URL with a low level of abstraction is one that may represent very specific subject matter, for example, a specific article or publication such as “http://www.google.com/support/websearch/bin/answer=136861.” It will be appreciated that URLs with a low level of abstraction may represent specific Web content that may not be accessed from a large number of user IDs. Therefore, URLs that are too specific may not be visited from enough user IDs to provide data for a meaningful statistical analysis. For example, if a Website 106 is visited from fewer than about 20 user IDs, the sample set may not be large enough to be statistically significant.
On the other hand, a URL that is very general may be visited from large numbers of user IDs representing users with very divergent sets of interests. For example, AMAZON.COM™ and CNN.COM™ are likely to both have been accessed from any one user ID. Thus, URLs at the highest level of abstraction, which may have been accessed from most (for example, greater than about 50%) user IDs, may not provide useful information regarding specific interests of groups of individuals. Therefore, URLs that are too abstract or too specific may not yield useful results during the segmentation of Web content, as described below. To avoid this problem, the highly abstract URLs may be reduced to a lower level of abstraction. Exemplary embodiments of the present invention provide techniques for automatically determining the level of URL abstraction that provides a useful and accurate segmentation of Web content, as described below.
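The two support limits discussed above (fewer than about 20 user IDs, or more than about 50% of user IDs) can be illustrated with a minimal sketch. The function name `filter_by_support`, its parameters `min_users` and `max_fraction`, and the representation of clickstream data as (user ID, URL) pairs are assumptions made for illustration only and are not part of the described embodiments.

```python
from collections import defaultdict


def filter_by_support(visits, min_users=20, max_fraction=0.5):
    """Return the set of URLs whose support lies between the two limits.

    visits: iterable of (user_id, url) pairs (a hypothetical clickstream format).
    min_users: URLs visited from fewer distinct user IDs are treated as too specific.
    max_fraction: URLs visited from more than this share of all user IDs
        are treated as too abstract.
    """
    users_by_url = defaultdict(set)
    all_users = set()
    for user_id, url in visits:
        users_by_url[url].add(user_id)
        all_users.add(user_id)

    kept = set()
    for url, users in users_by_url.items():
        too_specific = len(users) < min_users
        too_abstract = len(users) > max_fraction * len(all_users)
        if not too_specific and not too_abstract:
            kept.add(url)
    return kept
```

The default values simply mirror the approximate figures given in the text; a practical system would likely tune them to the size of the clickstream database.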
At block 204, the clickstream data may be augmented by generating a plurality of features from the URLs contained in the clickstream data. In some exemplary embodiments, the features may be generated by truncating the URL. For example, the URL may be successively truncated at each forward slash to provide several URL features of increasing abstraction. Thus, the URL “blog.wired.com/business/2008/10/googles-mail-go.htm” may be used to generate such features as “blog.wired.com/business/2008/10,” “blog.wired.com/business/2008,” “blog.wired.com/business,” and “blog.wired.com.” Additional features may be generated by truncating the domain name at each dot. For example, “blog.wired.com” may be used to generate the additional features “wired.com” and “com.”
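The truncation at each forward slash and at each dot may be sketched as follows. The function name `url_features` is hypothetical, and returning the untruncated URL as the first feature is an assumption; the text does not state whether the original URL is itself retained as a feature.

```python
def url_features(url):
    """Return the URL followed by successively more abstract truncations."""
    # Drop the scheme, if present, so "http://host/path" and "host/path" agree.
    if "://" in url:
        url = url.split("://", 1)[1]
    url = url.rstrip("/")
    host, _, path = url.partition("/")
    segments = path.split("/") if path else []

    features = []
    # Truncate the path at each forward slash, most specific first.
    for end in range(len(segments), 0, -1):
        features.append(host + "/" + "/".join(segments[:end]))
    features.append(host)

    # Truncate the domain name at each dot, e.g. blog.wired.com, wired.com, com.
    labels = host.split(".")
    for start in range(1, len(labels)):
        features.append(".".join(labels[start:]))
    return features


if __name__ == "__main__":
    for feature in url_features("blog.wired.com/business/2008/10/googles-mail-go.htm"):
        print(feature)
```

Each returned feature would be associated with the same user ID as the URL from which it was derived.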
Features may also be generated from the URLs of search engines. For example, keywords pertaining to the subject matter of the search may be extracted from the search engine URL and each keyword may be a new feature. In other embodiments, additional features may also be generated from the content of Web pages. For example, if the title of a Web page is available, each word in the title may be a new feature. In some exemplary embodiments, the Web page content may be available in the clickstream data. In other embodiments, the Web page content may be obtained by accessing the Web page and extracting the Web content directly from the Web page. Each of the features may be associated with the same user ID as the original URL from which the feature was generated.
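Keyword extraction from a search engine URL may be sketched as follows. The sketch assumes that the search terms are carried in a query-string parameter named `q`; that parameter name, the function name `search_keyword_features`, and the host `www.example.com` are illustrative assumptions, since the text does not specify how a given search engine encodes its queries.

```python
from urllib.parse import parse_qs, urlsplit


def search_keyword_features(url, query_param="q"):
    """Return the distinct search keywords found in a search URL, in order."""
    # urlsplit only separates host and query when a scheme or "//" is present.
    if "://" not in url:
        url = "//" + url
    query = urlsplit(url).query
    terms = parse_qs(query).get(query_param, [])

    keywords = []
    for term in terms:
        for word in term.lower().split():
            if word not in keywords:
                keywords.append(word)
    return keywords


if __name__ == "__main__":
    print(search_keyword_features("http://www.example.com/search?q=small+business+loans"))
```

The same word-by-word split could be applied to a Web page title when one is available.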
At block 206, the augmented clickstream data may be entered into a data structure, such as a matrix, of user IDs and features to prepare the data for the segmentation processing. An exemplary segmentation technique may be better understood with reference to
Returning to
Similarly, if a particular column of the matrix contains a high number of entries, indicating that a large number of the users have visited the Web page corresponding to the feature, then the column for that feature may also be eliminated. More specifically, if a particular feature has been visited by too many users, the segmentation of Web content may not yield statistically significant data with respect to that feature, i.e., the feature may not distinguish one user ID from another. Accordingly, a number ‘M’ (such as 100,000, 10,000, 1,000, or smaller) may be specified such that any column with more than M entries may be eliminated. For example, with reference to
At block 210, the segment information is generated from the augmented and filtered clickstream data by segmenting the user IDs and the features into several groups based on the distribution of matrix entries. The user IDs may be grouped together based on the similarity of each user ID's distribution of column entries. Further, the features may be grouped together based on the similarity of each feature's distribution of row entries. The resulting segment information may include groupings of user IDs and features, referred to herein as “segments,” that may be used to identify groups of user IDs that show similar interests and groups of associated Web pages that provide similar content. The segment information may be generated by an automated analysis of the clickstream data matrix, for example, using a statistical analysis such as clustering, co-clustering, information-theoretic co-clustering, and the like. Other machine learning techniques or stochastic techniques may also be used. An exemplary segmentation technique may be better understood with reference to
As shown in the exemplary matrix of
As shown in Table 1, each segment may include a group of user IDs that are similar in terms of the Web pages they have been used to access. Each segment may also include a group of Web pages that are commonly visited from the user IDs included in the segment. For purposes of the present description, Web pages located in the same segment, thus showing similar visitation patterns, are referred to as “co-located.” The similarity of the visitation patterns of the user IDs included in each segment may be used to target those user IDs as well as other user IDs with Web content that is more likely to be of interest to an individual. It should be clearly recognized that the term “similarity” may generally refer to co-located pages.
In some embodiments, each segment may be associated with a segment identifier, which may be a category name applied by a human analyst. The segment identifier may also be an automatically generated identification code. It can be appreciated from the foregoing example that the similarity between the user IDs and the Web pages can be ascertained without knowing the meanings of the words contained in the URL or the content of the Web pages. In other words, the process of generating the segment information may not involve human lexical interpretation. Furthermore, it will be appreciated that the process described above may result in a large number of segments, for example, tens, hundreds, or thousands of segments.
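That segments can be formed from the pattern of matrix entries alone, without interpreting any words, may be illustrated with a deliberately simplified stand-in for the clustering and co-clustering analyses named above. The sketch below is not information-theoretic co-clustering: it merely groups user IDs and features that are linked, directly or indirectly, by at least one matrix entry (connected components of the user/feature matrix). The function name, the data layout, and the sample user IDs are hypothetical.

```python
from collections import defaultdict


def segment_by_components(entries):
    """Group user IDs and features linked by shared matrix entries.

    entries: iterable of (user_id, feature) pairs, one per nonzero matrix entry.
    Returns a list of (user_ids, features) tuples, each a pair of sorted lists.
    """
    features_of = defaultdict(set)  # matrix rows: user ID to features
    users_of = defaultdict(set)     # matrix columns: feature to user IDs
    for user_id, feature in entries:
        features_of[user_id].add(feature)
        users_of[feature].add(user_id)

    segments = []
    seen_users, seen_features = set(), set()
    for start in sorted(features_of):
        if start in seen_users:
            continue
        seen_users.add(start)
        users, features = {start}, set()
        pending = [start]
        while pending:
            user = pending.pop()
            for feature in features_of[user]:
                if feature in seen_features:
                    continue
                seen_features.add(feature)
                features.add(feature)
                # Pull in every other user ID that shares this feature.
                for other in users_of[feature]:
                    if other not in seen_users:
                        seen_users.add(other)
                        users.add(other)
                        pending.append(other)
        segments.append((sorted(users), sorted(features)))
    return segments
```

On real clickstream data most user IDs are linked through at least one common feature, so this grouping would tend to collapse into a single segment; a statistical technique such as co-clustering is what allows overlapping visitation patterns to be separated.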
As previously noted, the graphical representation of the word/Website matrix of
At block 212, the segment information may be used to provide targeted Web content to a user, for example, from a Website 106, a search engine 104, or an advertising server. Furthermore, the segment information may be analyzed by a person, or may be used directly without human analysis, to determine the content of a Web page. In one exemplary embodiment, the segment information may be analyzed by a person to identify patterns in Internet usage, and the results of the human analysis may then be used to tailor the content of specific Web pages or Websites. For example, analysis of the segment information may reveal two or more co-located Web pages, indicating that user IDs that visit one of the co-located Web pages also tend to visit the other co-located Web pages. Therefore, a particular Web page may be adapted to display Web advertising related to the other co-located Web pages. For example, referring to Table 1, the Web page “blog.wired.com/business” may be adapted to provide a Web advertising link to the Web page “http://www.usatoday.com/money/smallbusiness,” and vice-versa.
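The use of co-located Web pages to select targeted content may be sketched as follows. The layout of the segment information (a dictionary keyed by segment identifier), the function names, and the sample user IDs are assumptions made for illustration; only the two Segment 1 pages are taken from the example above.

```python
def co_located_pages(segments, page):
    """Return the other pages that share at least one segment with `page`."""
    related = set()
    for segment in segments.values():
        if page in segment["pages"]:
            related |= segment["pages"] - {page}
    return sorted(related)


def pages_for_user(segments, user_id):
    """Return the pages of every segment to which `user_id` belongs."""
    pages = set()
    for segment in segments.values():
        if user_id in segment["user_ids"]:
            pages |= segment["pages"]
    return sorted(pages)


if __name__ == "__main__":
    # Hypothetical segment information; user IDs are placeholders.
    segments = {
        "segment_1": {
            "user_ids": {"u1", "u2"},
            "pages": {
                "blog.wired.com/business",
                "http://www.usatoday.com/money/smallbusiness",
            },
        },
    }
    print(co_located_pages(segments, "blog.wired.com/business"))
    print(pages_for_user(segments, "u2"))
```

A Website could, for instance, use the first function to choose an advertising link for a given page and the second to choose content for a visiting user ID.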
Additionally, the segment information may be inspected to determine an intuitive category name for each segment based on the apparent subject matter encompassed by each segment. For example, referring to Table 1, Segment 1 may be assigned the category name “business.” The assignment of category names may provide market analysts with more intuitive information about the segments without inspecting the URLs within each segment. Furthermore, the category names may also be used in an automated process for delivering Web content. In other embodiments, the segment information may be automatically assigned an identification code rather than a category name.
In an exemplary embodiment of the present invention, an automated process for generating personalized Web content may include determining content of a Web page based on Web pages that are co-located within the segment information, i.e., represent similar content. Referring also to
In another exemplary embodiment of the present invention, an automated process for generating Web content may include targeting a particular user ID accessing a Website based on the segment or segments to which the user ID belongs. Referring also to
The various software components discussed herein can be stored on the tangible, machine-readable medium 400 as indicated in
Although shown as contiguous blocks, the software components can be stored in any order or configuration. For example, if the tangible, machine-readable medium 400 is a hard drive, the software components can be stored in non-contiguous, or even overlapping, sectors.