Supervised learning methods (such as decision tree learning and other classification techniques) may be used, for example, to predict the value of a variable of an unknown instance (such as content-related features of a previously-unvisited web page) based on properties of known instances (such as content-related features of previously-visited web pages). Conventionally, supervised learning methods utilize supervision, typically manual labeling, to generate training data. Using such supervised learning methods relative to web page content-related features can require a large amount of training data and, therefore, such an approach may generally not be efficiently scalable.
In accordance with an aspect, a decision tree may be determined that is a site map for a domain of web pages. A clustering of a plurality of web pages of a domain is determined, in an unsupervised fashion, based on content-related features of the plurality of web pages. Each determined cluster includes a plurality of web pages, each of the plurality of web pages characterized by a resource locator and each of the resource locators being characterized by at least one resource locator token. The clustering is processed to organize indications of the content-related features of the plurality of web pages into a decision tree characterized by a plurality of nodes, each node characterized by a feature and a value, the feature being at least one of the resource locator tokens and the value being a value of that resource locator token.
The inventors have realized the desirability of determining an organization of web pages by building a decision tree using training data that has been automatically generated in an unsupervised manner. As a result, the generation of such decision trees may be highly scalable, such as may be desirable for use in analyzing web pages of the World Wide Web.
See, for example, J. R. Quinlan, “Induction of Decision Trees,” Machine Learning, 1986. Examples of such analysis include URL normalization, which includes generating a representative URL for a group of URLs. Another example of such analysis is duplicate detection, which includes detecting duplicate pages on the web in a scalable fashion.
A scalable crawler may use the decision tree to detect duplicate pages from the URLs of the pages without actually crawling those pages. By using the decision tree to group and aggregate features, search relevance can be improved. The decision tree may also be used in advertisement targeting, to serve relevant advertisements on unseen pages.
In general, the decision tree provides high-recall and high-precision information extraction.
Broadly speaking, the training data may be generated by determining, in an unsupervised fashion, clusters of a plurality of “training” web pages based on content-related features of the plurality of web pages, such as the content of each web page obtained by stripping the HTML tags. Depending upon the application, the content of the web page could also include the HTML tags. Each determined cluster includes a plurality of web pages, each of the plurality of web pages characterized by a resource locator and each of the resource locators having at least one resource locator token.
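As a rough illustration (a minimal sketch in Python; the function and class names are hypothetical, and the disclosure does not prescribe any particular implementation), tag-stripped content features might be extracted as follows:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulates the text content of a page, discarding the HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def content_features(html):
    """Return the tag-stripped text of a web page, suitable as input to a
    content-based clustering step. Depending upon the application, the raw
    HTML (tags included) could be used instead."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)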
Information of the clustering is used as training data for generating a decision tree. More particularly, the clusters are processed to organize indications of the content-related features of the plurality of web pages into a decision tree characterized by a plurality of nodes. Each node of the decision tree is characterized by a feature and a value. The feature is at least one of the resource locator tokens, and the value is a value of that at least one resource locator token.
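For illustration only (a minimal sketch; the structure and names are assumptions rather than anything taken from the disclosure), such a node, and the use of the resulting tree to place an unseen URL, might look like:

from dataclasses import dataclass, field

@dataclass
class Node:
    feature: str = None             # a resource locator token, e.g. "cat"
    value: str = None               # that token's value, e.g. "sports"
    children: list = field(default_factory=list)   # lower-level nodes
    cluster_id: str = None          # set on leaf nodes only

def classify(node, url_tokens):
    """Walk the tree to find the cluster a (possibly unseen) URL falls into,
    using only its tokens, e.g. {"cat": "sports", "subcat": "football"}."""
    for child in node.children:
        if url_tokens.get(child.feature) == child.value:
            return classify(child, url_tokens)
    return node.cluster_id

Under this sketch, the tokens of an unseen URL alone determine its cluster, which is what would allow, for example, a crawler to flag likely duplicates without fetching the page.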
We first describe a general approach to building a decision tree using training data that has been automatically determined in an unsupervised manner. We then provide an illustrative example. The general approach is described with reference to
An analysis process 118 processes the received web page content saved in storage 116. More specifically, the analysis process 118 includes processing to cluster web pages based on characteristics of the web page content. The clustering is an unsupervised process. In one example, the clustering of the analysis process 118 is generally for web pages that result from access requests corresponding to a particular domain.
Having determined the clusters, the web page content in storage 116 is annotated with an indication of the result of the cluster determination. Such an indication can have various uses. In the
We now discuss, with reference to
At step 204, the web pages are clustered based on content, such that web pages having similar content are clustered together. More particularly, at least some of the web pages of the particular domain are clustered using an unsupervised clustering algorithm. Clustering of web pages is known. For example, the paper entitled “Syntactic Clustering of the Web,” by Broder et al., describes clustering using shingling to determine near-duplicate clusters. In general, any technique that clusters based on content similarity/dissimilarity may be acceptable. The paper entitled “A Short Survey of Document Structure Similarity Algorithms,” by D. Buttler, describes a number of known clustering algorithms. Using a shingling technique, in particular, web pages whose similarity measure is above a particular threshold (such as an 8/8 shingle match) may be clustered together. See also U.S. Patent Publication No. 2006/0112089, entitled “Methods and Apparatus for Assessing Web Page Decay,” by Andrei Z. Broder et al., and U.S. Pat. No. 6,119,124, entitled “Method for Clustering Closely Resembling Data Objects,” by Andrei Broder, Steve Glassman, Greg Nelson, Mark Manasse, and Geoffrey Zweig.
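As a simplified, hypothetical sketch of such shingle-based clustering (the hash function, sketch size, and greedy grouping here are illustrative assumptions, not the cited algorithms verbatim):

import hashlib

def sketch(text, k=8, size=8):
    """Min-hash style sketch: the `size` smallest hashes of the page's
    k-word shingles."""
    words = text.split()
    grams = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    hashes = sorted(int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams)
    return frozenset(hashes[:size])

def cluster_by_shingles(pages):
    """Greedy near-duplicate clustering: a page joins the first cluster whose
    representative sketch it matches 8/8; otherwise it starts a new cluster.
    `pages` maps each URL to its extracted text content."""
    clusters = []                      # (representative sketch, URL list) pairs
    for url, text in pages.items():
        s = sketch(text)
        for rep, urls in clusters:
            if len(s & rep) == 8:      # an 8/8 shingle match
                urls.append(url)
                break
        else:
            clusters.append((s, [url]))
    return [urls for _, urls in clusters]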
Consider an example in which the particular domain is foo.com, which has no mirror sites and, hence, the domain name itself is the webmaster-id. Table 2 lists some example URLs for this domain, as well as an example clustering result (in this example, indicated by a cluster identification).
That is, the twelve retrieved web pages have been clustered into four clusters of three web pages each. Each cluster has been given an identification of 01, 02, 03 or 04. Still with reference to
Thus, for example, building the decision tree in a bottom-up manner, the leaf nodes of the decision tree may each be characterized by a feature that is common to all the URLs of a particular cluster, as illustrated in
In
Features corresponding to the above URL and their values are shown below:
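(The listing itself is not reproduced here. For illustration, and consistent with the cluster discussed below, a hypothetical URL of the form http://www.foo.com/index.html?cat=sports&subcat=football&page_id=1 would yield the feature “cat” with value “sports,” the feature “subcat” with value “football,” and the feature “page_id” with value “1.”)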
Referring to
One key for the cluster 302 is “cat,” for which the only value is “sports” with a count of three. Another key for the cluster 302 is “subcat,” for which the only value is “football,” again with a count of three. Another key for the cluster 302 is “page id.” The key “page id” has three values in the URLs of the cluster. One value is “1,” with a count of 1. Another value is “2,” with a count of 1. A final value for “page id” is “3,” with a count of 1.
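A minimal sketch of this per-cluster bookkeeping (assuming key=value query-string tokens as in the hypothetical URL above; the function names are illustrative):

from collections import defaultdict
from urllib.parse import parse_qsl, urlparse

def key_value_counts(urls):
    """For one cluster, count how often each URL key takes each value, e.g.
    {"cat": {"sports": 3}, "page_id": {"1": 1, "2": 1, "3": 1}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for url in urls:
        for key, value in parse_qsl(urlparse(url).query):
            counts[key][value] += 1
    return counts

def leaf_features(urls):
    """Keys taking a single value shared by every URL of the cluster; these
    are the features that characterize the cluster's leaf node."""
    counts = key_value_counts(urls)
    return {key: next(iter(values)) for key, values in counts.items()
            if len(values) == 1 and sum(values.values()) == len(urls)}

Applied to the three URLs of cluster 302, leaf_features would return the “cat” and “subcat” features (with values “sports” and “football”), while “page id,” having three distinct values, would not qualify.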
To generate the next level up, it is determined what other keys highly predict (are highly correlated to) various combinations of already-created nodes (i.e., of clusters 302, 304, 306 and 308), in general, ignoring the features used to determine the leaf nodes. Put another way, each node at the next level up is defined to specify a collective characterization of URLs of lower level nodes that are constituents of that next level up node. The combinations of clusters that can be highly predicted (or even most highly predicted) are designated as the nodes at the next level up. Thus, for example,
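One simplified way to realize this step, under the assumption that a “highly predictive” key is one whose value is uniform within each constituent node (other correlation measures could be substituted):

from collections import defaultdict

def next_level_up(nodes, used_keys):
    """Group existing nodes under new parent nodes keyed by an as-yet-unused
    token whose value is uniform within each node. `nodes` maps a node id to
    that node's key_value_counts (as computed above)."""
    parents = defaultdict(list)
    for node_id, counts in nodes.items():
        for key, values in counts.items():
            if key in used_keys or len(values) != 1:
                continue               # already used, or not uniform: skip
            (value,) = values          # the single value this node takes
            parents[(key, value)].append(node_id)
    # keep only (key, value) pairs that actually combine two or more nodes
    return {kv: ids for kv, ids in parents.items() if len(ids) > 1}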
It is further noted that building a decision tree top-down is also known. In one example, the top-down process starts with a dummy root node including all of the URLs to be mapped (along with their labels) and splits the node based on the URL tokens to form multiple child nodes. These child nodes are further considered for top-down splitting until the nodes satisfy criteria such as homogeneity (if the node is homogeneous, there is no need to split it further), a minimum number of URLs (if the node has fewer URLs than a threshold, that node is not split further), and perhaps other criteria. It can be seen that the top-down process is similar to the bottom-up process. However, in general, some steps of the bottom-up process can be parallelized, which can lead to more efficient processing. For example, the bottom-up process, due to its parallelizability, may be implemented using a scalable architecture such as MapReduce.
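A compact sketch of such a top-down split (the stopping thresholds and the split-key heuristic are illustrative assumptions; an information-gain criterion, for instance, could replace the naive key choice):

from collections import Counter, defaultdict
from urllib.parse import parse_qsl, urlparse

def token_value(url, key):
    return dict(parse_qsl(urlparse(url).query)).get(key)

def choose_split_key(node_urls):
    # naive stand-in for an information-gain style criterion: the key
    # occurring in the most URLs of this node
    freq = Counter(key for url, _ in node_urls
                   for key, _ in parse_qsl(urlparse(url).query))
    return freq.most_common(1)[0][0] if freq else None

def split(node_urls, min_urls=3):
    """node_urls: (url, cluster_label) pairs; returns a nested dict tree."""
    labels = {label for _, label in node_urls}
    if len(labels) == 1 or len(node_urls) < min_urls:
        return {"urls": node_urls}          # homogeneous or too small: stop
    key = choose_split_key(node_urls)
    children = defaultdict(list)
    for url, label in node_urls:
        children[token_value(url, key)].append((url, label))
    if key is None or len(children) == 1:
        return {"urls": node_urls}          # uninformative split: stop
    return {"split_on": key,
            "children": {v: split(c, min_urls) for v, c in children.items()}}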
We have described a system/method to determine an organization of web pages by building a decision tree using training data that has been automatically generated in an unsupervised manner. Embodiments of the present invention may be employed to facilitate determination of one or more similarity classes in any of a wide variety of computing contexts. For example, as illustrated in
According to various embodiments, a method of determining similarity classes such as described herein may be implemented as a computer program product having a computer program embodied therein, suitable for execution locally, remotely, or a combination of both. The remote aspect is illustrated in
The various aspects of the invention may also be practiced in a wide variety of network environments (represented by network 612) including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any type of computer-readable media, and may be executed according to a variety of computing models including, for example, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations.