The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer system 100 may include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer system 100 and includes both volatile and nonvolatile media. For example, computer-readable media may include volatile and nonvolatile computer storage media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer system 100. Communication media may include computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For instance, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
The system memory 104 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 106 and random access memory (RAM) 110. A basic input/output system 108 (BIOS), containing the basic routines that help to transfer information between elements within computer system 100, such as during start-up, is typically stored in ROM 106. Additionally, RAM 110 may contain operating system 112, application programs 114, other executable code 116 and program data 118. RAM 110 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by CPU 102.
The computer system 100 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, discussed above and illustrated in
The computer system 100 may operate in a networked environment using a network 136 to one or more remote computers, such as a remote computer 146. The remote computer 146 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer system 100. The network 136 depicted in
The present invention is generally directed towards a system and method for hierarchical segmentation of websites by topic. In general, hierarchical topic segmentation may provide a segmentation of a website into topically-cohesive regions that may respect the hierarchical structure of the website and may effectively describe the topical content of the website for a user. Each page of the website may be assumed to have a topic label or a distribution on topic labels generated using a standard classifier. These distributions, along with a hierarchical arrangement of all the pages in the site, may be provided to an algorithm that may perform hierarchical segmentation of the website by topic. The algorithm may output the segments of the website representing the coherent topics, for instance, by returning a set of segmentation points that may optimally partition the site.
The present invention may also provide a set of cost measures characterizing the benefit accrued by introducing a segmentation of the website based on the topic labels. As will be seen, an objective function for the partitioning may be considered a combination of two competing costs: the cost of choosing the nodes as segmentation points and the cost of assigning the leaves to the closest chosen nodes. The node selection cost may model the requirements for a node to serve as a segmentation point, while the cohesiveness cost may model how the selection of a node as a segmentation point may improve the representation of the content with a subtree rooted at the node. As will be understood, the various block diagrams, flow charts and scenarios described herein are only examples, and there are many other scenarios to which the present invention will apply.
Turning to
In various embodiments, a client computer 202 may be operably coupled to one or more servers 208 by a network 206. The client computer 202 may be a computer such as computer system 100 of
The server 208 may be any type of computer system or computing device such as computer system 100 of
Each of the analyzers and engines included in the server 208 may be any type of executable software code such as a kernel component, an application program, a linked library, an object with methods, or other type of executable software code. The server 208 may additionally be operably coupled to storage 218. The storage 218 may be any type of computer-readable media and may store information about a website directory such as a uniform resource locator (“URL”) 220, segment information such as a segment ID 222, and information about topics such as topic ID 224. In an embodiment, a record may be stored that may associate a website location such as a URL 220 with a segment ID 222 and a topic represented by the topic ID 224.
There may be a variety of applications which may use hierarchical segmentation of websites by topic. Various results that may be currently applied to websites could more naturally be applied to topically-focused segments. First of all, web search may already incorporate special treatment for pages that are known to possess a given topic—for instance, many engines provide a link to the topic in a large directory such as the Yahoo! Directory, Wikipedia, or the Open Directory Project. These approaches may naturally be extended when several pages from a search result list lie within a topically-focused segment. Second, the result segments provide a simple and concise site-level summary to help users who wish to understand the overall content and focus of a particular website. Additionally, a host such as an ISP may contain many individual websites, and a topical segmentation may provide a useful input to help determine the appropriate granularity of a site. Topical segmentation of a website may also be applicable for website classification. Website classification has been addressed using primarily manual methods since the early days of the web, in part because sites typically do not contain a single uniform class. Topical segmentation of a website may offer an important starting point for solving website classification problems.
For clarity, it may be important to note the difference between segmentation of a website and classification of a website. General website classification tries to assign topics to websites by employing features that are broad and varied. A few example features for this broader problem may include the topic of each page, the internal hyperlinks on the site, the commonly linked-to entry points to the site together with their anchor text, the general external link structure, the directory structure of the site, the link and content templates present on the site, the description, title, and h1-6 tags on key pages on the site, and so forth. The final classes in a website classification problem may be distinct from the classes employed at the page level. Hierarchical segmentation of a website by topic, on the other hand, specifically focuses on aggregating the topic labels on web pages into subtrees according to the hierarchy of a site, in order to convey information such as, “This entire sub-site may be about Sports.” Thus, hierarchical segmentation of a website by topic may not only address the problem of determining whether and how to split the site, but may also be the beginning of a broader research problem of classifying websites using rich features. The broader problem of classifying websites may be of great interest in both binary cases (such as understanding whether the content of a website may be spam, porn, or some other category of content) and multi-class cases (such as determining which topics a website represents). A solution to hierarchical segmentation of a website by topic may therefore be essential to fully address the more general site classification problem.
For many websites, hierarchical segmentation of a website by topic may effectively describe the topical content of the website for a user. If the website is topically homogeneous, the URL of the website and a topic label representing the content may be provided to a user. However, most websites are not topically homogeneous, and, in fact, the organization of topics within directories may determine the best way to summarize site content for the user. For instance, consider the two hypothetical websites shown in
In general, hierarchical topic segmentation may provide a segmentation of a website into topically-cohesive regions that may respect the hierarchical structure of the website.
More particularly, a directory structure of a website may be modeled by a rooted tree whose leaves may be individual pages. If internal nodes may also correspond to pages, internal nodes may be modeled using the standard “index.html” convention. The hierarchical structure of a website may be derived from the tree induced by the URL structure of the site, or mined from the intra-site links or the page content of the site. There may be a page-level classifier that may assign class labels or a distribution on class labels to each page of the directory structure. This may additionally induce a distribution on the internal nodes of the tree as well, by uniformly combining the distribution of all descendant pages. The notion of cohesiveness of a subtree may be based upon an agreement between each leaf with the distribution at the root of the subtree. More formally, consider T to be a rooted tree with n leaves where leaf(T) may denote the leaves of T and root(T) may denote its root. Also consider Δ to be the maximum degree of a node in T. Considering L to be the set of class labels, assume that each leaf x in the tree T may have a distribution, px over L, that may have been generated by some page-level classifier. Given that px(i) may denote the probability that leaf x may belong to class label i, the distribution of labels at an internal node u with leaves, leaf(u) in the subtree rooted at u, may be defined as follows:
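As an illustration of the uniform combination described above, the following is a minimal sketch that assumes the distribution at an internal node u is the plain average of the leaf distributions px over leaf(u). The tree is induced from URL paths; the function names `build_tree` and `node_distribution`, the example paths, and the label names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_tree(page_dists):
    """Map each directory prefix (internal node) to the leaf pages beneath it.

    page_dists: dict of URL path -> dict of class label -> probability.
    """
    leaves_under = defaultdict(list)
    for path in page_dists:
        parts = path.strip("/").split("/")
        # Every proper prefix of the path is treated as a directory node;
        # depth 0 yields the root "/".
        for depth in range(len(parts)):
            prefix = "/" + "/".join(parts[:depth])
            leaves_under[prefix].append(path)
    return dict(leaves_under)


def node_distribution(node, leaves_under, page_dists):
    """Average the label distributions of all leaves under `node`.

    Assumption: uniform weighting of descendant pages.
    """
    leaves = leaves_under[node]
    total = defaultdict(float)
    for leaf in leaves:
        for label, prob in page_dists[leaf].items():
            total[label] += prob
    return {label: s / len(leaves) for label, s in total.items()}


# Hypothetical page-level classifier output for a three-page site.
pages = {
    "/sports/a.html": {"Sports": 1.0},
    "/sports/b.html": {"Sports": 0.5, "News": 0.5},
    "/news/c.html": {"News": 1.0},
}
tree = build_tree(pages)
root_dist = node_distribution("/", tree, pages)
sports_dist = node_distribution("/sports", tree, pages)
```

In this example the "/sports" directory is more concentrated on a single label than the root, which is the kind of contrast the cohesiveness notion is meant to capture.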
A subset S of the nodes of T may be defined herein to be a segmentation of T if, for each leaf x of T, there may be at least one node y∈S, such that x may be a leaf in the subtree rooted at y. For example, S may be a segmentation if root(T)∈S. Given a parameter k, a segmentation of size at most k may be selected where each of the components may be cohesive. For a leaf, x∈leaf(T), consider Sx∈S to be the first element of S on the ordered path from x to root(T). In this case, x may be said to belong to Sx, and a cohesiveness cost d(x, Sx) may be defined to capture the cost of assigning x to Sx. Further, a node selection cost c(y,S) may be defined to give the cost of adding y to S. The overall cost of a particular segmentation S may then be defined as:
where β may be a constant controlling the relative importance of the node selection cost and the cohesiveness cost. The algorithms described below may then find the lowest-cost segmentation, given functions c(·) and d(·) representing the problem instance. These algorithms may be based on a general dynamic program that may optimize the objective function of
After the website directory has been converted into a binary tree, it may then be determined at step 504 whether to add a node of the tree to the segmentation as a segment representing a topic. In an embodiment, dynamic programming may be used for this determination. For example, consider S to denote the current solution set. Furthermore, consider C(x, S, k) to be the cost of the best segmentation of the subtree rooted at node x using a budget of k, given that S may be the current solution. Recall that Sx, if it exists, may be the first node along the ordered path from x to the root of the tree T in the current solution S. If Sx exists, then nodes in the subtree under x may be covered by Sx, with the cost
The dynamic program may be invoked as C(root(T), ∅, k), where ∅ denotes the empty solution set. Consider x1 and x2 to denote the two children of x. The cost of the best subtree rooted at each of the two children of x using a budget of k/2 may be recursively evaluated until reaching a leaf node. Accordingly, the cost of the best subtree for the dynamic program may be defined as:
Upon reaching the leaves in the binary tree, it may be determined whether the leaves in the subtree rooted at an internal node may be assigned to the internal node at step 506. The base case for the dynamic program upon reaching a leaf may be to evaluate C(x,S, k) where x∈leaf(T) and k>0. In an embodiment where leaves may not be included in the solution, the cost of C(x,S,k) may be set to be ∞. In various other embodiments where the leaves of T may be permitted to be part of the solution, the cost may be defined as:
In the case where there may not be any remaining budget k, the base case for the dynamic program upon reaching a leaf may be to evaluate C(x,S,0) which may be defined as:
The result of evaluating the combined cost of adding a node as a segmentation point and the cost of assigning leaves in the subtree rooted at the node to the segment may then be used to complete evaluation of the dynamic program for determining whether to add the subtree rooted at the parent node of the leaf node to the segment. After the nodes of the tree have been added as the k segments, processing may be finished. There may be knd lg Δ entries in the dynamic programming table and each update of an entry may take O(k) time. So, the total running time of the dynamic program may be O(k²nd lg Δ).
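The dynamic program described above may be sketched as follows. This is an illustrative reading, not a definitive implementation, and it rests on several stated assumptions: the tree is already binary; the overall cost is taken to be β times the sum of node selection costs plus (1−β) times the sum of cohesiveness costs; the budget is split across the two children in every possible way rather than evenly; leaves are not permitted in the solution; and the node selection cost depends only on the node and its nearest selected ancestor. The names `Node`, `segment`, and `sq_euclid` are hypothetical.

```python
import math
from functools import lru_cache


class Node:
    """Binary-tree node; internal nodes average their leaves' distributions."""

    def __init__(self, name, children=(), dist=None):
        self.name = name
        self.children = list(children)
        if self.children:
            self.leaves = [l for c in self.children for l in c.leaves]
            labels = {lab for l in self.leaves for lab in l.dist}
            n = len(self.leaves)
            self.dist = {lab: sum(l.dist.get(lab, 0.0) for l in self.leaves) / n
                         for lab in labels}
        else:
            self.leaves = [self]
            self.dist = dist


def sq_euclid(p, q):
    """Squared Euclidean distance between two label distributions (dicts)."""
    return sum((p.get(l, 0.0) - q.get(l, 0.0)) ** 2 for l in set(p) | set(q))


def segment(root, k, beta, select_cost, cohesion_cost=sq_euclid):
    """Return (cost, selected node names) using at most k segments."""

    @lru_cache(maxsize=None)
    def best(x, anc, budget):
        # Cost of the subtree at x using exactly `budget` segments, where
        # `anc` is the nearest selected ancestor (None if there is none).
        if not x.children:
            # Assumption: leaves cannot be selected, and an uncovered leaf
            # makes the segmentation invalid.
            if budget > 0 or anc is None:
                return (math.inf, frozenset())
            return ((1 - beta) * cohesion_cost(x.dist, anc.dist), frozenset())
        x1, x2 = x.children
        options = []
        # Option 1: do not select x; split the budget between the children.
        for j in range(budget + 1):
            c1, s1 = best(x1, anc, j)
            c2, s2 = best(x2, anc, budget - j)
            options.append((c1 + c2, s1 | s2))
        # Option 2: select x, which then covers the leaves beneath it.
        if budget >= 1:
            own = beta * select_cost(x, anc)
            for j in range(budget):
                c1, s1 = best(x1, x, j)
                c2, s2 = best(x2, x, budget - 1 - j)
                options.append((own + c1 + c2, s1 | s2 | {x.name}))
        return min(options, key=lambda t: t[0])

    return min((best(root, None, j) for j in range(1, k + 1)),
               key=lambda t: t[0])


# Hypothetical site: two directories, each topically pure.
a = Node("A", [Node("a1", dist={"Sports": 1.0}), Node("a2", dist={"Sports": 1.0})])
b = Node("B", [Node("b1", dist={"News": 1.0}), Node("b2", dist={"News": 1.0})])
site = Node("root", [a, b])
unit_cost = lambda x, anc: 1.0  # placeholder node selection cost
cost3, chosen3 = segment(site, k=3, beta=0.1, select_cost=unit_cost)
cost1, chosen1 = segment(site, k=1, beta=0.1, select_cost=unit_cost)
```

With a budget of three, the sketch selects the two pure directories rather than the root, because doing so removes all cohesiveness cost at the price of one extra selected node; with a budget of one, only the root can cover every leaf.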
Notice that the node selection cost c(·) may be helpful in an embodiment for incorporating heuristic choices and requirements. For instance, setting c(·) to be sufficiently high for two nodes, one of which may be a parent of the other, when the two nodes are very close in distribution, can be used to ensure that nodes added as segments provide extra information to the user.
In an embodiment, the number of segments for the cost function C(x, S, k) may be initialized to a default number. At most k segments may be automatically discovered by running the dynamic program. In practice, the default number of segments may therefore be initialized to be larger than an estimated number of segments expected in the website. For instance, the default number of segments may be initialized to 10 if the expected number of segments may be 7.
There may be different variants of the node selection cost c(·) and the cohesiveness cost d(·) that may be used for the equation of the overall cost,
In an embodiment, the cohesiveness cost measure may be based on the Kullback-Leibler (“KL”) divergence in information theory. For every page x and the node Sx to which it may belong, the cohesiveness cost of the assignment of x to Sx may be defined to be:
The KL-divergence is the relative entropy of two distributions px and pSx over an alphabet L that may represent the average number of extra bits needed to encode data drawn from px using a code derived from pSx. This may correspond to minimizing the wastage in description cost of leaves of the tree using the internal nodes that are selected. Furthermore, using the KL-divergence as a measure of distance may be equivalent to assuming that the class distribution at the leaves may have been generated from a multinomial model over classes at the internal node. (See for example A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, Clustering with Bregman divergences, Journal of Machine Learning Research, 6:1705-1749, 2005.) These properties may make the KL-divergence a good choice for the cohesiveness cost.
In another embodiment, the cohesiveness cost measure may be based on the squared Euclidean distance. The sum of squared Euclidean cost has been extensively used in many applications and may be considered equivalent to modeling the internal node as a multidimensional Gaussian distribution. (See again for example A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, Clustering with Bregman divergences, Journal of Machine Learning Research, 6:1705-1749, 2005.) The distance between a leaf x (web page) and an internal node Sx (subdirectory) may be computed using the squared Euclidean distance between the corresponding class distributions, which may be defined to be:
In yet a third embodiment, the cohesiveness cost measure may be based on a cosine cost measure. For instance, the negative cosine dissimilarity measure may be employed as a cohesiveness cost, as follows:
The cosine cost measure is well-known in the art for clustering documents in information retrieval. (See, for example, A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, Clustering on the Unit Hypersphere Using von Mises-Fisher Distributions, Journal of Machine Learning Research, 6:1345-1382, 2005.)
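The three cohesiveness cost variants discussed above may be sketched as follows, using the standard textbook definitions of each measure. The function names are hypothetical, distributions are assumed to be equal-length lists indexed by class label, the KL-divergence is computed in bits, and the small `eps` guard against a zero probability in the node distribution is an assumption added for numerical safety. The cosine variant is taken here to be the negated cosine similarity, which is one plausible reading of "negative cosine dissimilarity".

```python
import math


def kl_cost(p_leaf, p_node, eps=1e-12):
    """KL(p_leaf || p_node) in bits: extra bits to encode the leaf's labels
    with a code built for the node's distribution."""
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p_leaf, p_node) if pi > 0)


def squared_euclidean_cost(p_leaf, p_node):
    """Squared Euclidean distance between the two class distributions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p_leaf, p_node))


def negative_cosine_cost(p_leaf, p_node):
    """Negated cosine similarity: more similar distributions cost less."""
    dot = sum(pi * qi for pi, qi in zip(p_leaf, p_node))
    norm_p = math.sqrt(sum(pi * pi for pi in p_leaf))
    norm_q = math.sqrt(sum(qi * qi for qi in p_node))
    return -dot / (norm_p * norm_q)


# A page labeled purely with the first class, under a node split evenly
# between two classes.
leaf = [1.0, 0.0]
node = [0.5, 0.5]
kl_value = kl_cost(leaf, node)
sq_value = squared_euclidean_cost(leaf, node)
cos_value = negative_cosine_cost(leaf, node)
```

Because the node distribution is an average that includes the leaf itself, every class with positive probability at the leaf also has positive probability at the node, so the KL cost stays finite in that setting.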
In addition to different variants of the cohesiveness cost d(·), there may be different variants of the node selection cost c(·) that may be used for the equation of the overall cost, C(x,S,k)=c(x,S)+d(x,Sx). In an embodiment, the node selection cost c(·) may be based on penalizing a node that may be added as a new element of S if it provides little information beyond its closest parent already in the segmentation solution. A related cost measure, referred to as information gain ratio, in the context of decision tree induction was introduced by Quinlan. (See J. R. Quinlan, Induction of Decision Trees, in J. W. Shavlik and T. G. Dietterich, editors, Readings in Machine Learning, Morgan Kaufmann, 1990, originally published in Machine Learning 1:81-106, 1986.) To implement this condition, an α-measure may be defined. Consider, first, T to be a tree consisting of subtrees T1, . . . , Ts. There may be two possible encoding schemes to encode the label of a particular leaf of T. In the first scheme, the label may be communicated using an optimal code based on the distribution of labels in T. In the second scheme, it may first be communicated whether or not the designated leaf lies in T1, and then the label may be encoded using a tailored code for either T1 or T \T1 as appropriate. The second scheme may correspond to adding T1 to the segmentation. Its overall cost may not be better than the first, but if T1 may be completely distinct from T \T1, then the cost of the second scheme may be equivalent to the first. Consider p1=|T1|/|T| to be the probability that a uniformly-chosen leaf of T may lie in T1. Then the cost of communicating whether a leaf may lie within T1 may be H(p1). In a worst case, T1 may look identical to T \T1 and the second scheme may be H(p1) bits more expensive than the first. In such a case, the information about the subtree may provide no leverage to the user. 
The value of subtree T1 relative to its parent may be characterized, therefore, by asking where on the range between H(T) and H(T)+H(p1) the cost of the second scheme may lie. With this intuition in mind, the cost measure may be formally defined. Consider x to denote the current node considered to be added to the solution S. Recall that Sx may be its nearest parent that is already a part of the solution S. Assuming Sx exists, consider y to denote Sx. Then consider x′ to be a hypothetical node such that leaf(Tx′)=leaf(Ty)\leaf(Tx), i.e., x′ may include the leaves under the subtree rooted at y but not x. Furthermore, assume n=|leaf(Ty)|, nx=|leaf(Tx)|, and nx′=|leaf(Tx′)|. The split cost for the binary entropy may be defined as H2(nx/n). Using the split cost, the α-measure may be defined to be:
It may be seen that α may represent values between 0 and 1, with lower values indicating a good split. The cost of adding a node to the solution may then be:
c(x,S)=c(x,y)=α(x,y)·nx.
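A minimal sketch of the α-measure follows, written from the intuition given above rather than from a stated formula: α is assumed to be the position of the second encoding scheme's cost on the range from H(py) to H(py)+H2(nx/n), that is, (split cost + weighted entropies of x and x′ − H(py)) divided by the split cost. The function names are hypothetical, distributions are equal-length lists, and x′ is assumed to be non-empty so that the split cost is nonzero.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def mean_dist(dists):
    """Uniform average of a list of equal-length distributions."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]


def alpha_measure(x_leaves, rest_leaves):
    """Assumed alpha(x, y): 0 for a perfectly informative split, 1 when the
    subtree at x looks identical to the rest of y.

    x_leaves:    leaf distributions under x
    rest_leaves: leaf distributions under y but not under x (the node x')
    """
    n_x, n_rest = len(x_leaves), len(rest_leaves)
    n = n_x + n_rest
    split = entropy([n_x / n, n_rest / n])  # H2(nx/n)
    second_scheme = (split
                     + (n_x / n) * entropy(mean_dist(x_leaves))
                     + (n_rest / n) * entropy(mean_dist(rest_leaves)))
    first_scheme = entropy(mean_dist(x_leaves + rest_leaves))  # H(p_y)
    return (second_scheme - first_scheme) / split


def selection_cost(x_leaves, rest_leaves):
    """c(x, y) = alpha(x, y) * nx, as stated in the text."""
    return alpha_measure(x_leaves, rest_leaves) * len(x_leaves)


# A subtree that is completely distinct from the rest of its parent.
distinct = alpha_measure([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
# A subtree that is indistinguishable from the rest of its parent.
identical = alpha_measure([[0.5, 0.5]], [[0.5, 0.5]])
identical_cost = selection_cost([[0.5, 0.5]], [[0.5, 0.5]])
```

The two examples sit at the ends of the range: the distinct subtree costs nothing to add, while the indistinguishable one pays the full split cost for each of its leaves.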
One requirement of using the α-measure in the dynamic program may be to select the root of T, i.e., root(T)∈S, in order to compute the cost of adding additional internal nodes. For some websites, the root directory may contain a large number of files that may not be made part of the solution in their own right and, therefore, may need the root as a selected node to cover them. In general, the α-measure may act as a regularization term in the overall cost function
that may regulate the number of segments selected and may help select correct segments.
In practice, varying values of β between 0 and 1 for the equation of overall cost,
may result in obtaining solutions with different precision and recall values for manually labeled websites depending upon the combination of cohesiveness cost and node selection cost. Configurations with a higher value of β may find fewer segments in a website than lower values, since higher values of β may bias the overall cost function towards not adding a node. Such configurations may be expected to have higher precision but lower recall. Configurations with lower values of β may be expected to achieve lower precision and higher recall. In some configurations, the combination of the cohesiveness cost measure based on the KL-divergence and the node selection cost measure based on the α-measure may have the desirable property of giving good results over a much larger range of β than using the cohesiveness cost measure based on either the squared Euclidean distance or the cosine cost measure.
Thus the present invention may flexibly provide a framework for incorporating different variants of the node selection cost and the cohesiveness cost to be used. The system and method may apply broadly to provide a simple and concise site-level summary to help users who wish to understand the overall content and focus of a particular website amenable to hierarchical segmentation by topic. Moreover, the system and method may be applied to extend existing online search applications which may provide a link to a topic within a topically-focused segment of a large directory. Furthermore, a topical segmentation may provide a useful guide to determine the appropriate granularity of a site hosting many aggregated individual websites. Those skilled in the art will appreciate that topical segmentation may be applicable for these and other applications, such as website classification.
As can be seen from the foregoing detailed description, the present invention provides an improved system and method for hierarchical segmentation of websites by topic. An organization of topics may be determined within directories of a website, the hierarchical arrangement of the web pages in the website may be segmented by topic, and the segments representing regions of coherent topics in the website directory may be output. The present invention also provides a set of cost measures characterizing the benefit accrued by introducing a segmentation of the website based on topics. Advantageously, the present invention may thus provide a flexible framework to allow implementations incorporating specific heuristic choices and requirements. As a result, the system and method provide significant advantages and benefits needed in contemporary computing and in online applications.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.