1. Field
The subject matter disclosed herein relates to data processing, and more particularly to data processing methods and systems that measure entropy and/or otherwise utilize entropy measurements.
2. Information
Data processing tools and techniques continue to improve. Information in the form of data is continually being generated or otherwise identified, collected, stored, shared, and analyzed. Databases and other like data repositories are commonplace, as are related communication networks and computing resources that provide access to such information.
The Internet is ubiquitous; the World Wide Web provided by the Internet continues to grow with new information seemingly being added every second. To provide access to such information, tools and services are often provided which allow for the copious amounts of information to be searched through in an efficient manner. For example, service providers may allow for users to search the World Wide Web or other like networks using search engines. Similar tools or services may allow for one or more databases or other like data repositories to be searched.
With so much information being available, there is a continuing need for methods and systems that allow for pertinent information to be located or otherwise identified in an efficient manner.
Non-limiting and non-exhaustive aspects are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various figures unless otherwise specified.
Techniques are provided herein that may be used to allow for pertinent information to be located or otherwise identified in an efficient manner. These techniques may, for example, allow for more efficient searching of items that may be classified into a taxonomy having a hierarchical structure by measuring entropy associated with the classification distribution and inherent hierarchical dependency.
First device 102, second device 104 and third device 106, as shown in FIG. 1, may each be representative of any device, appliance, or machine that may be configurable to exchange data over network 108.
Similarly, network 108, as shown in FIG. 1, may be representative of one or more communication links, processes, and/or resources configurable to support the exchange of data between first device 102, second device 104, and third device 106.
As illustrated, for example, by the dashed-lined box shown as being partially obscured by third device 106, there may be additional like devices operatively coupled to network 108.
It is recognized that all or part of the various devices and networks shown in system 100, and the processes and methods as further described herein, may be implemented using or otherwise include hardware, firmware, software, or any combination thereof.
Thus, by way of example but not limitation, second device 104 may include at least one processing unit 120 that is operatively coupled to a memory 122 through a bus 128.
Processing unit 120 is representative of one or more circuits configurable to perform at least a portion of a data computing procedure or process. By way of example but not limitation, processing unit 120 may include one or more processors, controllers, microprocessors, microcontrollers, application specific integrated circuits, digital signal processors, programmable logic devices, field programmable gate arrays, and the like, or any combination thereof.
Memory 122 is representative of any data storage mechanism. Memory 122 may include, for example, a primary memory 124 and/or a secondary memory 126. Primary memory 124 may include, for example, a random access memory, read only memory, etc. While illustrated in this example as being separate from processing unit 120, it should be understood that all or part of primary memory 124 may be provided within or otherwise co-located/coupled with processing unit 120.
Secondary memory 126 may include, for example, the same or similar type of memory as primary memory and/or one or more data storage devices or systems, such as, for example, a disk drive, an optical disc drive, a tape drive, a solid state memory drive, etc. In certain implementations, secondary memory 126 may be operatively receptive of, or otherwise configurable to couple to, a computer-readable medium 128. Computer-readable medium 128 may include, for example, any medium that can carry and/or make accessible data, code and/or instructions for one or more of the devices in system 100.
Second device 104 may include, for example, a communication interface 130 that provides for or otherwise supports the operative coupling of second device 104 to at least network 108. By way of example but not limitation, communication interface 130 may include a network interface device or card, a modem, a router, a switch, a transceiver, and the like.
Second device 104 may include, for example, an input/output 132. Input/output 132 is representative of one or more devices or features that may be configurable to accept or otherwise introduce human and/or machine inputs, and/or one or more devices or features that may be configurable to deliver or otherwise provide for human and/or machine outputs. By way of example but not limitation, input/output device 132 may include an operatively configured display, speaker, keyboard, mouse, trackball, touch screen, data port, etc.
With regard to system 100, in certain implementations first device 102 may be configurable, for example, using a browser or other like application, to seek the assistance of second device 104 by providing or otherwise identifying a query that second device 104 may then process. For example, one such query may be associated with a search engine provider service provided by or otherwise associated with second device 104. In response to such a query, for example, second device 104 may then provide or otherwise identify a query response that first device may then process.
Here, for example, to process such a query second device may be configured to access stored data associated with various items that may be available within system 100 and which may be of interest or otherwise associated with information included within the query. The stored data may, for example, include data that identifies the item, its location, etc. By way of example but not limitation, the item may include a document or web page that is accessible from, or otherwise made available by, third device 106 as part of the World Wide Web portion of the Internet.
Continuing with this example, second device 104 may be configured to examine the stored data in such a manner as to identify one or more items deemed to be relevant to the query. By way of example but not limitation, second device 104 may be configurable to select items deemed relevant to such a query based, at least in part, on scores assigned to or otherwise associated with potential candidate items. Such scores (e.g., PageRank, etc.) and/or other like useful search engine data may, for example, result from other processes conducted by second device 104 or other devices. For example, one or more devices may be configurable to identify items, classify items, and/or score the items as needed to provide or maintain additional (e.g., perhaps local) stored data that may be accessed by a search engine in response to a query.
Reference is now made to FIG. 2, which illustrates an exemplary process 200 that may, for example, be implemented in whole or in part within system 100.
Process 200 may, for example, include at least one item identifying procedure 202 that generates or otherwise identifies item data 204. By way of example but not limitation, item identifying procedure 202 may include one or more web crawlers or other like processes that communicate with applicable devices coupled to network 108 and operate to gather information about items available through or otherwise made accessible over network 108 by such devices. Such processes and other like processes are well known and beyond the scope of the present subject matter.
Item data 204 may, for example, include information about the item such as identifying information, location information, etc. Item data 204 may, for example, include all or a portion of the text or words associated with information that may be included in the item.
As used herein, the term “item” is meant to include any form or type of data that may be communicated. By way of example but not limitation, an item may include all or part of one or more web pages, documents, files, databases, objects, messages, queries, and the like, or any combination thereof.
Process 200 may, for example, include at least one classifying procedure 206 that accesses item data 204 and generates or otherwise identifies taxonomic data 208 associated with the item. By way of example but not limitation, classifying procedure 206 may be configurable to classify all or part of item data 204 into a taxonomy having a hierarchical structure. For example, at least a portion of one exemplary taxonomy may include a tree or sub-tree structure having a root node that is superior to one or more levels comprising one or more inner nodes that are superior to a plurality of leaf nodes. Classifying procedure 206 may, for example, be configurable to assign distribution data 208a to such leaf nodes. For example, in certain implementations distribution data 208a may include a distribution value (e.g., a normalized value) or the like that is assigned to a leaf node. In other implementations, for example, distribution data 208a may include a probability associated with individual leaf nodes.
Taxonomic data 208 may, for example, include dependency data 208b that is associated with the hierarchical structure. For example, dependency data 208b may include data associated with the distribution and/or arrangement of inner nodes within the hierarchical structure.
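By way of a non-limiting illustration, the following Python sketch shows one possible in-memory representation of such taxonomic data, with distribution data 208a assigned to leaf nodes and dependency data 208b established for an inner node as the sum over its children (anticipating the exemplary distribution for item 600 discussed later). The nested-dictionary format, field names, and function name are illustrative assumptions rather than features of any particular described implementation.

def establish_dependency_data(node):
    """Return the probability mass at node, filling in inner-node values bottom-up."""
    if "children" not in node:          # leaf node: distribution data 208a already assigned
        return node["p"]
    node["p"] = sum(establish_dependency_data(child) for child in node["children"])
    return node["p"]

# Hypothetical taxonomic data for one item, using the distribution of item 600 discussed later.
taxonomic_data = {
    "name": "root",
    "children": [
        {"name": "North", "children": [{"name": "San Francisco", "p": 0.4},
                                       {"name": "San Jose", "p": 0.5}]},
        {"name": "South", "children": [{"name": "San Diego", "p": 0.05},
                                       {"name": "Los Angeles", "p": 0.05}]},
    ],
}

establish_dependency_data(taxonomic_data)   # North -> 0.9, South -> 0.10, root -> 1.0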
An entropy measurement procedure 210 may be configurable to access taxonomic data 208 and generate or otherwise identify entropic data 212 associated with the taxonomic data and hence the item data. As illustrated in FIG. 2, entropic data 212 may include, for example, at least one tree entropy value 212a.
Entropy measurement procedure 210 may be configurable to access distribution data 208a and to either access and/or otherwise establish dependency data 208b (e.g., as shown within entropy measurement procedure 210). Dependency data 208b may, for example, be established based, at least in part, on the hierarchical structure, or an applicable portion thereof, as per the taxonomy applied by classifying procedure 206 and with consideration of the distribution data 208a.
As illustrated, entropy measurement procedure 210 may, for example, include the application of at least one cost function 226 in establishing dependency data 208b. As illustrated, entropy measurement procedure 210 may, for example, include the application of at least one weighting parameter 228 in establishing dependency data 208b. Several exemplary weighting parameters and cost functions, e.g., which may be used to establish weighting parameters, are described in greater detail below.
Also, as described in greater detail below, a tree entropy operation or formula may, by way of example but not limitation, be applied by entropy measurement procedure 210 such that the resulting entropic data 212 provides a measure of the extent to which the item is topic-focused with regard to the topic of the taxonomy.
In certain implementations, all or portions of dependency data 208b may be provided in taxonomic data 208, for example, as generated by classifying procedure 206 or the like. For example, it may be beneficial for classifying procedure 206 to be further configurable to perform at least some of the processing associated with the establishment of dependency data 208b (e.g., while establishing distribution data 208a). In other implementations, for example, all or portions of dependency data 208b may be established by measurement procedure 210.
With respect to exemplary process 200, entropic data 212, which may include, for example, tree entropy value 212a, may then be provided or otherwise made accessible to an item scoring procedure 214. Item scoring procedure 214 may, for example, be configurable to establish or otherwise identify item score data 218. Item scoring procedure 214 may, for example, be configurable to establish item score data 218 based, at least in part, on entropic data 212 and one or more other parameters 216 (e.g., a PageRank or related metric(s), etc.). In certain implementations, for example, item score data 218 may include a single numerical score associated with the item identified in item data 204.
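As a non-limiting sketch of how item scoring procedure 214 might combine such inputs, the following Python fragment blends a tree entropy value with one other parameter into a single numerical score; the particular combination, the function name score_item, and the parameter alpha are illustrative assumptions and not a formula prescribed herein.

def score_item(tree_entropy, other_score, alpha=0.5):
    """Combine a tree entropy value (lower = more topic-focused) with another parameter."""
    focus = 1.0 / (1.0 + tree_entropy)          # map the entropy to a (0, 1] focus factor
    return alpha * focus + (1.0 - alpha) * other_score

item_score = score_item(tree_entropy=0.59, other_score=0.8)   # a single numerical score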
A search engine procedure 220 may be configurable to receive or otherwise access item score data 218 and based, at least in part, on item score data 218 provide or otherwise identify a query response 224 in response to a query 222.
Thus, as illustrated in the preceding example, in accordance with certain aspects of the methods and systems presented herein, entropy measurement techniques or resulting entropic measurements may be used to possibly refine or otherwise further support in some manner a data query, search engine, or other like data processing service, system, and/or device.
Reference is now made to FIG. 3, which illustrates another exemplary process 300 that may, for example, be implemented in whole or in part within system 100.
With this example, it is illustrated that entropy measurement techniques or resulting entropic measurements may be used to possibly test or otherwise study the performance of classifying procedure 206. Thus, for example, item data 304 may be carefully selected or otherwise specifically created to “focus” within a given taxonomy in a desired manner. For example, item data 304 may be expected to be very focused or, conversely, barely focused with respect to the taxonomy. As such, once classifying procedure 206 has generated taxonomic data 308, entropy measurement procedure 210 may be employed to generate entropic data 312, which may then be examined to judge the performance of classifying procedure 206.
Attention is now drawn to FIG. 4, which illustrates an exemplary process 400 that may, for example, be implemented in whole or in part within system 100.
As shown in FIG. 4, process 400 may, for example, include classifying procedure 206 that accesses item data 204 and establishes taxonomic data 208, and a classifying procedure 406 that accesses second item data 404 and establishes taxonomic data 408. Here, for example, the classifying procedures 206 and 406 may be the same or different. Process 400 may include, for example, a divergence measurement procedure 402 (which may include an entropy measurement procedure 210) that accesses taxonomic data 208 and taxonomic data 408 to establish a divergence value 410. Process 400 may include, for example, a search engine procedure 220 that accesses at least the divergence value 410 in generating a query response 412 in response to query 222.
In process 400, divergence measurement procedure 402 may, for example, be configurable to measure similarity between the item associated with item data 204 and the second item associated with second item data 404. This measurement may be provided in divergence value 410, and may be used by search engine procedure 220 to adjust or otherwise affect query response 412. For example, in certain implementations, second item data may include or otherwise be based, at least in part, on query 222 such that the resulting tree divergence value 410 may represent how similar the item associated with item data 204 is to the query. In certain situations, it may be desirable for query response 412 to identify some items that do not appear to match as closely as other items that are identified. Thus, for example, if query 222 includes the term “mouse”, then it may be beneficial for the query response to identify some items that appear to focus on an “animal” mouse and others that appear to focus on “computer hardware” related mouse devices.
At this point attention is drawn to FIG. 5, which is a flow diagram illustrating exemplary operations that may, for example, be implemented in whole or in part within system 100.
In 502, an item may be identified for classification into a taxonomy having a hierarchical structure. In 504, the item may be classified and taxonomic data including at least distribution data established. In 506, entropic data for the item may be determined based, at least in part, on the distribution data and established dependency data (e.g., associated with the distribution and hierarchical structure). In 508, a tree entropy value may be identified. In 510, a score value may be determined, for example, based, at least in part, on the tree entropy value from 508 and/or the entropic data from 506.
In 514, a second item may be identified for classification into the same taxonomy having the same hierarchical structure. In 516, the second item may be classified and taxonomic data including distribution data established. In 518, entropic data for the second item may be determined based, at least in part, on the distribution data and established dependency data. In 520, a tree entropy value may be identified. In 510, a score value may be determined, for example, based, at least in part, on the tree entropy value from 520 and/or entropic data from 518.
In 512, a divergence value may be determined based, at least in part, on the entropic data from 506 and 518. In 510, a score value may be determined, for example, based, at least in part, on the divergence value from 512.
In the following sections, certain exemplary techniques are described that may be used to measure or otherwise determine and/or utilize the entropy of a distribution that takes into account the hierarchical structure of a taxonomy. For example, a formal treatment of “tree entropy” is provided that may be used or otherwise adapted for use in system 100 or portions thereof.
As previously illustrated, one exemplary application of tree entropy may be in the classification of information, such as, where an item may be distributed over various leaf nodes of a given topic taxonomy and it may be desirable to measure or otherwise determine an extent to which the item is topic-focused.
As used herein, entropy refers to a fundamental measure of the uncertainty represented by a probability distribution. By way of example, given a discrete distribution p⃗=(p1, . . . , pn), the Shannon entropy H(p⃗) may be defined as H(p⃗)=−Σi pi log pi.
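By way of a non-limiting illustration, the following short Python sketch computes this quantity; base-10 logarithms are assumed here simply to match the numerical example presented later, and the input values shown correspond to the leaf distribution of item 600 discussed below.

import math

def shannon_entropy(p, base=10):
    """Shannon entropy H(p) = -sum_i p_i log p_i; zero entries contribute nothing."""
    return -sum(p_i * math.log(p_i, base) for p_i in p if p_i > 0)

shannon_entropy([0.4, 0.5, 0.05, 0.05])   # approximately 0.44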
Assuming that a given item has membership in each of n classes (e.g., as assigned by a classifying procedure), in accordance with certain aspects of the methods and systems presented herein, it may be useful to determine to what extent the item is “focused” with respect to the classes. Here, by way of example but not limitation, such an item may be considered “focused” if its membership is “scattered” as little as possible among all the classes.
One approach might be to interpret the membership of the document in each of the n classes as a probability distribution, and use the Shannon entropy of this distribution as a measure of its focus.
However, consider a scenario in which the n classes have some relationship among them; for instance, the classes might represent the leaf nodes of a tree (or sub-tree) that corresponds to a geographical taxonomy.
In this example, item 600 on the left in FIG. 6 has most of its distribution concentrated within a single sub-tree (e.g., the “North” sub-tree), while item 602 on the right in FIG. 6 has its distribution spread more evenly across the sub-trees; a measure based on Shannon entropy alone ignores this hierarchical relationship among the classes and thus may not adequately capture how topic-focused each item is.
Thus, in accordance with certain aspects of the present subject matter, a more principled and/or systematic technique has been developed that may provide for methods and systems that consider entropic properties of a distribution on a hierarchical structure, such as, for example, dependency data associated with the hierarchical structure of a tree, sub-tree, or the like.
In the following sections an exemplary definition of “tree entropy” is provided by first postulating a set of axioms for tree entropy; these are generalizations of Shannon's axioms to a tree case. The set of axioms leads to a recursive definition from which an explicit functional form of tree entropy may be derived which satisfies the desired axioms. Several interesting properties of tree entropy will be described which tend to demonstrate the robustness of the definition. For example, tree entropy may be invariant under simple transformations of the tree and scaling of the probability distribution. Under an additional yet reasonable assumption on a cost function, for example, tree entropy may be a concave function. Further, under certain conditions tree entropy may be maximized for distributions corresponding to “maximum uncertainty” for the given tree structure. Still further, as will be described, a generalization of KL-divergence may be derived for tree entropy, for example, in the situation wherein two probability distributions over the same tree have the same cost function. Additionally, as shown below, an interpretation of tree entropy may be made, for example, by means of a model for generating symbols (e.g., in the form of or otherwise associated with dependency data).
Specifying natural requirements via a set of axioms and pinning down the functions satisfying these axioms has often resulted in fundamental insights for many problems, some well-known ones being the axioms for voting (see, e.g., K. Arrow. Social Choice and Individual Values (2nd Ed.). Yale University Press, 1963), clustering (see, e.g., J. Kleinberg. An impossibility theorem for clustering. In Proceedings of the 16th Conference on Neural Information Processing Systems, 2002), and PageRank (see, e.g., A. Altman and M. Tennenholtz. Ranking systems: The PageRank axioms. In Proceedings of the 6th ACM Conference on Electronic Commerce, pages 1-8, 2005).
While these so-called axiomatic approaches have often been used to refute the existence of an ideal procedure in these problems, as shown below, the result for tree entropy appears to be different in that, after formulating certain rules, one may construct a function that uniquely satisfies them.
In accordance with certain embodiments, tree entropy may, for example, be adapted to measure a cohesiveness of an item when it is classified into a taxonomy. Thus, for example, tree entropy may be used to determine how focused or unfocused such an item is on a topic. One example of such an implementation is shown in FIG. 2.
In accordance with certain other embodiments, tree entropy may, for example, be adapted for use in measuring the performance of a classifying procedure. Thus, for example, given an item that is considered to be well focused, one may use tree entropy to measure how well the classifying procedure performs in terms of placing such an item at the leaf nodes of a taxonomy hierarchy. One example of such an implementation is shown in FIG. 3.
In accordance with still other embodiments, as a consequence of a generalization of KL-divergence to trees, tree entropy may, for example, be adapted to measure similarity between a first item and a second item (e.g., a document and a query, respectively), wherein both the items are classified into the same taxonomy by one or more classifying procedures. This may be useful, for example, with search and retrieval services, or the like. One example of such an implementation is shown in FIG. 4.
An exemplary definition of tree entropy will now be developed in more specificity.
A rooted tree may be denoted by T, and its nodes by V(T). For each node ν of T, let π(ν) and C(ν) denote the parent node and the set of children nodes of ν, respectively. Nodes with empty C(ν) are the leaf nodes of T, denoted by l(T). Each tree T with n leaf nodes may have a set of probabilities p1, . . . , pn associated with the corresponding leaf nodes, which may be denoted by the vector p⃗=(p1, . . . , pn).
For simplicity one may use pT to denote the probability associated with the root of the tree T.
Associated with each node ν∈V(T) is a non-negative real cost cT(ν). For simplicity of notation, c(T) is used to denote the cost of the root of tree T. If T′ is a sub-tree of T, the cost function for T′ will be the natural restriction of that for T, e.g., cT′(ν)=cT(ν) for all nodes ν∈V(T′). One may drop the subscript and denote the cost function simply as c(·).
The tree entropy for tree T and probability vector p⃗ may be denoted by H(T, p⃗).
One may denote the Shannon entropy (or simply entropy) of a distribution by H1(p⃗). If the distribution is not normalized, e.g., if Σi pi=P, then one may define H1(p⃗)=−Σi (pi/P)log(pi/P).
For simplicity, the recursive definition of tree entropy will be presented first. After that, it will be shown how the definition actually arises from a set of axioms similar to the original entropy axioms by Shannon.
The recursive definition of tree entropy may include the base case R1, and the recursive hypothesis R2 that utilizes the structure of the tree.
R1. Base case (e.g., a “flat” tree): For all n-dimensional p⃗, H(Sn, p⃗)=c(Sn)·H1(p⃗), where H1(p⃗) denotes the Shannon entropy of p⃗ as defined above and Sn denotes a star graph consisting of a root connected to n leaf nodes.
R2. Inductive case (e.g., with inner nodes, in terms of children): Let the root of T have children u1, . . . , uk, and let Ti denote the sub-tree rooted at ui, for each i∈[k]. Let Sk be a star graph, whose root is the root of T and whose leaf nodes are u1, . . . , uk. Further, let c(Sk)=c(T). Then for all p⃗, H(T, p⃗)=H(Sk, q⃗)+Σi (pui/pT)·H(Ti, p⃗i), where q⃗=(pu1, . . . , puk) and p⃗i denotes the restriction of p⃗ to the leaf nodes of Ti.
Notice that R1 and R2 together provide the recurrence H(T, p⃗)=c(T)·H1(q⃗)+Σi (pui/pT)·H(Ti, p⃗i)  (1), with q⃗ and p⃗i as above.
Note that R1 essentially implies that for a tree (or sub-tree) with a single node, the tree entropy for that tree (or its restriction to a sub-tree) is trivially zero, irrespective of the probability of the node and its cost. For a “flat” tree (or sub-tree) of a root connected only to leaf nodes, the tree provides no additional information separating any set of leaf nodes from the rest, implying that each leaf is completely separate from the others. In this case, as R1 points out, the tree entropy reduces to Shannon entropy (e.g., to within the constant factor c(Sn)). R2 may be used, for example, to compute tree entropy by recursively using the base case: e.g., the tree entropy for a tree (or sub-tree) is the sum of those of its children sub-trees, plus the additional entropy incurred in the distribution of the probability at the root among its children. The costs at each node may be used in determining the effect of the tree structure on the final form of the tree entropy. As described below, in certain implementations, setting all node costs to one (=1) may reduce the results to Shannon entropy, while other cost functions may allow a tree entropy formulation to satisfy additional tree-specific desiderata.
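By way of a non-limiting illustration, the following Python sketch applies the recurrence described above (as reconstructed in R1, R2, and Equation (1)), with a tree node represented as a dictionary carrying a cost "c" and either a leaf probability "p" or a list of "children"; this representation, the function names, and the use of base-10 logarithms are illustrative assumptions.

import math

def mass(node):
    """Total probability pT of the (sub-)tree rooted at node."""
    return node["p"] if "children" not in node else sum(mass(ch) for ch in node["children"])

def h1(q, base=10):
    """Shannon entropy of a possibly unnormalized vector q, per the H1 definition above."""
    total = sum(q)
    return -sum((x / total) * math.log(x / total, base) for x in q if x > 0)

def tree_entropy(node, base=10):
    """Recursive tree entropy per R1/R2: a single node contributes zero (base case)."""
    if "children" not in node:
        return 0.0
    q = [mass(ch) for ch in node["children"]]    # probabilities of the children u1..uk
    p_t = sum(q)
    # R2: star-graph term c(T)*H1(q) plus the probability-weighted entropies of the sub-trees.
    return node["c"] * h1(q, base) + sum((qi / p_t) * tree_entropy(ch, base)
                                         for qi, ch in zip(q, node["children"]))

# For a "flat" tree the result reduces to c(Sn) times the Shannon entropy, per R1.
flat = {"c": 1.0, "children": [{"p": 0.25} for _ in range(4)]}
tree_entropy(flat)   # log10(4), approximately 0.60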
Several axioms associated with tree entropy will now be introduced. It may not be immediately clear why R1 and R2 are the “right” rules to use in order to define tree entropy. However, as will be shown, they arise as consequences of Shannon's original axioms on entropy, modified to handle hierarchical structures, such as, e.g., trees.
Shannon's seminal paper (e.g., C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948) gave three desiderata, from which the uniqueness (up to a constant factor) of informational entropy was derived. Firstly, the entropy will be a continuous function in the pi. Secondly, if there are n possible outcomes, all of which are equally likely (e.g., pi=1/n for all i), then the entropy is monotonically increasing in n. Thirdly, let Π be a partition of the possible outcomes, and for each I∈Π, let p⃗I denote the restriction of p⃗ to the outcomes in I; then the entropy of p⃗ equals the entropy of the induced distribution on the parts of Π plus the sum, over I∈Π, of the probability of part I times the entropy of the normalized p⃗I.
It will now be shown that one may model requirements after these conditions, and establish a recursive definition of tree entropy. Here, one may use the first condition essentially without modification and alter the second and third conditions to respect an underlying hierarchical structure (e.g., of the tree, etc.). For the second condition, one may modify it by restricting attention to leaf nodes that are siblings of each other. In a modification of the third condition, to respect the hierarchical structure, one may restrict the set of allowable partitions; for example to only allow partitions that do not “cross” sub-tree boundaries.
Formally, given a tree T, and a partition Π of the leaf nodes of T, it may be said that Π respects T if, for every I∈Π, there is a sub-tree of T, denoted TI, whose leaf nodes are a superset of I, and for every I, J∈Π, the sub-trees TI and TJ do not intersect unless TI=TJ. If, for example, p⃗ is the probability distribution on the leaf nodes of T, one may define p⃗Π to be the induced distribution on the parts of Π (e.g., the probability of I∈Π being the sum of the probabilities of the leaf nodes in I) and TΠ to be the tree obtained from T by collapsing each sub-tree TI to a single leaf node.
One may then establish the following:
One may also use the following axioms, which consider an underlying weighted-tree structure.
It will now be considered how one may derive the recursive definition postulates R1 and R2 from these axioms. Observe that encoded in R1, is the notion that for “flat” trees, the standard Shannon entropy and tree entropy are the same. More concretely, let Sn denote the rooted star graph on n+1 nodes, which consists of a root with n children, each of which is a leaf node. Let S0 be the tree consisting of a single node. Then Axiom 3 using tree Sn is precisely the same as Shannon's third condition, as all partitions of the leaf nodes respect Sn. Furthermore, using tree Sm (for m very large), and utilizing Axiom 4, one may see that Axiom 2 yields Shannon's second condition. Hence, in fact tree entropy on Sn will be precisely Shannon entropy (up to a constant factor). Axiom 5 shows that this constant may be proportional to c(Sn). For convenience, it will be assumed that it is precisely c(Sn). Hence, this presents the base case R1.
With regard to the recursive case, suppose the root of T has children u1, . . . , uk, and let Ti denote the sub-tree rooted at ui, for each iε[k]. Define Π to be the partition of the leaf nodes of T whose ith piece consists of the leaf nodes of sub-tree Ti. Applying Axiom 3, one finds that
Applying Axiom 3 again, this time to TΠ, with partition P′ that puts each leaf node into a separate class. This time, one finds that H(T,
Combining these two leads to the recursive hypothesis R2.
It is next shown that for every cost function c(·), there is a unique tree entropy function that satisfies R1 and R2. For every distribution on the leaf nodes of a given tree, this function agrees with the Shannon entropy when the cost function is equal to 1 for all nodes.
Theorem 1. Let T be a tree with root r, the set of leaf nodes l(T) and cost function c(·). For simplicity, let V(T)\{r} be denoted as V−(T). Then the unique tree entropy function satisfying R1 and R2 may be written as
H(T, p⃗)=−(1/pT)Σν∈V−(T) c(π(ν)) pν log(pν/pπ(ν))  (2)
or, equivalently, as
H(T, p⃗)=−(1/pT)Σν∈V−(T) w(ν) pν log(pν/pT)  (4)
where w(ν)=c(π(ν)) if ν is a leaf node, and w(ν)=c(π(ν))−c(ν) otherwise.
The above theorem exposes two different viewpoints of the same concept. First, tree entropy is shown to depend on the relative probabilities of a (parent, child) pair in Equation (2), weighted by the parent cost (e.g., dependency data). Apart from the cost, this differs from Shannon entropy in a critical way: the probability of a node v is considered only with respect to that of its parent, instead of the total probability over all leaf nodes. This is what accounts for the dependencies that are induced by the hierarchy.
The second viewpoint shows that tree entropy presents a weighted version of entropy, wherein the weights w(ν) depend on the costs of both the node and its parent in Equation (4). Thus, the dependencies induced by the hierarchy are taken into account in the weighting parameters instead of in the probabilities.
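By way of a non-limiting illustration, the following Python sketch evaluates this second viewpoint, e.g., Equation (4) as reconstructed above, deriving the weights w(ν) from a cost function; the dictionary-based node format and function names follow the earlier sketches and are illustrative assumptions.

import math

def mass(node):
    return node["p"] if "children" not in node else sum(mass(ch) for ch in node["children"])

def tree_entropy_closed_form(root, base=10):
    """Sum -w(nu) * p_nu * log(p_nu / pT) over all non-root nodes, then divide by pT."""
    if "children" not in root:
        return 0.0
    p_t = mass(root)
    total = 0.0

    def visit(node, parent_cost):
        nonlocal total
        p_nu = mass(node)
        is_leaf = "children" not in node
        w = parent_cost if is_leaf else parent_cost - node["c"]   # w(nu) per Theorem 1
        if p_nu > 0:
            total += w * p_nu * math.log(p_nu / p_t, base)
        if not is_leaf:
            for ch in node["children"]:
                visit(ch, node["c"])

    for child in root["children"]:
        visit(child, root["c"])
    return -total / p_t

# With all costs equal to 1, only the leaf terms remain and the result is the Shannon entropy.
tree = {"c": 1.0, "children": [
    {"c": 1.0, "children": [{"p": 0.4}, {"p": 0.5}]},
    {"c": 1.0, "children": [{"p": 0.05}, {"p": 0.05}]},
]}
tree_entropy_closed_form(tree)   # approximately 0.44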
As a further illustration of tree entropy as measurable, for example, using Equation (4) as shown above, consider the following example based, at least in part, on the exemplary distributions for items 600 and 602 presented in FIG. 6.
With regard to item 600, dependency data for the “North” inner node may be based, at least in part, on the sum of either the distribution data and/or established dependency data for its children nodes. Here, for example, the children nodes, “San Francisco” and “San Jose”, are both leaf nodes and as such their distribution data may be used to establish dependency data for the North node (e.g., equal to 0.4+0.5=0.9).
Similarly, dependency data for the “South” inner node may be based, at least in part, on the sum of either the distribution data and/or established dependency data for its children nodes. Here, for example, the children nodes, “San Diego” and “Los Angeles”, are both leaf nodes and as such their distribution data may be used to establish dependency data for the South node (e.g., equal to 0.05+0.05=0.10).
Based, at least in part, on such distribution data and established dependency data, Equation (4) for example, may be applied to determine a tree entropy value for item 600. At least one weighting parameter may also be applied to further modify all or part of the established dependency data. Thus, the tree entropy value may, for example, be calculated by performing the summation process per Equation (4) which would sum together the distribution data and dependency data for each node in the tree as determined by various multiplication and logarithmic functions. Here, for example, assuming a weighting parameter of 1, the summation may include:
(1×0.4)log 0.4≈−0.16 (for the San Francisco leaf node),
(1×0.5)log 0.5≈−0.15 (for the San Jose leaf node),
(1×0.05)log 0.05≈−0.07 (for the San Diego leaf node),
(1×0.05)log 0.05≈−0.07 (for the Los Angeles leaf node),
(1×0.9)log 0.9≈−0.04 (for the North inner node),
(1×0.10)log 0.10≈−0.1 (for the South inner node), and
when summed together and multiplied by (−1) produces a tree entropy value of ≈0.59 for item 600.
Similarly, with regard to item 602, assuming a weighting parameter of 1, the summation may include:
(1×0.4)log 0.4≈−0.16 (for the San Francisco leaf node),
(1×0.1)log 0.1≈−0.1 (for the San Jose leaf node),
(1×0.4)log 0.4≈−0.16 (for the San Diego leaf node),
(1×0.1)log 0.1≈−0.1 (for the Los Angeles leaf node),
(1×0.5)log 0.5≈−0.15 (for the North inner node),
(1×0.5)log 0.5≈−0.15 (for the South inner node), and
when summed together and multiplied by (−1) produces a tree entropy value of ≈0.82 for item 602.
Thus, as this example illustrates, based, at least in part, on the tree entropy values measured above, item 600 with a tree entropy value of ≈0.59 appears to be more focused than does item 602 with a tree entropy value of ≈0.82.
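As a brief check of the summations above, the following Python fragment applies a weighting parameter of 1 at every node, as in the example; the unrounded totals come out near 0.58 for item 600 and 0.82 for item 602 (the per-term values round to those listed above), so the comparison between the two items is unchanged.

import math

def weighted_sum(node_probs, weight=1.0, base=10):
    """-sum of weight * p * log(p) over the given node probabilities."""
    return -sum(weight * p * math.log(p, base) for p in node_probs)

item_600 = [0.4, 0.5, 0.05, 0.05, 0.9, 0.10]   # leaf nodes followed by the North and South inner nodes
item_602 = [0.4, 0.1, 0.4, 0.1, 0.5, 0.5]

weighted_sum(item_600)   # approximately 0.58
weighted_sum(item_602)   # approximately 0.82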
A proof of Theorem 1 is as follows. For all trees T, define h(T, p⃗) to be the expression on the right-hand side of Equation (2) (equivalently, Equation (4)).
Next, it will be shown that h(T, p⃗) satisfies R1 and R2.
Notice that,
satisfies R1.
Next, let T be an arbitrary tree with root r and cost function c. Let u1, . . . , uk denote the children of r, and let Ti denote the sub-tree of T rooted at ui for each i∈[k]. As before, let V−(T) denote V(T)\{r}.
where Sk, the star with k leaf nodes, is the subgraph of T restricted to the root and its children with the natural cost function c(Sk)=c(T), and q⃗=(pu1, . . . , puk).
Thus, R2 is satisfied. Hence, the function h(T, p⃗) satisfies both R1 and R2.
It will next be shown that h(·,·) is the unique function satisfying R1 and R2. To this end, suppose that g(·,·) is another function satisfying R1 and R2. Since any function satisfying R1 and R2 must satisfy Equation (1),
where ui and Ti are as above. Now, define Δ(T, p⃗)=h(T, p⃗)−g(T, p⃗).
By R1, since h and g agree on every star graph Sn, Δ(Sn, p⃗)=0 for every p⃗. Proceeding by induction on the height of T and using Equation (1), it follows that Δ(T, p⃗)=0 for every tree T and every p⃗, so that h and g coincide.
It is shown next that (3) follows from
Hence,
Equation (4) follows from (3) by definition.
In this section some exemplary properties that may be satisfied by tree entropy are shown. First, the definition of tree entropy trivially includes Shannon entropy as a special case. The next property notes that because of the normalization in the definition of tree entropy, H(T, p⃗) is invariant under a uniform scaling of the probability vector. More formally, let α>0 and let αp⃗ denote the vector obtained by multiplying each entry of p⃗ by α. Then H(T, αp⃗)=H(T, p⃗), e.g., the tree entropy is unchanged.
This may be proven as follows. Let V
Using the above property, one may extend the tree so that every leaf node is at the same depth, without changing the tree entropy. Thus, one may assume that such trees are leveled.
This may be proven as follows. From Theorem 1,
and by an assumption, w(ν)≧0. Let χν be the vector with entries 1/pT corresponding to the leaf nodes in the sub-tree rooted at ν, and 0 for all other leaf nodes. Each term in the sum is of the form f(
Examples may be constructed to show that if pT is not held constant, H(T, p⃗) need not be a concave function of p⃗.
In this section some exemplary techniques are presented that may be used, for example, in choosing a cost function. The definition of tree entropy presented in the examples above assumes an intrinsic cost function associated with the tree. In these non-limiting examples, the only condition that has been imposed on such exemplary cost functions was that the cost of a node be greater than or equal to that of its children (c(π(ν))≧c(ν)), in order to ensure concavity of the tree entropy (e.g., see Property 5). In this section, some other exemplary properties are presented that tree entropy may satisfy and/or which may drive a choice of an appropriate cost function should one be desired.
Over all probability distributions on a fixed set of n outcomes, the Shannon entropy is maximized by the uniform distribution, e.g., the distribution that corresponds to maximum uncertainty. For tree entropy, however, a distribution at which tree entropy is maximized for a given tree depends not only on the tree structure but also on the cost function c(·). In certain implementations one may, for example, decide to impose conditions on a cost function such that tree entropy is maximized for distributions corresponding to “maximum uncertainty” for the given tree structure.
One may start with the simple case when T is a leveled k-ary tree with n leaf nodes. For this exemplary tree, it may be assumed that the distribution with maximum uncertainty is the uniform distribution on the leaf nodes.
Assume that the probability distribution
The sum of pν over all nodes ν at the same depth from the root is 1 (since pT=1), so that these numbers form a probability distribution for each level. The above expression may therefore be written as the sum of the Shannon entropies of the probability distributions at each level. The Shannon entropy may be maximized by the uniform distribution, so tree entropy for such a cost function may be maximized by the uniform distribution on the leaf nodes (e.g., pi=1/n for all i),
since this distribution leads to a uniform distribution at every level in the tree.
The above argument depended on the fact that the tree was a leveled, k-ary tree. Next, leveled trees are considered, which are not necessarily k-ary. It is first considered which distribution on the leaf nodes corresponds to “maximum uncertainty”.
At any node ν∈V(T), the weight distribution among the children of ν may be maximally uncertain, or most non-coherent, if all the children of ν have equal weights. Labeling the n leaf nodes of T with numbers 1, . . . , n, one may therefore recursively define a probability distribution on the leaf nodes, referred to below as the maximally uncertain distribution, by assigning probability 1 to the root and dividing the probability of each node equally among its b(ν) children, where b(ν) is the number of children of node ν. If the root of T has k children u1, . . . , uk, then pui=1/k for each i∈[k], and so on down the tree.
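By way of a non-limiting illustration, the following Python sketch constructs such a maximally uncertain distribution by equal splitting at each node; the dictionary format with "name" and "children" fields is an illustrative assumption. Note that the resulting distribution on the leaf nodes need not be uniform when the tree is not k-ary.

def max_uncertainty_distribution(node, p=1.0, out=None):
    """Divide the probability at each node equally among its children, down to the leaf nodes."""
    if out is None:
        out = {}
    children = node.get("children", [])
    if not children:                       # leaf node: record its share
        out[node["name"]] = p
        return out
    for child in children:                 # equal split among the b(nu) children
        max_uncertainty_distribution(child, p / len(children), out)
    return out

tree = {"name": "root", "children": [
    {"name": "u1", "children": [{"name": "a"}, {"name": "b"}, {"name": "c"}]},
    {"name": "u2", "children": [{"name": "d"}]},
]}
max_uncertainty_distribution(tree)   # {'a': 1/6, 'b': 1/6, 'c': 1/6, 'd': 1/2}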
It is now considered what conditions one may impose on a cost function so that the maximally uncertain distribution defined above maximizes the tree entropy.
Let Hmax(T)=maxp⃗ H(T, p⃗), where the maximum is taken over all probability distributions p⃗ on the leaf nodes of T. From the recurrence (1), Hmax(T)=maxq⃗ [c(T)·H1(q⃗)+Σi qi·Hmax(Ti)]  (5), where qi=pui/pT, u1, . . . , uk are the children of the root, Ti is the sub-tree rooted at ui, and the maximum is taken over all probability vectors q⃗=(q1, . . . , qk).
Theorem 6. For a tree T with cost function c(·), the following conditions are equivalent: (1) the maximally uncertain distribution defined above attains Hmax(T); (2) for every pair of sub-trees of T whose roots are siblings, the maximum tree entropy Hmax of the two sub-trees is the same; and (3) the sum c(ν0) lg b(ν0)+ . . . +c(νd−1) lg b(νd−1) is the same for every path r=ν0, . . . , νd from the root r to a leaf node of T.
If any of the above holds, then Hmax(T)=c(ν0) lg b(ν0)+ . . . +c(νd−1) lg b(νd−1) for any path r=ν0, . . . , νd from r to a leaf of T.
Here is another way to understand this result. Let T1 and T2 be two sub-trees in T whose roots are siblings. The formula for Hmax(T) and the associated condition on the cost function says that even if the average branching factor in T1 is much larger than that of T2, both T1 and T2 contribute equally to the maximum entropy. In terms of the taxonomy, this means, for example, that at any level of the hierarchy, each node (e.g., an aggregated class) captures the same amount of “uncertainty” (or information) about the item. The fact that T1 has larger branching factor on average only means that on average, the mutual coherence of two siblings in T1 is much less than the mutual coherence of siblings in T2, e.g., T1 makes much finer distinction between classes than T2.
This may be seen mathematically as follows. Define c′(ν)=c(ν) lg b(ν). Then condition (3) of Theorem 6 says that the sum c′(ν0)+ . . . +c′(νd−1) is the same over all paths r=ν0, . . . , νd from the root to a leaf node. By Theorem 1, the formula for tree entropy becomes H(T, p⃗)=−(1/pT)Σν∈V−(T) c′(π(ν)) pν logb(π(ν))(pν/pπ(ν)), where logb(π(ν)) denotes the logarithm taken to the base b(π(ν)). In other words, the base of the logarithm is now the branching factor of the parent, reflecting the fact that one may be as uncertain at nodes with a high branching factor as at nodes with a small one. Another view is that when one encodes messages, one may use a larger alphabet when the branching factor is larger.
Note that, if a node has two (or more) sub-trees, one of which is a leaf node, then condition (3) of Theorem 6 cannot hold unless all of the sub-trees are leaf nodes. Further, if the branching factor at a node, b(ν), is 1, then lg b(ν)=0. Hence, simply extending the leaf node by adding an edge to it cannot solve the problem (since it does not change the sum in condition (3)). In fact, given T, let T′ be the unique graph with the smallest number of edges, over all graphs homeomorphic to T. Then if one of the leaf nodes of T′ has no siblings, then there is no cost function satisfying the theorem. In those cases, it may make sense to redefine where the maximum tree entropy occurs, by ignoring those “only-children leaf nodes.” On the other hand, if all leaf nodes of T′ have siblings, then there should be no such problem.
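By way of a non-limiting illustration, the following Python sketch checks the path condition (3) as reconstructed above for a given tree and cost function; the dictionary node format and function names are illustrative assumptions. The small example at the end shows how a larger branching factor in one sub-tree may be offset by a smaller cost.

import math

def path_sums(node, acc=0.0):
    """Collect the sum of c(nu) * lg b(nu) over internal nodes along every root-to-leaf path."""
    children = node.get("children", [])
    if not children:
        return [acc]                                   # reached a leaf: report this path's sum
    step = node["c"] * math.log2(len(children))        # c(nu) * lg b(nu)
    sums = []
    for child in children:
        sums.extend(path_sums(child, acc + step))
    return sums

def satisfies_condition_3(root, tol=1e-9):
    s = path_sums(root)
    return max(s) - min(s) < tol

# A larger branching factor in one sub-tree may be offset by a smaller cost.
tree = {"c": 1.0, "children": [
    {"c": 0.5, "children": [{} for _ in range(4)]},    # 1*lg 2 + 0.5*lg 4 = 2
    {"c": 1.0, "children": [{} for _ in range(2)]},    # 1*lg 2 + 1.0*lg 2 = 2
]}
satisfies_condition_3(tree)   # True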
A proof of Theorem 6 will now be presented. Throughout, suppose that the root of T has k children u1, . . . , uk, and Ti is the sub-tree rooted at ui for i∈[k]. We let qi=pui/pT, so that (q1, . . . , qk) forms a probability distribution over the children of the root.
First, suppose that condition (1) holds. It may be shown that condition (2) must hold as well, by induction on the height of T. The base case, when T has height 1, follows naturally. So consider a general tree T.
Let,
By Equation (5), Hmax(T)=maxq⃗ f(q⃗).
One may take a partial derivative of f with respect to qt for t<k. Recall that qk=1−(q1+ . . . +qk−1).
Since c(T)≧0, f is a concave function. Hence, f is maximized at the point where all of its partial derivatives are 0. But since condition (1) holds, that will be when qi=1/k for all i∈[k]; setting the partial derivative with respect to qt to zero at that point gives
0=c(T)[−lg k+lg k]+Hmax(Tt)−Hmax(Tk).
That is, Hmax(Tt)=Hmax(Tk). Since this is true for all t, one may see that Hmax(Ti)=Hmax(Tj) for all i,j∈[k]. Hence by Equation (5), for any l∈[k]
Hmax(T)=c(T) lg k+Hmax(Tl)  (6).
Recall Equation (1):
Substitute
Combining this with Equation (6), one may see that,
Hence, H(Tl,
Now assume condition (2) holds. It may be shown that condition (1) must hold, by induction on the height of T. The base case, for T consisting of a single node, follows naturally. So consider a general T.
Let f be as above. Again,
By condition (2), Hmax(Tt)=Hmax(Tk) for all t∈[k]. Hence,
if and only if qt=qk. That is, all the partial derivatives of f are 0 only when qi=1/k for all i∈[k]. Since c(T)≧0, f is concave. So the unique maximum of f occurs at this point. Again, by Equation (5), one may have that Hmax(T)=maxq⃗ f(q⃗).
By induction, one may have that Hmax(Ti)=Hmax(Ti,
Now, suppose that condition (1) holds. It may be shown that condition (3) holds as well. To do this, one may prove by induction on the height of T that for any path r=ν0, . . . , νd from the root of T to a leaf node of T,
The base case is trivial, so consider a general T.
By Equation (6), one may see that Hmax(T)=c(ν0) lg b(ν0)+Hmax(Tl).
Choose l such that Tl is rooted at node ν1. Then by induction,
This shows that
as wanted.
Now, suppose that condition (3) holds. It may again be proven by induction on the height of T that
for any path r=ν0, . . . , νd from r to a leaf node of T. The base case, when T is a single node, follows naturally. So consider a general T.
Let l∈[k], and note that for all paths ul=ν1′, ν2′, . . . , νd′ from the root of Tl to a leaf node of Tl, one may have that (from condition (3)),
is the same. Hence,
is the same over all such paths. Thus, one may apply an inductive hypothesis to Tl. That is,
Consider a path r=ν0, ν1, . . . , νt from r to a leaf of T such that ν1=uj. Then, by condition (3),
That is, Hmax(Tj)=Hmax(Tl), for all j, l∈[k]. Hence,
Let U, V be sub-trees of T with roots x, y, respectively, with x and y siblings. Let r=ν0, ν1, . . . , νd=π(x) be the path from r to the parent of x (which is also the parent of y). Let x=x0, x1, . . . , xs be a path from x to a leaf node of U, and let y=y0, y1, . . . , yt be a path from y to a leaf of V. Then, by condition (3) and the claim just proved,
Thus, condition (2) follows.
To finish the proof of the theorem, notice that, as just shown, condition (3) implies that
for any path r=ν0, . . . , νd from r to a leaf node of T.
It is now shown how one may generalize the notion of KL-divergence (see, e.g., S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951) to tree entropy; this aspect may be referred to as “tree divergence”.
Since the KL-divergence is a measure of the similarity of two probability distributions over the same alphabet, one may think of tree divergence as dealing with two probability distributions over the same tree with the same cost function. The argument presented here may, for example, be generalized to distributions over different trees; the results are less intuitive.
Recall that the KL-divergence can be defined in terms of the Bregman divergence (see, e.g., L. M. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967). For any concave, continuously-differentiable function f, the Bregman divergence of f, denoted Bf(·∥·), is defined as Bf(p⃗∥q⃗)=f(q⃗)−f(p⃗)+⟨∇f(q⃗), p⃗−q⃗⟩.
The KL-divergence is defined as the Bregman divergence of the entropy function, e.g., KL(p⃗∥q⃗)=BH1(p⃗∥q⃗)=Σi pi log(pi/qi).
Notice that one may assume that the distributions are normalized (e.g., Σi pi=Σi qi=1) and ignore that constraint when taking a derivative.
Likewise, one may define the tree divergence as the Bregman divergence of the tree entropy function, where one may ignore the normalization. Fix a tree T, and denote the tree divergence for tree T by KLT(·∥·). For convenience, assume that p⃗ and q⃗ are normalized so that pT=qT=1.
Theorem 7. Let V−(T) denote V(T)\{r} as before, and for each node ν let pν and qν denote the probability mass induced at ν by p⃗ and q⃗, respectively. Then the tree divergence may be written as KLT(p⃗∥q⃗)=Σν∈V−(T) w(ν) pν log(pν/qν), where w(·) is as in Theorem 1.
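By way of a non-limiting illustration, the following Python sketch evaluates the tree divergence expression as reconstructed above for two normalized distributions over the leaf nodes of the same tree with the same cost function; the node format, field names, and function names are illustrative assumptions.

import math

def node_mass(node, dist):
    """Probability mass induced at node by a distribution given on the leaf names."""
    if "children" not in node:
        return dist[node["name"]]
    return sum(node_mass(ch, dist) for ch in node["children"])

def tree_divergence(root, p, q, base=10):
    """Sum w(nu) * p_nu * log(p_nu / q_nu) over all non-root nodes of the tree."""
    total = 0.0

    def visit(node, parent_cost):
        nonlocal total
        is_leaf = "children" not in node
        w = parent_cost if is_leaf else parent_cost - node["c"]   # w(nu) as in Theorem 1
        p_nu, q_nu = node_mass(node, p), node_mass(node, q)
        if p_nu > 0:
            total += w * p_nu * math.log(p_nu / q_nu, base)
        if not is_leaf:
            for ch in node["children"]:
                visit(ch, node["c"])

    for child in root["children"]:
        visit(child, root["c"])
    return total

# With all costs equal to 1 this reduces to the standard KL-divergence of the leaf distributions.
tree = {"c": 1.0, "children": [
    {"c": 1.0, "children": [{"name": "SF"}, {"name": "SJ"}]},
    {"c": 1.0, "children": [{"name": "SD"}, {"name": "LA"}]},
]}
p = {"SF": 0.4, "SJ": 0.5, "SD": 0.05, "LA": 0.05}   # leaf distribution of item 600
q = {"SF": 0.4, "SJ": 0.1, "SD": 0.4, "LA": 0.1}     # leaf distribution of item 602
tree_divergence(tree, p, q)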
A proof of Theorem 7 will now be presented. Recall that
One may first calculate ∇Φ, where Φ(p⃗)=−Σν∈V−(T) w(ν) pν log pν denotes the (unnormalized) tree entropy regarded as a function of the leaf probabilities. Recall that if ν lies in the path from the root to leaf node i, then the partial derivative of pν with respect to pi is 1; otherwise it is 0. Let pathi be the set of nodes in the path from the root of T to the leaf node i, not including the root itself. One may then have that the ith entry of ∇Φ(q⃗) is −Σν∈pathi w(ν)(log qν+log e).
Thus,
In this section, an additional interpretation of the definition of tree entropy is presented via an exemplary generative model. Here, it will be assumed that tree T has exactly n leaf nodes and c(T)=1.
First, consider a very straightforward generative model. Starting at the root of T, move to one of its children, with the probability of going to child u exactly pu. Once arriving at this new node u, go to one of its children, with the probability of going to child ν equal to pν/pu. Repeat this until a leaf node is reached. At this point, output the name of that node. Repeating this process over and over, it is easy to see that this generates a string of leaf names, with the probability of outputting leaf ν equal to pν. So the entropy of this sequence is just the Shannon entropy of the distribution p⃗ on the leaf nodes.
One extension of this would be to output the entire path taken. But it is not hard to see that the entropy of the sequence generated in this way is precisely the same as the entropy of the sequence consisting only of leaf names, since each leaf name uniquely determines the path to the root.
Rather than simply outputting the entire path from root to leaf, suppose that it is desired to output, for example, the fourth node in the path. For instance, in a classifier and taxonomy example, one might desire a classification of the element with some specified level of granularity. Items that are close together in the tree may look identical at coarse levels of granularity, while items that are far from each other in the tree may still be different. More specifically, choose a path as above, e.g., ν0, ν1, . . . , νd, where ν0 is the root and νd is a leaf node. Output exactly one of ν1, . . . , νd, with the probability of outputting νi equal to w(νi). Recall that w(νi)=c(νi−1)−c(νi) for i<d and w(νd)=c(νd−1). Notice that, since it was assumed c(T)=1, the sum of these probabilities is exactly 1. Here, for example, when outputting a node name one may also record on which level it is.
Transmitting the sequence of node names generated by repeating this process, assuming that both the transmitter and the receiver know from which level each node name came, leads to the following.
Put another way, tree entropy for T is equal to the Shannon entropy of the above sequence, conditioned on knowing the level for the ith node name produced, for all i.
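By way of a non-limiting illustration, the following Python sketch simulates the generative model described above: a root-to-leaf path is sampled using the relative probabilities, and one node of the path is emitted with probability w(νi). The dictionary node format, field names, and function names are illustrative assumptions, and the costs are assumed to satisfy c(T)=1 so that the w values along any path sum to 1.

import random

def mass(node):
    return node["p"] if "children" not in node else sum(mass(ch) for ch in node["children"])

def generate_symbol(root):
    """Sample a root-to-leaf path, then emit one node of the path with probability w(nu)."""
    path, node = [], root
    while "children" in node:
        children = node["children"]
        child = random.choices(children, weights=[mass(ch) for ch in children])[0]
        w = node["c"] - child["c"] if "children" in child else node["c"]   # w(nu_i) as above
        path.append((child, w))
        node = child
    nodes, ws = zip(*path)                    # these w values sum to c(T) = 1 along the path
    return random.choices(nodes, weights=ws)[0]["name"]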
In the foregoing detailed description the notion of entropy of a distribution specified on the leaf nodes of a tree has been systematically developed. As shown, this definition may be a unique solution to a small collection of axioms and may be a strict generalization of Shannon entropy. Tree entropy, for example, may be adapted for a variety of different data processing tasks, such as data mining applications, including classification, clustering, taxonomy management, and the like.
While certain exemplary techniques have been described and shown herein using various methods and systems, it should be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from claimed subject matter. Additionally, many modifications may be made to adapt a particular situation to the teachings of claimed subject matter without departing from the central concept described herein. Therefore, it is intended that claimed subject matter not be limited to the particular examples disclosed, but that such claimed subject matter may also include all implementations falling within the scope of the appended claims, and equivalents thereof.