1. Field
The subject matter disclosed herein relates to data processing, and more particularly to data processing methods and systems that measure entropy and/or otherwise utilize entropy measurements.
2. Information
Data processing tools and techniques continue to improve. Information in the form of data is continually being generated or otherwise identified, collected, stored, shared, and analyzed. Databases and other like data repositories are commonplace, as are related communication networks and computing resources that provide access to such information.
The Internet is ubiquitous; the World Wide Web provided by the Internet continues to grow with new information seemingly being added every second. To provide access to such information, tools and services are often provided which allow for the copious amounts of information to be searched through in an efficient manner. For example, service providers may allow for users to search the World Wide Web or other like networks using search engines. Similar tools or services may allow for one or more databases or other like data repositories to be searched.
With so much information being available, there is a continuing need for methods and systems that allow for pertinent information to be located or otherwise identified in an efficient manner.
Non-limiting and non-exhaustive aspects are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various figures unless otherwise specified.
Techniques are provided herein that may be used to allow for pertinent information to be located or otherwise identified in an efficient manner. These techniques may, for example, allow for more efficient searching of items that may be classified into a taxonomy having a hierarchical structure by measuring entropy associated with the classification distribution and inherent hierarchical dependency.
First device 102, second device 104 and third device 106, as shown in FIG. 1, may each be representative of any device, appliance, or machine that may be configurable to exchange data over network 108.
Similarly, network 108, as shown in FIG. 1, may be representative of one or more communication links, processes, and/or resources configurable to support the exchange of data between first device 102, second device 104, and third device 106.
As illustrated, for example, by the dashed-lined box shown as being partially obscured by third device 106, there may be additional like devices operatively coupled to network 108.
It is recognized that all or part of the various devices and networks shown in system 100, and the processes and methods as further described herein, may be implemented using or otherwise include hardware, firmware, software, or any combination thereof.
Thus, by way of example but not limitation, second device 104 may include at least one processing unit 120 that is operatively coupled to a memory 122 through a bus 128.
Processing unit 120 is representative of one or more circuits configurable to perform at least a portion of a data computing procedure or process. By way of example but not limitation, processing unit 120 may include one or more processors, controllers, microprocessors, microcontrollers, application specific integrated circuits, digital signal processors, programmable logic devices, field programmable gate arrays, and the like, or any combination thereof.
Memory 122 is representative of any data storage mechanism. Memory 122 may include, for example, a primary memory 124 and/or a secondary memory 126. Primary memory 124 may include, for example, a random access memory, read only memory, etc. While illustrated in this example as being separate from processing unit 120, it should be understood that all or part of primary memory 124 may be provided within or otherwise co-located/coupled with processing unit 120.
Secondary memory 126 may include, for example, the same or similar type of memory as primary memory and/or one or more data storage devices or systems, such as, for example, a disk drive, an optical disc drive, a tape drive, a solid state memory drive, etc. In certain implementations, secondary memory 126 may be operatively receptive of, or otherwise configurable to couple to, a computer-readable medium 128. Computer-readable medium 128 may include, for example, any medium that can carry and/or make accessible data, code and/or instructions for one or more of the devices in system 100.
Second device 104 may include, for example, a communication interface 130 that provides for or otherwise supports the operative coupling of second device 104 to at least network 108. By way of example but not limitation, communication interface 130 may include a network interface device or card, a modem, a router, a switch, a transceiver, and the like.
Second device 104 may include, for example, an input/output 132. Input/output 132 is representative of one or more devices or features that may be configurable to accept or otherwise introduce human and/or machine inputs, and/or one or more devices or features that may be configurable to deliver or otherwise provide for human and/or machine outputs. By way of example but not limitation, input/output device 132 may include an operatively configured display, speaker, keyboard, mouse, trackball, touch screen, data port, etc.
With regard to system 100, in certain implementations first device 102 may be configurable, for example, using a browser or other like application, to seek the assistance of second device 104 by providing or otherwise identifying a query that second device 104 may then process. For example, one such query may be associated with a search engine provider service provided by or otherwise associated with second device 104. In response to such a query, for example, second device 104 may then provide or otherwise identify a query response that first device may then process.
Here, for example, to process such a query second device may be configured to access stored data associated with various items that may be available within system 100 and which may be of interest or otherwise associated with information included within the query. The stored data may, for example, include data that identifies the item, its location, etc. By way of example but not limitation, the item may include a document or web page that is accessible from, or otherwise made available by, third device 106 as part of the World Wide Web portion of the Internet.
Continuing with this example, second device 104 may be configured to examine the stored data in such a manner as to identify one or more items deemed to be relevant to the query. By way of example but not limitation, second device 104 may be configurable to select items deemed relevant to such a query based, at least in part, on scores assigned to or otherwise associated with potential candidate items. Such scores (e.g., PageRank, etc.) and/or other like useful search engine data may, for example, result from other processes conducted by second device 104 or other devices. For example, one or more devices may be configurable to identify items, classify items, and/or score the items as needed to provide or maintain additional (e.g., perhaps local) stored data that may be accessed by a search engine in response to a query.
Reference is now made to FIG. 2, which illustrates an exemplary process 200 that may, for example, be implemented in whole or in part within system 100.
Process 200 may, for example, include at least one item identifying procedure 202 that generates or otherwise identifies item data 204. By way of example but not limitation, item identifying procedure 202 may include one or more web crawlers or other like processes that communicate with applicable devices coupled to network 108 and operate to gather information about items available through or otherwise made accessible over network 108 by such devices. Such processes and other like processes are well known and beyond the scope of the present subject matter.
Item data 204 may, for example, include information about the item such as identifying information, location information, etc. Item data 204 may, for example, include all or a portion of the text or words associated with information that may be included in the item.
As used herein, the term “item” is meant to include any form or type of data that may be communicated. By way of example but not limitation, an item may include all or part of one or more web pages, documents, files, databases, objects, messages, queries, and the like, or any combination thereof.
Process 200 may, for example, include at least one classifying procedure 206 that accesses item data 204 and generates or otherwise identifies taxonomic data 208 associated with the item. By way of example but not limitation, classifying procedure 206 may be configurable to classify all or part of item data 204 into a taxonomy having a hierarchical structure. For example, at least a portion of one exemplary taxonomy may include a tree or sub-tree structure having a root node that is superior to one or more levels comprising one or more inner nodes that are superior to a plurality of leaf nodes. Classifying procedure 206 may, for example, be configurable to assign distribution data 208a to such leaf nodes. For example, in certain implementations distribution data 208a may include a distribution value (e.g., a normalized value) or the like that is assigned to a leaf node. In other implementations, for example, distribution data 208a may include a probability associated with individual leaf nodes.
Taxonomic data 208 may, for example, include dependency data 208b that is associated with the hierarchical structure. For example, dependency data 208b may include data associated with the distribution and/or arrangement of inner nodes within the hierarchical structure.
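By way of a non-limiting illustration, the following Python sketch shows one possible in-memory representation of such taxonomic data, with distribution data 208a assigned to leaf nodes and dependency data 208b established for an inner node as the sum over its children (anticipating the exemplary distribution for item 600 discussed later). The nested-dictionary format, field names, and function name are illustrative assumptions rather than features of any particular described implementation.

def establish_dependency_data(node):
    """Return the probability mass at node, filling in inner-node values bottom-up."""
    if "children" not in node:          # leaf node: distribution data 208a already assigned
        return node["p"]
    node["p"] = sum(establish_dependency_data(child) for child in node["children"])
    return node["p"]

# Hypothetical taxonomic data for one item, using the distribution of item 600 discussed later.
taxonomic_data = {
    "name": "root",
    "children": [
        {"name": "North", "children": [{"name": "San Francisco", "p": 0.4},
                                       {"name": "San Jose", "p": 0.5}]},
        {"name": "South", "children": [{"name": "San Diego", "p": 0.05},
                                       {"name": "Los Angeles", "p": 0.05}]},
    ],
}

establish_dependency_data(taxonomic_data)   # North -> 0.9, South -> 0.10, root -> 1.0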
An entropy measurement procedure 210 may be configurable to access taxonomic data 208 and generate or otherwise identify entropic data 212 associated with the taxonomic data and hence the item data. As illustrated in FIG. 2, entropic data 212 may include, for example, at least one tree entropy value 212a.
Entropy measurement procedure 210 may be configurable to access distribution data 208a and to either access and/or otherwise establish dependency data 208b (e.g., as shown within entropy measurement procedure 210). Dependency data 208b may, for example, be established based, at least in part, on the hierarchical structure, or an applicable portion thereof, as per the taxonomy applied by classifying procedure 206 and with consideration of the distribution data 208a.
As illustrated, entropy measurement procedure 210 may, for example, include the application of at least one cost function 226 in establishing dependency data 208b. As illustrated, entropy measurement procedure 210 may, for example, include the application of at least one weighting parameter 228 in establishing dependency data 208b. Several exemplary weighting parameters and cost functions, e.g., which may be used to establish weighting parameters, are described in greater detail below.
Also, as described in greater detail below, a tree entropy operation or formula may, by way of example but not limitation, be applied by entropy measurement procedure 210 such that the resulting entropic data 212 provides a measure of the extent to which the item is topic-focused with regard to the topic of the taxonomy.
In certain implementations, all or portions of dependency data 208b may be provided in taxonomic data 208, for example, as generated by classifying procedure 206 or the like. For example, it may be beneficial for classifying procedure 206 to be further configurable to perform at least some of the processing associated with the establishment of dependency data 208b (e.g., while establishing distribution data 208a). In other implementations, for example, all or portions of dependency data 208b may be established by measurement procedure 210.
With respect to exemplary process 200, entropic data 212, which may include, for example, tree entropy value 212a, may then be provided or otherwise made accessible to an item scoring procedure 214. Item scoring procedure 214 may, for example, be configurable to establish or otherwise identify item score data 218. Item scoring procedure 214 may, for example, be configurable to establish item score data 218 based, at least in part, on entropic data 212 and one or more other parameters 216 (e.g., a PageRank or related metric(s), etc.). In certain implementations, for example, item score data 218 may include a single numerical score associated with the item identified in item data 204.
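As a non-limiting sketch of how item scoring procedure 214 might combine such inputs, the following Python fragment blends a tree entropy value with one other parameter into a single numerical score; the particular combination, the function name score_item, and the parameter alpha are illustrative assumptions and not a formula prescribed herein.

def score_item(tree_entropy, other_score, alpha=0.5):
    """Combine a tree entropy value (lower = more topic-focused) with another parameter."""
    focus = 1.0 / (1.0 + tree_entropy)          # map the entropy to a (0, 1] focus factor
    return alpha * focus + (1.0 - alpha) * other_score

item_score = score_item(tree_entropy=0.59, other_score=0.8)   # a single numerical score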
A search engine procedure 220 may be configurable to receive or otherwise access item score data 218 and based, at least in part, on item score data 218 provide or otherwise identify a query response 224 in response to a query 222.
Thus, as illustrated in the preceding example, in accordance with certain aspects of the methods and systems presented herein, entropy measurement techniques or resulting entropic measurements may be used to possibly refine or otherwise further support in some manner a data query, search engine, or other like data processing service, system, and/or device.
Reference is now made to FIG. 3, which illustrates another exemplary process 300 that may, for example, be implemented in whole or in part within system 100.
With this example, it is illustrated that entropy measurement techniques or resulting entropic measurements may be used to possibly test or otherwise study the performance of classifying procedure 206. Thus, for example, item data 304 may be carefully selected or otherwise specifically created to “focus” within a given taxonomy in a desired manner. For example, item data 304 may be expected to be very focused or, conversely, barely focused with respect to the taxonomy. As such, once classifying procedure 206 has generated taxonomic data 308, entropy measurement procedure 210 may be employed to generate entropic data 312, which may then be examined to judge the performance of classifying procedure 206.
Attention is now drawn to FIG. 4, which illustrates an exemplary process 400 that may, for example, be implemented in whole or in part within system 100.
As shown in FIG. 4, process 400 may, for example, include classifying procedure 206 that accesses item data 204 and establishes taxonomic data 208, and a classifying procedure 406 that accesses second item data 404 and establishes taxonomic data 408. Here, for example, the classifying procedures 206 and 406 may be the same or different. Process 400 may include, for example, a divergence measurement procedure 402 (which may include an entropy measurement procedure 210) that accesses taxonomic data 208 and taxonomic data 408 to establish a divergence value 410. Process 400 may include, for example, a search engine procedure 220 that accesses at least the divergence value 410 in generating a query response 412 in response to query 222.
In process 400, divergence measurement procedure 402 may, for example, be configurable to measure similarity between the item associated with item data 204 and the second item associated with second item data 404. This measurement may be provided in divergence value 410, and may be used by search engine procedure 220 to adjust or otherwise affect query response 412. For example, in certain implementations, second item data may include or otherwise be based, at least in part, on query 222 such that the resulting tree divergence value 410 may represent how similar the item associated with item data 204 is to the query. In certain situations, it may be desirable for query response 412 to identify some items that do not appear to match as closely as other items that are identified. Thus, for example, if query 222 includes the term “mouse”, then it may be beneficial for the query response to identify some items that appear to focus on an “animal” mouse and others that appear to focus on “computer hardware” related mouse devices.
At this point attention is drawn to FIG. 5, which is a flow diagram illustrating exemplary operations that may, for example, be implemented in whole or in part within system 100.
In 502, an item may be identified for classification into a taxonomy having a hierarchical structure. In 504, the item may be classified and taxonomic data including at least distribution data established. In 506, entropic data for the item may be determined based, at least in part, on the distribution data and established dependency data (e.g., associated with the distribution and hierarchical structure). In 508, a tree entropy value may be identified. In 510, a score value may be determined, for example, based, at least in part, on the tree entropy value from 508 and/or the entropic data from 506.
In 514, a second item may be identified for classification into the same taxonomy having the same hierarchical structure. In 516, the second item may be classified and taxonomic data including distribution data established. In 518, entropic data for the second item may be determined based, at least in part, on the distribution data and established dependency data. In 520, a tree entropy value may be identified. In 510, a score value may be determined, for example, based, at least in part, on the tree entropy value from 520 and/or entropic data from 518.
In 512, a divergence value may be determined based, at least in part, on the entropic data from 506 and 518. In 510, a score value may be determined, for example, based, at least in part, on the divergence value from 512.
In the following sections, certain exemplary techniques are described that may be used to measure or otherwise determine and/or utilize the entropy of a distribution that takes into account the hierarchical structure of a taxonomy. For example, a formal treatment of “tree entropy” is provided that may be used or otherwise adapted for use in system 100 or portions thereof.
As previously illustrated, one exemplary application of tree entropy may be in the classification of information, such as, where an item may be distributed over various leaf nodes of a given topic taxonomy and it may be desirable to measure or otherwise determine an extent to which the item is topic-focused.
As used herein, entropy refers to a fundamental measure of the uncertainty represented by a probability distribution. By way of example, given a discrete distribution p⃗=(p1, . . . , pn), the Shannon entropy H(p⃗) may be defined as H(p⃗)=−Σi pi log pi.
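By way of a non-limiting illustration, the following short Python sketch computes this quantity; base-10 logarithms are assumed here simply to match the numerical example presented later, and the input values shown correspond to the leaf distribution of item 600 discussed below.

import math

def shannon_entropy(p, base=10):
    """Shannon entropy H(p) = -sum_i p_i log p_i; zero entries contribute nothing."""
    return -sum(p_i * math.log(p_i, base) for p_i in p if p_i > 0)

shannon_entropy([0.4, 0.5, 0.05, 0.05])   # approximately 0.44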
Assuming that a given item has membership in each of n classes (e.g., as assigned by a classifying procedure), in accordance with certain aspects of the methods and systems presented herein, it may be useful to determine to what extent the item is “focused” with respect to the classes. Here, by way of example but not limitation, such an item may be considered “focused” if its membership is “scattered” as little as possible among all the classes.
One approach might be to interpret the membership of the document in each of the n classes as a probability distribution, and use the Shannon entropy of this distribution as a measure of its focus.
However, consider a scenario in which the n classes have some relationship among them; for instance, the classes might represent the leaf nodes of a tree (or sub-tree) that corresponds to a geographical taxonomy.
In this example, item 600 on the left in FIG. 6 has most of its distribution concentrated within a single sub-tree (e.g., the “North” sub-tree), while item 602 on the right in FIG. 6 has its distribution spread more evenly across the sub-trees; a measure based on Shannon entropy alone ignores this hierarchical relationship among the classes and thus may not adequately capture how topic-focused each item is.
Thus, in accordance with certain aspects of the present subject matter, a more principled and/or systematic technique has been developed that may provide for methods and systems that consider entropic properties of a distribution on a hierarchical structure, such as, for example, dependency data associated with the hierarchical structure of a tree, sub-tree, or the like.
In the following sections an exemplary definition of “tree entropy” is provided by first postulating a set of axioms for tree entropy; these are generalizations of Shannon's axioms to a tree case. The set of axioms leads to a recursive definition from which an explicit functional form of tree entropy may be derived which satisfies the desired axioms. Several interesting properties of tree entropy will be described which tend to demonstrate the robustness of the definition. For example, tree entropy may be invariant under simple transformations of the tree and scaling of the probability distribution. Under an additional yet reasonable assumption on a cost function, for example, tree entropy may be a concave function. Further, under certain conditions tree entropy may be maximized for distributions corresponding to “maximum uncertainty” for the given tree structure. Still further, as will be described, a generalization of KL-divergence may be derived for tree entropy, for example, in the situation wherein two probability distributions over the same tree have the same cost function. Additionally, as shown below, an interpretation of tree entropy may be made, for example, by means of a model for generating symbols (e.g., in the form of or otherwise associated with dependency data).
Specifying natural requirements via a set of axioms and pinning down the functions satisfying these axioms has often resulted in fundamental insights for many problems, some well-known ones being the axioms for voting (see, e.g., K. Arrow. Social Choice and Individual Values (2nd Ed.). Yale University Press, 1963), clustering (see, e.g., J. Kleinberg. An impossibility theorem for clustering. In Proceedings of the 16th Conference on Neural Information Processing Systems, 2002), and PageRank (see, e.g., A. Altman and M. Tennenholtz. Ranking systems: The PageRank axioms. In Proceedings of the 6th ACM Conference on Electronic Commerce, pages 1-8, 2005).
While these so-called axiomatic approaches have often been used to refute the existence of an ideal procedure in these problems, as shown below, the result for tree entropy appears to be different in that, after formulating certain rules, one may construct a function that uniquely satisfies them.
In accordance with certain embodiments, tree entropy may, for example, be adapted to measure a cohesiveness of an item when it is classified into a taxonomy. Thus, for example, tree entropy may be used to determine how focused or unfocused such an item is on a topic. One example of such an implementation is shown in FIG. 2.
In accordance with certain other embodiments, tree entropy may, for example, be adapted for use in measuring the performance of a classifying procedure. Thus, for example, given an item that is considered to be well focused, one may use tree entropy to measure how well the classifying procedure performs in terms of placing such an item at the leaf nodes of a taxonomy hierarchy. One example of such an implementation is shown in FIG. 3.
In accordance with still other embodiments, as a consequence of a generalization of KL-divergence to trees, tree entropy may, for example, be adapted to measure similarity between a first item and a second item (e.g., a document and a query, respectively), wherein both the items are classified into the same taxonomy by one or more classifying procedures. This may be useful, for example, with search and retrieval services, or the like. One example of such an implementation is shown in FIG. 4.
An exemplary definition of tree entropy will now be developed in more specificity.
A rooted tree may be denoted by T, and its nodes by V(T). For each node ν of T, let π(ν) and C(ν) denote the parent node and the set of children nodes of ν, respectively. Nodes with empty C(ν) are the leaf nodes of T, denoted by l(T). Each tree T with n leaf nodes may have a set of probabilities p1, . . . , pn associated with the corresponding leaf nodes, which may be denoted by the vector p⃗=(p1, . . . , pn).
For simplicity one may use pT to denote the probability associated with the root of the tree T.
Associated with each node ν∈V(T) is a non-negative real cost cT(ν). For simplicity of notation, c(T) is used to denote the cost of the root of tree T. If T′ is a sub-tree of T, the cost function for T′ will be the natural restriction of that for T, e.g., cT′(ν)=cT(ν) for all nodes ν∈V(T′). One may drop the subscript and denote the cost function simply as c(·).
The tree entropy for tree T and probability vector p⃗ may be denoted by H(T, p⃗).
One may denote the Shannon entropy (or simply entropy) of a distribution by H1(p⃗). If the distribution is not normalized, e.g., if Σi pi=P, then one may define H1(p⃗)=−Σi (pi/P)log(pi/P).
For simplicity, the recursive definition of tree entropy will be presented first. After that, it will be shown how the definition actually arises from a set of axioms similar to the original entropy axioms by Shannon.
The recursive definition of tree entropy may include the base case R1, and the recursive hypothesis R2 that utilizes the structure of the tree.
R1. Base case (e.g., a “flat” tree): For all n-dimensional p⃗, H(Sn, p⃗)=c(Sn)·H1(p⃗), where H1(p⃗) denotes the Shannon entropy of p⃗ as defined above and Sn denotes a star graph consisting of a root connected to n leaf nodes.
R2. Inductive case (e.g., with inner nodes, in terms of children): Let the root of T have children u1, . . . , uk, and let Ti denote the sub-tree rooted at ui, for each i∈[k]. Let Sk be a star graph, whose root is the root of T and whose leaf nodes are u1, . . . , uk. Further, let c(Sk)=c(T). Then for all p⃗, H(T, p⃗)=H(Sk, q⃗)+Σi (pui/pT)·H(Ti, p⃗i), where q⃗=(pu1, . . . , puk) and p⃗i denotes the restriction of p⃗ to the leaf nodes of Ti.
Notice that R1 and R2 together provide the recurrence H(T, p⃗)=c(T)·H1(q⃗)+Σi (pui/pT)·H(Ti, p⃗i)  (1), with q⃗ and p⃗i as above.
Note that R1 essentially implies that for a tree (or sub-tree) with a single node, the tree entropy for that tree (or its restriction to a sub-tree) is trivially zero, irrespective of the probability of the node and its cost. For a “flat” tree (or sub-tree) of a root connected only to leaf nodes, the tree provides no additional information separating any set of leaf nodes from the rest, implying that each leaf is completely separate from the others. In this case, as R1 points out, the tree entropy reduces to Shannon entropy (e.g., to within the constant factor c(Sn)). R2 may be used, for example, to compute tree entropy by recursively using the base case: e.g., the tree entropy for a tree (or sub-tree) is the sum of those of its children sub-trees, plus the additional entropy incurred in the distribution of the probability at the root among its children. The costs at each node may be used in determining the effect of the tree structure on the final form of the tree entropy. As described below, in certain implementations, setting all node costs to one (=1) may reduce the results to Shannon entropy, while other cost functions may allow a tree entropy formulation to satisfy additional tree-specific desiderata.
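By way of a non-limiting illustration, the following Python sketch applies the recurrence described above (as reconstructed in R1, R2, and Equation (1)), with a tree node represented as a dictionary carrying a cost "c" and either a leaf probability "p" or a list of "children"; this representation, the function names, and the use of base-10 logarithms are illustrative assumptions.

import math

def mass(node):
    """Total probability pT of the (sub-)tree rooted at node."""
    return node["p"] if "children" not in node else sum(mass(ch) for ch in node["children"])

def h1(q, base=10):
    """Shannon entropy of a possibly unnormalized vector q, per the H1 definition above."""
    total = sum(q)
    return -sum((x / total) * math.log(x / total, base) for x in q if x > 0)

def tree_entropy(node, base=10):
    """Recursive tree entropy per R1/R2: a single node contributes zero (base case)."""
    if "children" not in node:
        return 0.0
    q = [mass(ch) for ch in node["children"]]    # probabilities of the children u1..uk
    p_t = sum(q)
    # R2: star-graph term c(T)*H1(q) plus the probability-weighted entropies of the sub-trees.
    return node["c"] * h1(q, base) + sum((qi / p_t) * tree_entropy(ch, base)
                                         for qi, ch in zip(q, node["children"]))

# For a "flat" tree the result reduces to c(Sn) times the Shannon entropy, per R1.
flat = {"c": 1.0, "children": [{"p": 0.25} for _ in range(4)]}
tree_entropy(flat)   # log10(4), approximately 0.60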
Several axioms associated with tree entropy will now be introduced. It may not be immediately clear why R1 and R2 are the “right” rules to use in order to define tree entropy. However, as will be shown, they arise as consequences of Shannon's original axioms on entropy, modified to handle hierarchical structures, such as, e.g., trees.
Shannon's seminal paper (e.g., C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948) gave three desiderata, from which the uniqueness (up to a constant factor) of informational entropy was derived. Firstly, the entropy will be a continuous function in the pi. Secondly, if there are n possible outcomes, all of which are equally likely (e.g., pi=1/n for all i), then the entropy is monotonically increasing in n. Thirdly, let Π be a partition of the possible outcomes, and for each I∈Π, let p⃗I denote the restriction of p⃗ to the outcomes in I; then the entropy of p⃗ equals the entropy of the induced distribution on the parts of Π plus the sum, over I∈Π, of the probability of part I times the entropy of the normalized p⃗I.
It will now be shown that one may model requirements after these conditions, and establish a recursive definition of tree entropy. Here, one may use the first condition essentially without modification and alter the second and third conditions to respect an underlying hierarchical structure (e.g., of the tree, etc.). For the second condition, one may modify it by restricting attention to leaf nodes that are siblings of each other. In a modification of the third condition, to respect the hierarchical structure, one may restrict the set of allowable partitions; for example to only allow partitions that do not “cross” sub-tree boundaries.
Formally, given a tree T, and a partition Π of the leaf nodes of T, it may be said that Π respects T if, for every I∈Π, there is a sub-tree of T, denoted TI, whose leaf nodes are a superset of I, and for every I, J∈Π, the sub-trees TI and TJ do not intersect unless TI=TJ. If, for example, p⃗ is the probability distribution on the leaf nodes of T, one may define p⃗Π to be the induced distribution on the parts of Π (e.g., the probability of I∈Π being the sum of the probabilities of the leaf nodes in I) and TΠ to be the tree obtained from T by collapsing each sub-tree TI to a single leaf node.
One may then establish the following:
One may also use the following axioms, which consider an underlying weighted-tree structure.
It will now be considered how one may derive the recursive definition postulates R1 and R2 from these axioms. Observe that encoded in R1, is the notion that for “flat” trees, the standard Shannon entropy and tree entropy are the same. More concretely, let Sn denote the rooted star graph on n+1 nodes, which consists of a root with n children, each of which is a leaf node. Let S0 be the tree consisting of a single node. Then Axiom 3 using tree Sn is precisely the same as Shannon's third condition, as all partitions of the leaf nodes respect Sn. Furthermore, using tree Sm (for m very large), and utilizing Axiom 4, one may see that Axiom 2 yields Shannon's second condition. Hence, in fact tree entropy on Sn will be precisely Shannon entropy (up to a constant factor). Axiom 5 shows that this constant may be proportional to c(Sn). For convenience, it will be assumed that it is precisely c(Sn). Hence, this presents the base case R1.
With regard to the recursive case, suppose the root of T has children u1, . . . , uk, and let Ti denote the sub-tree rooted at ui, for each iε[k]. Define Π to be the partition of the leaf nodes of T whose ith piece consists of the leaf nodes of sub-tree Ti. Applying Axiom 3, one finds that
Applying Axiom 3 again, this time to TΠ, with partition P′ that puts each leaf node into a separate class. This time, one finds that H(T,
Combining these two leads to the recursive hypothesis R2.
It is next shown that for every cost function c(·), there is a unique tree entropy function that satisfies R1 and R2. For every distribution on the leaf nodes of a given tree, this function agrees with the Shannon entropy when the cost function is equal to 1 for all nodes.
Theorem 1. Let T be a tree with root r, the set of leaf nodes l(T) and cost function c(·). For simplicity, let V(T)\{r} be denoted as V−(T). Then the unique tree entropy function satisfying R1 and R2 may be written as
H(T, p⃗)=−(1/pT)Σν∈V−(T) c(π(ν)) pν log(pν/pπ(ν))  (2)
or, equivalently, as
H(T, p⃗)=−(1/pT)Σν∈V−(T) w(ν) pν log(pν/pT)  (4)
where w(ν)=c(π(ν)) if ν is a leaf node, and w(ν)=c(π(ν))−c(ν) otherwise.
The above theorem exposes two different viewpoints of the same concept. First, tree entropy is shown to depend on the relative probabilities of a (parent, child) pair in Equation (2), weighted by the parent cost (e.g., dependency data). Apart from the cost, this differs from Shannon entropy in a critical way: the probability of a node v is considered only with respect to that of its parent, instead of the total probability over all leaf nodes. This is what accounts for the dependencies that are induced by the hierarchy.
The second viewpoint shows that tree entropy presents a weighted version of entropy, wherein the weights w(ν) depend on the costs of both the node and its parent in Equation (4). Thus, the dependencies induced by the hierarchy are taken into account in the weighting parameters instead of in the probabilities.
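By way of a non-limiting illustration, the following Python sketch evaluates this second viewpoint, e.g., Equation (4) as reconstructed above, deriving the weights w(ν) from a cost function; the dictionary-based node format and function names follow the earlier sketches and are illustrative assumptions.

import math

def mass(node):
    return node["p"] if "children" not in node else sum(mass(ch) for ch in node["children"])

def tree_entropy_closed_form(root, base=10):
    """Sum -w(nu) * p_nu * log(p_nu / pT) over all non-root nodes, then divide by pT."""
    if "children" not in root:
        return 0.0
    p_t = mass(root)
    total = 0.0

    def visit(node, parent_cost):
        nonlocal total
        p_nu = mass(node)
        is_leaf = "children" not in node
        w = parent_cost if is_leaf else parent_cost - node["c"]   # w(nu) per Theorem 1
        if p_nu > 0:
            total += w * p_nu * math.log(p_nu / p_t, base)
        if not is_leaf:
            for ch in node["children"]:
                visit(ch, node["c"])

    for child in root["children"]:
        visit(child, root["c"])
    return -total / p_t

# With all costs equal to 1, only the leaf terms remain and the result is the Shannon entropy.
tree = {"c": 1.0, "children": [
    {"c": 1.0, "children": [{"p": 0.4}, {"p": 0.5}]},
    {"c": 1.0, "children": [{"p": 0.05}, {"p": 0.05}]},
]}
tree_entropy_closed_form(tree)   # approximately 0.44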
As a further illustration of tree entropy as measurable, for example, using Equation (4) as shown above, consider the following example based, at least in part, on the exemplary distributions for items 600 and 602 presented in FIG. 6.
With regard to item 600, dependency data for the “North” inner node may be based, at least in part, on the sum of either the distribution data and/or established dependency data for its children nodes. Here, for example, the children nodes, “San Francisco” and “San Jose”, are both leaf nodes and as such their distribution data may be used to establish dependency data for the North node (e.g., equal to 0.4+0.5=0.9).
Similarly, dependency data for the “South” inner node may be based, at least in part, on the sum of either the distribution data and/or established dependency data for its children nodes. Here, for example, the children nodes, “San Diego” and “Los Angeles”, are both leaf nodes and as such their distribution data may be used to establish dependency data for the South node (e.g., equal to 0.05+0.05=0.10).
Based, at least in part, on such distribution data and established dependency data, Equation (4) for example, may be applied to determine a tree entropy value for item 600. At least one weighting parameter may also be applied to further modify all or part of the established dependency data. Thus, the tree entropy value may, for example, be calculated by performing the summation process per Equation (4) which would sum together the distribution data and dependency data for each node in the tree as determined by various multiplication and logarithmic functions. Here, for example, assuming a weighting parameter of 1, the summation may include:
(1×0.4)log 0.4≈−0.16 (for the San Francisco leaf node),
(1×0.5)log 0.5≈−0.15 (for the San Jose leaf node),
(1×0.05)log 0.05≈−0.07 (for the San Diego leaf node),
(1×0.05)log 0.05≈−0.07 (for the Los Angeles leaf node),
(1×0.9)log 0.9≈−0.04 (for the North inner node),
(1×0.10)log 0.10≈−0.1 (for the South inner node), and
when summed together and multiplied by (−1) produces a tree entropy value of ≈0.59 for item 600.
Similarly, with regard to item 602, assuming a weighting parameter of 1, the summation may include:
(1×0.4)log 0.4≈−0.16 (for the San Francisco leaf node),
(1×0.1)log 0.1≈−0.1 (for the San Jose leaf node),
(1×0.4)log 0.4≈−0.16 (for the San Diego leaf node),
(1×0.1)log 0.1≈−0.1 (for the Los Angeles leaf node),
(1×0.5)log 0.5≈−0.15 (for the North inner node),
(1×0.5)log 0.5≈−0.15 (for the South inner node), and
when summed together and multiplied by (−1) produces a tree entropy value of ≈0.82 for item 602.
Thus, as this example illustrates, based, at least in part, on the tree entropy values measured above, item 600 with a tree entropy value of ≈0.59 appears to be more focused than does item 602 with a tree entropy value of ≈0.82.
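As a brief check of the summations above, the following Python fragment applies a weighting parameter of 1 at every node, as in the example; the unrounded totals come out near 0.58 for item 600 and 0.82 for item 602 (the per-term values round to those listed above), so the comparison between the two items is unchanged.

import math

def weighted_sum(node_probs, weight=1.0, base=10):
    """-sum of weight * p * log(p) over the given node probabilities."""
    return -sum(weight * p * math.log(p, base) for p in node_probs)

item_600 = [0.4, 0.5, 0.05, 0.05, 0.9, 0.10]   # leaf nodes followed by the North and South inner nodes
item_602 = [0.4, 0.1, 0.4, 0.1, 0.5, 0.5]

weighted_sum(item_600)   # approximately 0.58
weighted_sum(item_602)   # approximately 0.82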
A proof of Theorem 1 is as follows. For all trees T, define h(T, p⃗) to be the expression on the right-hand side of Equation (2) (equivalently, Equation (4)).
Next, it will be shown that h(T, p⃗) satisfies R1 and R2.
Notice that,
satisfies R1.
Next, let T be an arbitrary tree with root r and cost function c. Let u1, . . . , uk denote the children of r, and let Ti denote the sub-tree of T rooted at ui for each i∈[k]. As before, let V−(T) denote V(T)\{r}.
where Sk, the star with k leaf nodes, is the subgraph of T restricted to the root and its children with the natural cost function c(Sk)=c(T), and q⃗=(pu1, . . . , puk).
Thus, R2 is satisfied. Hence, the function h(T, p⃗) satisfies both R1 and R2.
It will next be shown that h(·,·) is the unique function satisfying R1 and R2. To this end, suppose that g(·,·) is another function satisfying R1 and R2. Since any function satisfying R1 and R2 must satisfy Equation (1),
where ui and Ti are as above. Now, define Δ(T, p⃗)=h(T, p⃗)−g(T, p⃗).
By R1, since h and g agree on every star graph Sn, Δ(Sn, p⃗)=0 for every p⃗. Proceeding by induction on the height of T and using Equation (1), it follows that Δ(T, p⃗)=0 for every tree T and every p⃗, so that h and g coincide.
It is shown next that (3) follows from
Hence,
Equation (4) follows from (3) by definition.
In this section some exemplary properties that may be satisfied by tree entropy are shown. First, the definition of tree entropy trivially includes Shannon entropy as a special case. The next property notes that because of the normalization in the definition of tree entropy, H(T, p⃗) is invariant under a uniform scaling of the probability vector. More formally, let α>0 and let αp⃗ denote the vector obtained by multiplying each entry of p⃗ by α. Then H(T, αp⃗)=H(T, p⃗), e.g., the tree entropy is unchanged.
This may be proven as follows. Let V
Using the above property, one may extend the tree so that every leaf node is at the same depth, without changing the tree entropy. Thus, one may assume that such trees are leveled.
This may be proven as follows. From Theorem 1,
and by an assumption, w(ν)≧0. Let χν be the vector with entries 1/pT corresponding to the leaf nodes in the sub-tree rooted at ν, and 0 for all other leaf nodes. Each term in the sum is of the form f(
Examples may be constructed to show that if pT is not held constant, H(T, p⃗) need not be a concave function of p⃗.
In this section some exemplary techniques are presented that may be used, for example, in choosing a cost function. The definition of tree entropy presented in the examples above assumes an intrinsic cost function associated with the tree. In these non-limiting examples, the only condition that has been imposed on such exemplary cost functions was that the cost of a node be greater than or equal to that of its children (c(π(ν))≧c(ν)), in order to ensure concavity of the tree entropy (e.g., see Property 5). In this section, some other exemplary properties are presented that tree entropy may satisfy and/or which may drive a choice of an appropriate cost function should one be desired.
Over all probability distributions on a fixed set of n outcomes, the Shannon entropy is maximized by the uniform distribution, e.g., the distribution that corresponds to maximum uncertainty. For tree entropy, however, a distribution at which tree entropy is maximized for a given tree depends not only on the tree structure but also on the cost function c(·). In certain implementations one may, for example, decide to impose conditions on a cost function such that tree entropy is maximized for distributions corresponding to “maximum uncertainty” for the given tree structure.
One may start with the simple case when T is a leveled k-ary tree with n leaf nodes. For this exemplary tree, it may be assumed that the distribution with maximum uncertainty is the uniform distribution on the leaf nodes.
Assume that the probability distribution
The sum of pν over all nodes ν at the same depth from the root is 1 (since pT=1), so that these numbers form a probability distribution for each level. The above expression may therefore be written as the sum of the Shannon entropies of the probability distributions at each level. The Shannon entropy may be maximized by the uniform distribution, so tree entropy for such a cost function may be maximized by the uniform distribution on the leaf nodes (e.g., pi=1/n for all i),
since this distribution leads to a uniform distribution at every level in the tree.
The above argument depended on the fact that the tree was a leveled, k-ary tree. Next, leveled trees are considered, which are not necessarily k-ary. It is first considered which distribution on the leaf nodes corresponds to “maximum uncertainty”.
At any node ν∈V(T), the weight distribution among the children of ν may be maximally uncertain, or most non-coherent, if all the children of ν have equal weights. Labeling the n leaf nodes of T with numbers 1, . . . , n, one may therefore recursively define a probability distribution on the leaf nodes, referred to below as the maximally uncertain distribution, by assigning probability 1 to the root and dividing the probability of each node equally among its b(ν) children, where b(ν) is the number of children of node ν. If the root of T has k children u1, . . . , uk, then pui=1/k for each i∈[k], and so on down the tree.
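By way of a non-limiting illustration, the following Python sketch constructs such a maximally uncertain distribution by equal splitting at each node; the dictionary format with "name" and "children" fields is an illustrative assumption. Note that the resulting distribution on the leaf nodes need not be uniform when the tree is not k-ary.

def max_uncertainty_distribution(node, p=1.0, out=None):
    """Divide the probability at each node equally among its children, down to the leaf nodes."""
    if out is None:
        out = {}
    children = node.get("children", [])
    if not children:                       # leaf node: record its share
        out[node["name"]] = p
        return out
    for child in children:                 # equal split among the b(nu) children
        max_uncertainty_distribution(child, p / len(children), out)
    return out

tree = {"name": "root", "children": [
    {"name": "u1", "children": [{"name": "a"}, {"name": "b"}, {"name": "c"}]},
    {"name": "u2", "children": [{"name": "d"}]},
]}
max_uncertainty_distribution(tree)   # {'a': 1/6, 'b': 1/6, 'c': 1/6, 'd': 1/2}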
It is now considered what conditions one may impose on a cost function so that the maximally uncertain distribution defined above maximizes the tree entropy.
Let Hmax(T)=maxp⃗ H(T, p⃗), where the maximum is taken over all probability distributions p⃗ on the leaf nodes of T. From the recurrence (1), Hmax(T)=maxq⃗ [c(T)·H1(q⃗)+Σi qi·Hmax(Ti)]  (5), where qi=pui/pT, u1, . . . , uk are the children of the root, Ti is the sub-tree rooted at ui, and the maximum is taken over all probability vectors q⃗=(q1, . . . , qk).
Theorem 6. For a tree T with cost function c(·), the following conditions are equivalent: (1) the maximally uncertain distribution defined above attains Hmax(T); (2) for every pair of sub-trees of T whose roots are siblings, the maximum tree entropy Hmax of the two sub-trees is the same; and (3) the sum c(ν0) lg b(ν0)+ . . . +c(νd−1) lg b(νd−1) is the same for every path r=ν0, . . . , νd from the root r to a leaf node of T.
If any of the above holds, then Hmax(T)=c(ν0) lg b(ν0)+ . . . +c(νd−1) lg b(νd−1) for any path r=ν0, . . . , νd from r to a leaf of T.
Here is another way to understand this result. Let T1 and T2 be two sub-trees in T whose roots are siblings. The formula for Hmax(T) and the associated condition on the cost function says that even if the average branching factor in T1 is much larger than that of T2, both T1 and T2 contribute equally to the maximum entropy. In terms of the taxonomy, this means, for example, that at any level of the hierarchy, each node (e.g., an aggregated class) captures the same amount of “uncertainty” (or information) about the item. The fact that T1 has larger branching factor on average only means that on average, the mutual coherence of two siblings in T1 is much less than the mutual coherence of siblings in T2, e.g., T1 makes much finer distinction between classes than T2.
This may be seen mathematically as follows. Define c′(ν)=c(ν) lg b(ν). Then condition (3) of Theorem 6 says that the sum c′(ν0)+ . . . +c′(νd−1) is the same over all paths r=ν0, . . . , νd from the root to a leaf node. By Theorem 1, the formula for tree entropy becomes H(T, p⃗)=−(1/pT)Σν∈V−(T) c′(π(ν)) pν logb(π(ν))(pν/pπ(ν)), where logb(π(ν)) denotes the logarithm taken to the base b(π(ν)). In other words, the base of the logarithm is now the branching factor of the parent, reflecting the fact that one may be as uncertain at nodes with a high branching factor as at nodes with a small one. Another view is that when one encodes messages, one may use a larger alphabet when the branching factor is larger.
Note that, if a node has two (or more) sub-trees, one of which is a leaf node, then condition (3) of Theorem 6 cannot hold unless all of the sub-trees are leaf nodes. Further, if the branching factor at a node, b(ν), is 1, then lg b(ν)=0. Hence, simply extending the leaf node by adding an edge to it cannot solve the problem (since it does not change the sum in condition (3)). In fact, given T, let T′ be the unique graph with the smallest number of edges, over all graphs homeomorphic to T. Then if one of the leaf nodes of T′ has no siblings, then there is no cost function satisfying the theorem. In those cases, it may make sense to redefine where the maximum tree entropy occurs, by ignoring those “only-children leaf nodes.” On the other hand, if all leaf nodes of T′ have siblings, then there should be no such problem.
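By way of a non-limiting illustration, the following Python sketch checks the path condition (3) as reconstructed above for a given tree and cost function; the dictionary node format and function names are illustrative assumptions. The small example at the end shows how a larger branching factor in one sub-tree may be offset by a smaller cost.

import math

def path_sums(node, acc=0.0):
    """Collect the sum of c(nu) * lg b(nu) over internal nodes along every root-to-leaf path."""
    children = node.get("children", [])
    if not children:
        return [acc]                                   # reached a leaf: report this path's sum
    step = node["c"] * math.log2(len(children))        # c(nu) * lg b(nu)
    sums = []
    for child in children:
        sums.extend(path_sums(child, acc + step))
    return sums

def satisfies_condition_3(root, tol=1e-9):
    s = path_sums(root)
    return max(s) - min(s) < tol

# A larger branching factor in one sub-tree may be offset by a smaller cost.
tree = {"c": 1.0, "children": [
    {"c": 0.5, "children": [{} for _ in range(4)]},    # 1*lg 2 + 0.5*lg 4 = 2
    {"c": 1.0, "children": [{} for _ in range(2)]},    # 1*lg 2 + 1.0*lg 2 = 2
]}
satisfies_condition_3(tree)   # True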
A proof of Theorem 6 will now be presented. Throughout, suppose that the root of T has k children u1, . . . , uk, and Ti is the sub-tree rooted at ui for i∈[k]. We let qi=pui/pT, so that (q1, . . . , qk) forms a probability distribution over the children of the root.
First, suppose that condition (1) holds. It may be shown that condition (2) must hold as well, by induction on the height of T. The base case, when T has height 1, follows naturally. So consider a general tree T.
Let,
By Equation (5), Hmax(T)=maxq⃗ f(q⃗).
One may take a partial derivative of f with respect to qt for t<k. Recall that qk=1−(q1+ . . . +qk−1).
Since c(T)≧0, f is a concave function. Hence, f is maximized at the point where all of its partial derivatives are 0. But since condition (1) holds, that will be when qi=1/k for all i∈[k]; setting the partial derivative with respect to qt to zero at that point gives
0=c(T)[−lg k+lg k]+Hmax(Tt)−Hmax(Tk).
That is, Hmax(Tt)=Hmax(Tk). Since this is true for all t, one may see that Hmax(Ti)=Hmax(Tj) for all i,j∈[k]. Hence by Equation (5), for any l∈[k]
Hmax(T)=c(T) lg k+Hmax(Tl)  (6).
Recall Equation (1):
Substitute
Combining this with Equation (6), one may see that,
Hence, H(Tl,
Now assume condition (2) holds. It may be shown that condition (1) must hold, by induction on the height of T. The base case, for T consisting of a single node, follows naturally. So consider a general T.
Let f be as above. Again,
By condition (2), Hmax(Tt)=Hmax(Tk) for all t∈[k]. Hence,
if and only if qt=qk. That is, all the partial derivatives of f are 0 only when qi=1/k for all i∈[k]. Since c(T)≧0, f is concave. So the unique maximum of f occurs at this point. Again, by Equation (5), one may have that Hmax(T)=maxq⃗ f(q⃗).
By induction, one may have that Hmax(Ti)=Hmax(Ti,
Now, suppose that condition (1) holds. It may be shown that condition (3) holds as well. To do this, one may prove by induction on the height of T that for any path r=ν0, . . . , νd from the root of T to a leaf node of T,
The base case is trivial, so consider a general T.
By Equation (6), one may see that Hmax(T)=c(ν0) lg b(ν0)+Hmax(Tl).
Choose l such that Tl is rooted at node ν1. Then by induction,
This shows that
as wanted.
Now, suppose that condition (3) holds. It may again be proven by induction on the height of T that
for any path r=ν0, . . . , νd from r to a leaf node of T. The base case, when T is a single node, follows naturally. So consider a general T.
Let l∈[k], and note that for all paths ul=ν1′, ν2′, . . . , νd′ from the root of Tl to a leaf node of Tl, one may have that (from condition (3)),
is the same. Hence,
is the same over all such paths. Thus, one may apply an inductive hypothesis to Tl. That is,
Consider a path r=ν0, ν1, . . . , νt from r to a leaf of T such that ν1=uj. Then, by condition (3),
That is, Hmax(Tj)=Hmax(Tl), for all j, l∈[k]. Hence,
Let U, V be sub-trees of T with roots x, y, respectively, with x and y siblings. Let r=ν0, ν1, . . . , νd=π(x) be the path from r to the parent of x (which is also the parent of y). Let x=x0, x1, . . . , xs be a path from x to a leaf node of U, and let y=y0, y1, . . . , yt be a path from y to a leaf of V. Then, by condition (3) and the claim just proved,
Thus, condition (2) follows.
To finish the proof of the theorem, notice that, as just shown, condition (3) implies that
for any path r=ν0, . . . , νd from r to a leaf node of T.
It is now shown how one may generalize the notion of KL-divergence (see, e.g., S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951) to tree entropy; this aspect may be referred to as “tree divergence”.
Since the KL-divergence is a measure of the similarity of two probability distributions over the same alphabet, one may think of tree divergence as dealing with two probability distributions over the same tree with the same cost function. The argument presented here may, for example, be generalized to distributions over different trees; the results are less intuitive.
Recall that the KL-divergence can be defined in terms of the Bregman divergence (see, e.g., L. M. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967). For any concave, continuously-differentiable function f, the Bregman divergence of f, denoted Bf(·∥·), is defined as Bf(p⃗∥q⃗)=f(q⃗)−f(p⃗)+⟨∇f(q⃗), p⃗−q⃗⟩.
The KL-divergence is defined as the Bregman divergence of the entropy function, e.g., KL(p⃗∥q⃗)=BH1(p⃗∥q⃗)=Σi pi log(pi/qi).
Notice that one may assume that the distributions are normalized (e.g., Σi pi=Σi qi=1) and ignore that constraint when taking a derivative.
Likewise, one may define the tree divergence as the Bregman divergence of the tree entropy function, where one may ignore the normalization. Fix a tree T, and denote the tree divergence for tree T by KLT(·∥·). For convenience, assume that p⃗ and q⃗ are normalized so that pT=qT=1.
Theorem 7. Let V−(T) denote V(T)\{r} as before, and for each node ν let pν and qν denote the probability mass induced at ν by p⃗ and q⃗, respectively. Then the tree divergence may be written as KLT(p⃗∥q⃗)=Σν∈V−(T) w(ν) pν log(pν/qν), where w(·) is as in Theorem 1.
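By way of a non-limiting illustration, the following Python sketch evaluates the tree divergence expression as reconstructed above for two normalized distributions over the leaf nodes of the same tree with the same cost function; the node format, field names, and function names are illustrative assumptions.

import math

def node_mass(node, dist):
    """Probability mass induced at node by a distribution given on the leaf names."""
    if "children" not in node:
        return dist[node["name"]]
    return sum(node_mass(ch, dist) for ch in node["children"])

def tree_divergence(root, p, q, base=10):
    """Sum w(nu) * p_nu * log(p_nu / q_nu) over all non-root nodes of the tree."""
    total = 0.0

    def visit(node, parent_cost):
        nonlocal total
        is_leaf = "children" not in node
        w = parent_cost if is_leaf else parent_cost - node["c"]   # w(nu) as in Theorem 1
        p_nu, q_nu = node_mass(node, p), node_mass(node, q)
        if p_nu > 0:
            total += w * p_nu * math.log(p_nu / q_nu, base)
        if not is_leaf:
            for ch in node["children"]:
                visit(ch, node["c"])

    for child in root["children"]:
        visit(child, root["c"])
    return total

# With all costs equal to 1 this reduces to the standard KL-divergence of the leaf distributions.
tree = {"c": 1.0, "children": [
    {"c": 1.0, "children": [{"name": "SF"}, {"name": "SJ"}]},
    {"c": 1.0, "children": [{"name": "SD"}, {"name": "LA"}]},
]}
p = {"SF": 0.4, "SJ": 0.5, "SD": 0.05, "LA": 0.05}   # leaf distribution of item 600
q = {"SF": 0.4, "SJ": 0.1, "SD": 0.4, "LA": 0.1}     # leaf distribution of item 602
tree_divergence(tree, p, q)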
A proof of Theorem 7 will now be presented. Recall that
One may first calculate ∇Φ, where Φ(p⃗)=−Σν∈V−(T) w(ν) pν log pν denotes the (unnormalized) tree entropy regarded as a function of the leaf probabilities. Recall that if ν lies in the path from the root to leaf node i, then the partial derivative of pν with respect to pi is 1; otherwise it is 0. Let pathi be the set of nodes in the path from the root of T to the leaf node i, not including the root itself. One may then have that the ith entry of ∇Φ(q⃗) is −Σν∈pathi w(ν)(log qν+log e).
Thus,
In this section, an additional interpretation of the definition of tree entropy is presented via an exemplary generative model. Here, it will be assumed that tree T has exactly n leaf nodes and c(T)=1.
First, consider a very straightforward generative model. Starting at the root of T, move to one of its children, with the probability of going to child u exactly pu. Once arriving at this new node u, go to one of its children, with the probability of going to child ν equal to pν/pu. Repeat this until a leaf node is reached. At this point, output the name of that node. Repeating this process over and over, it is easy to see that this generates a string of leaf names, with the probability of outputting leaf ν equal to pν. So the entropy of this sequence is just the Shannon entropy of the distribution p⃗ on the leaf nodes.
One extension of this would be to output the entire path taken. But it is not hard to see that the entropy of the sequence generated in this way is precisely the same as the entropy of the sequence consisting only of leaf names, since each leaf name uniquely determines the path to the root.
Rather than simply outputting the entire path from root to leaf, suppose that it is desired to output, for example, the fourth node in the path. For instance, in a classifier and taxonomy example, one might desire a classification of the element with some specified level of granularity. Items that are close together in the tree may look identical at coarse levels of granularity, while items that are far from each other in the tree may still be different. More specifically, choose a path as above, e.g., ν0, ν1, . . . , νd, where ν0 is the root and νd is a leaf node. Output exactly one of ν1, . . . , νd, with the probability of outputting νi equal to w(νi). Recall that w(νi)=c(νi−1)−c(νi) for i<d and w(νd)=c(νd−1). Notice that, since it was assumed c(T)=1, the sum of these probabilities is exactly 1. Here, for example, when outputting a node name one may also record on which level it is.
Transmitting the sequence of node names generated by repeating this process, assuming that both the transmitter and the receiver know from which level each node name came, leads to the following.
Put another way, tree entropy for T is equal to the Shannon entropy of the above sequence, conditioned on knowing the level for the ith node name produced, for all i.
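By way of a non-limiting illustration, the following Python sketch simulates the generative model described above: a root-to-leaf path is sampled using the relative probabilities, and one node of the path is emitted with probability w(νi). The dictionary node format, field names, and function names are illustrative assumptions, and the costs are assumed to satisfy c(T)=1 so that the w values along any path sum to 1.

import random

def mass(node):
    return node["p"] if "children" not in node else sum(mass(ch) for ch in node["children"])

def generate_symbol(root):
    """Sample a root-to-leaf path, then emit one node of the path with probability w(nu)."""
    path, node = [], root
    while "children" in node:
        children = node["children"]
        child = random.choices(children, weights=[mass(ch) for ch in children])[0]
        w = node["c"] - child["c"] if "children" in child else node["c"]   # w(nu_i) as above
        path.append((child, w))
        node = child
    nodes, ws = zip(*path)                    # these w values sum to c(T) = 1 along the path
    return random.choices(nodes, weights=ws)[0]["name"]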
In the foregoing detailed description the notion of entropy of a distribution specified on the leaf nodes of a tree has been systematically developed. As shown, this definition may be a unique solution to a small collection of axioms and may be a strict generalization of Shannon entropy. Tree entropy, for example, may be adapted for a variety of different data processing tasks, such as data mining applications, including classification, clustering, taxonomy management, and the like.
While certain exemplary techniques have been described and shown herein using various methods and systems, it should be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from claimed subject matter. Additionally, many modifications may be made to adapt a particular situation to the teachings of claimed subject matter without departing from the central concept described herein. Therefore, it is intended that claimed subject matter not be limited to the particular examples disclosed, but that such claimed subject matter may also include all implementations falling within the scope of the appended claims, and equivalents thereof.