1. Field of the Invention
This invention relates to hierarchical clustering of objects, and more particularly, to methods, systems, and articles of manufacture for soft hierarchical clustering of objects based on a co-occurrence of object pairs.
2. Background of the Invention
The attractiveness of data categorization continues to grow, based largely on the availability of data through a number of access mediums, such as the Internet. As the popularity of such mediums increases, so has the responsibility of data providers to offer quick and efficient access to data. Accordingly, these providers have incorporated various techniques to ensure data may be efficiently accessed. One such technique is the organization of data using clustering. Clustering allows data to be hierarchically grouped (or clustered) based on its characteristics. The premise behind such clustering techniques is that objects, such as text data in documents, that are similar to each other are placed in a common cluster in a hierarchy. For example, subject catalogs offered by data providers such as Yahoo™ may categorize data by creating a hierarchy of clusters where general category clusters are located at top levels and lower level cluster leaves are associated with more specific topics.
Although conventional organization techniques, such as hierarchical clustering, allow common objects to be grouped together, the resultant hierarchy generally includes a hard assignment of objects to clusters. A hard assignment refers to the practice of assigning objects to only one cluster in the hierarchy. This form of assignment limits the potential for an object, such as a textual document, to be associated with more than one cluster. For example, in a system that generates topics for a document collection, a hard assignment of a document (object) to a cluster (topic) prevents the document from being included in other clusters (topics). As can be seen, hierarchical clustering techniques that result in hard assignments of objects, such as text data, may prevent these objects from being effectively located during particular operations, such as text searches on a document collection.
It is therefore desirable to have a method and system for hierarchically clustering objects such that any given object may be assigned to more than one cluster in a hierarchy.
Methods, systems, and articles of manufacture consistent with certain principles related to the present invention enable a computing system to receive a collection of documents, each document including a plurality of words, and assign portions of a document to one or more clusters in a hierarchy based on a co-occurrence of each portion with one or more words included in the document. Methods, systems, and articles of manufacture consistent with certain principles related to the present invention may perform the assignment features described above by defining each document in a collection as a first object (e.g., “i”) and the words of a given document as a second object (e.g., “j”). Initially, the collection may be assigned to a single class that may represent a single root cluster of a hierarchy. A modified Expectation-Maximization (EM) process consistent with certain principles related to the present invention may be performed based on each object pair (i,j) defined within the class until the root class splits into two child classes. Each child class is then subjected to the same modified EM process until the respective child class splits again into two more child classes. The process repeats until selected constraints associated with the hierarchy have been met, such as when the hierarchy reaches a maximum number of leaf clusters. The resultant hierarchy may include clusters that each include objects that were assigned to other clusters in the hierarchy, including clusters that are not ancestors of each other.
Additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of methods, systems, and articles of manufacture consistent with features of the present invention. The aspects of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several aspects of the invention and together with the description, serve to explain the principles of the invention. In the drawings,
Methods, systems, and articles of manufacture consistent with features and principles of the present invention enable a computing system to perform soft hierarchical clustering of a document collection such that any document may be assigned to more than one topic in a topical hierarchy, based on words included in the document.
Methods, systems, and articles of manufacture consistent with features of the present invention may perform the above functions by implementing a modified Expectation-Maximization (EM) process on object pairs reflecting documents and words, respectively, such that a given class of the objects ranges over all nodes of a topical hierarchy and the assignment of a document to a topic may be based on any ancestor of the given class. Moreover, the assignment of a given document to any topic in the hierarchy may be based on the particular (document, word) pair under consideration during the process. Methods, systems, and articles of manufacture consistent with certain principles related to the present invention may perform the modified EM process for every child class that is generated from an ancestor class until selected constraints associated with the topical hierarchy are met. A representation of the resultant hierarchy of topical clusters may be created and made available to entities that request the topics of the document collection.
Reference will now be made in detail to the exemplary aspects of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
The above-noted features and other aspects and principles of the present invention may be implemented in various environments. Such environments and related applications may be specially constructed for performing the various processes and operations of the invention or they may include a general purpose computer or computing platform selectively activated or reconfigured by program code to provide the necessary functionality. The processes disclosed herein are not inherently related to any particular computer or other apparatus, and may be implemented by a suitable combination of hardware, software, and/or firmware. For example, various general purpose machines may be used with programs written in accordance with teachings of the invention, or it may be more convenient to construct a specialized apparatus or system to perform the required methods and techniques.
The present invention also relates to computer readable media that include program instructions or program code for performing various computer-implemented operations based on the methods and processes of the invention. The program instructions may be those specially designed and constructed for the purposes of the invention, or they may be of the kind well-known and available to those having skill in the computer software arts. Examples of program instructions include, for example, machine code, such as produced by a compiler, and files containing high-level code that can be executed by the computer using an interpreter.
Processor 102 may be any general-purpose or dedicated processor known in the art that performs logical and mathematical operations consistent with certain features related to the present invention. Although
Main memory 104 and supplemental memory 106 may be any known type of storage device that stores data. Main memory 104 and supplemental memory 106 may include, but are not limited to, magnetic, semiconductor, and/or optical type storage devices. Supplemental memory 106 may also be a storage device that allows processor 102 quick access to data, such as a cache memory. In one configuration consistent with selected features related to the present invention, main memory 104 and supplemental memory 106 may store data to be clustered, clustered data, and/or program instructions to implement methods consistent with certain features related to the present invention.
Bus 108 may be a single and/or multiple bus configuration that allows data to be transferred between components of computer 100 and external components, such as the input/output devices comprising keyboard 110, display 112, network connector 114, and mass storage 116. Keyboard 110 may allow a user of the computing system environment to interact with computer 100 and may be replaced and/or supplemented by other input devices, such as a mouse, touchscreen components, or the like. Display 112 may present information to the user as is known in the art. Network connector 114 may be any known connection device that allows computer 100 to connect to, and exchange information with, a network such as a local-area network or the Internet. Mass storage 116 may be any known storage device external to computer 100 that stores data. Mass storage 116 may comprise magnetic, semiconductor, optical, and/or tape type storage devices and may store data to be clustered, clustered data, and/or program instructions that may be executed by processor 102 to perform methods consistent with certain features related to the present invention.
It should be noted that the configuration of the computing system environment shown in
Computer 100 may be configured to perform soft hierarchical clustering of objects, such as textual documents that each include a plurality of words. There are several ways soft hierarchical clustering may be performed, such as using maximum likelihood and a deterministic variant of the Expectation-Maximization (EM) algorithm. The maximum likelihood technique aims at finding parameter values that maximize the likelihood of the observed data, and is a natural framework for clustering techniques. The EM algorithm is a known algorithm used to learn the parameters of a probabilistic model within the maximum likelihood framework. Additional description of the EM algorithm may be found in G. J. McLachlan and T. Krishnan, “The EM Algorithm and Extensions,” Wiley, New York, 1997, which is hereby incorporated by reference. A variant of the EM algorithm, known as deterministic annealing EM, performs hierarchical clustering of objects. In certain instances, however, such hierarchical clustering may result in the hard assignment of the objects. Additional information on deterministic annealing EM may be found in Rose et al., “Statistical Mechanics and Phase Transitions in Clustering,” Physical Review Letters, Vol. 65, No. 8, American Physical Society, Aug. 20, 1990, pages 945-48, which is hereby incorporated by reference.
Deterministic annealing EM presents several advantages over the standard EM algorithm. The following is a brief description of this variant of the EM algorithm.
Deterministic Annealing EM
Given an observable data sample x (x ∈ X), with density p(x; Θ), where Θ is the parameter of the density distribution to be estimated, there exists a measure space Y of unobservable data y (y ∈ Y) that corresponds to x.
Furthermore, given incomplete data samples {X = xr | r = 1, . . . , L}, the goal of the EM algorithm is to compute the maximum likelihood estimate of Θ that maximizes the likelihood function. This amounts to maximizing the complete data log-likelihood function, noted Lc, which is defined as:

Lc(Θ; X) = Σr=1, . . . , L log p(xr, yr; Θ)
Furthermore, the iterative procedure, which, starting with an initial estimate of Θ, alternates the following two steps, has been shown to converge to a local maximum of the (complete data) log-likelihood function. This procedure is called the EM algorithm.
E-Step: Compute the Q-function as:
Q(Θ; Θ(t)) = E(Lc(Θ; X) | X, Θ(t))
M-Step: Set Θ(t+1) to the value of Θ that maximizes Q(Θ; Θ(t)).
By substituting for Lc(Θ; X), Q(Θ; Θ(t)) may be rewritten as:
And, because
Q(Θ; Θ(t)) may be obtained, and written as:
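For concreteness, the alternation of the E- and M-steps described above can be sketched for a simple one-dimensional, two-component Gaussian mixture. This is an illustrative example only; the function names and the mixture model are not part of the invention, which applies the modified process to document/word pairs.

```python
import math
import random

def em_gaussian_mixture(data, iters=50):
    # Standard EM for a 1-D two-component Gaussian mixture (illustrative only).
    w = [0.5, 0.5]                       # initial estimate Theta(0): weights
    mu = [min(data), max(data)]          # means
    var = [1.0, 1.0]                     # variances
    for _ in range(iters):
        # E-step: posterior responsibility p(y|x; Theta(t)) of each component.
        resp = []
        for x in data:
            dens = [w[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in (0, 1)]
            z = sum(dens)
            resp.append([d / z for d in dens])
        # M-step: set Theta(t+1) to the parameters maximizing the Q-function.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return w, mu, var

random.seed(1)
data = ([random.gauss(-2.0, 0.5) for _ in range(100)]
        + [random.gauss(2.0, 0.5) for _ in range(100)])
w, mu, var = em_gaussian_mixture(data)
```

Starting from an initial estimate, the alternation converges to a local maximum of the complete data log-likelihood, recovering the two component means.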
The deterministic annealing variant of the EM algorithm parameterizes the posterior probability p(yr|xr; Θ(t)) with a parameter β, as follows:
As can be seen, when β is 1, f(yr|xr; Θ)=p(yr|xr; Θ). Accordingly, when the probability p(yr|xr; Θ(t)) defined in the formula for Q(Θ; Θ(t)) is substituted with f(yr|xr; Θ), the function Qβ coincides with the Q-function of the EM algorithm. This suggests the deterministic annealing EM algorithm. The properties of the deterministic annealing EM algorithm may be found in Ueda et al., “Advances in Neural Information Processing Systems 7,” Chapter on Deterministic Annealing variant of the EM Algorithm, MIT Press, 1995, which describes the process as:
1. Set β=βmin, 0<βmin<<1;
2. Arbitrarily choose an initial estimate Θ(0), and Set t=0;
3. Iterate the following two steps until convergence:
4. Increase β; and
5. If β<βmax, set t=t+1, and repeat the process from step 3; otherwise stop.
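The five-step procedure above can be outlined in code. In the sketch below, `e_step` and `m_step` are hypothetical stand-ins for the tempered E- and M-steps, and the tempering itself raises each posterior term to the power β before normalizing; this illustrates the annealing schedule only, not the patent's implementation.

```python
def annealed_responsibilities(dens, beta):
    # Tempered posterior f(y|x): each term raised to the power beta,
    # then normalized. At beta = 1 this coincides with the EM posterior;
    # as beta -> 0 it flattens toward a uniform distribution.
    powered = [d ** beta for d in dens]
    z = sum(powered)
    return [p / z for p in powered]

def deterministic_annealing_em(e_step, m_step, theta,
                               beta_min=0.1, beta_max=1.0, rate=1.5,
                               inner_iters=20):
    # Steps 1-2: start at beta_min with an initial estimate theta.
    beta = beta_min
    while True:
        # Step 3: iterate the tempered E- and M-steps until convergence
        # (a fixed inner iteration count stands in for a convergence test).
        for _ in range(inner_iters):
            theta = m_step(e_step(theta, beta))
        # Step 5: stop once beta has reached beta_max.
        if beta >= beta_max:
            return theta
        # Step 4: increase beta.
        beta = min(beta * rate, beta_max)
```

The flattening at low β is what makes all assignments initially identical, and the gradual sharpening as β rises is what later drives classes to split.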
The deterministic annealing EM process described above presents three main advantages over the standard EM algorithm: (1) it is more likely to converge to a global maximum than the standard EM algorithm; (2) it avoids overfitting by setting βmax<1; and (3) because the number of clusters needed to explain the data depends on β, it induces a hierarchy of clusters.
Variations of deterministic annealing EM have been proposed to help induce a hierarchy of objects. One such model, called the Hierarchical Asymmetric Clustering Model (HACM), includes a technique referred to as distributional clustering. Additional information on the HACM may be found in Hofmann et al., “Statistical Models for Co-Occurrence Data,” A. I. Memo No. 1625, Massachusetts Institute of Technology, 1998. The HACM relies on two hidden variables. The first, Iiα, describes the assignment of an object “i” to a class α. The second, Vrαv, describes the choice of a class v in a hierarchy given a class α and objects i and j. The notation (i,j) represents a joint occurrence of object i with object j, where (i,j) ∈ I×J, and all data are numbered and collected in a sample set S = {(i(r), j(r)) : 1 ≤ r ≤ L}. The two variables, Iiα and Vrαv, are binary valued, which leads to a simplified version of the likelihood function.
To further explain the HACM,
As shown, the HACM allows i objects to be generated via the probability p(i), which depends on i. Furthermore, the generation of the j object of any couple (i(r),j(r)) such that i(r)=i is determined by a class α through Iiα. Accordingly, it can be seen that the generation of the object j is dependent on i and the set of ancestors of α, through the variable Vrαv.
The HACM is based on the following probability:
However, since there are exactly ni objects for which i(r)=i, and since Vrαv are binary valued, and equal to 0 for all but the (unknown) class v(r) used to generate j(r), p(Si|α(i)) may be rewritten as:
The complete model formula for p(Si) may be obtained by summing on α(i), and may be written as:
Although the probability p(Si) presented above represents a simplified version of the HACM because v is conditioned only by α, and not by α and i (p(v|α, i)=p(v|α)), one skilled in the art would realize that the characteristics and operations of the HACM described herein apply to the complex version as well.
It should be noted that the product is taken over (i,j) pairs, where i is fixed. Accordingly, the product may be viewed as being only over j. From the above model and the formula for p(Si), the complete data log-likelihood Lc may be obtained, and may be represented as:
Another variant of deterministic annealing EM is described in L. D. Baker et al., “A Hierarchical Probabilistic Model for Novelty Detection in Text,” Neural Information Processing Systems, 1998. The model described in Baker et al. may be referred to as a Hierarchical Mixture of Multinomial Language Models (HMLM). Like the HACM, the HMLM directly models p(Si) based on the following formula:
The log-likelihood for complete data may be obtained for the HMLM from p(Si), and may be written as:
Although the HACM and HMLM may provide soft hierarchical clustering of objects, it is important to keep in mind that these models may still result in hard assignments because of two properties associated with the models: First, the class α ranges only over leaves of the hierarchy, and the class v ranges only over the ancestors of α; and second, the contributions from objects j are directly collected in a product. The first property shows that objects i will only be assigned to the leaves of an induced hierarchy. For example, referring to
Methods, systems, and articles of manufacture consistent with certain principles related to the present invention eliminate the reliance on leaf nodes alone, and allow any set Si to be explained by a combination of any leaves and/or ancestor nodes included in an induced hierarchy. That is, i objects need not be considered as blocks, but rather as pieces that may be assigned in a hierarchy based on any j objects they co-occur with. For example, in one configuration consistent with certain features and principles related to the present invention, a topical clustering application performed by computer 100 may assign parts of a document i to different nodes in an induced hierarchy for different words j included in the document i. This is in contrast to the HACM and HMLM, where it is assumed that each document i is associated with the same leaf node in a hierarchy for all words j included in the document i.
One embodiment of the present invention may directly model the probability of observing any pair of co-occurring objects, such as documents and words (i,j), by defining a variable Irα (controls the assignment of documents to the hierarchy) such that it is dependent on the particular document and word pair (i,j) under consideration during a topical clustering process. In one configuration consistent with certain principles related to the present invention, class α may range over all nodes in an induced hierarchy in order to assign a document (i object) to any node in the hierarchy, not just leaves. Furthermore, class v may be defined as any ancestor of α in the hierarchy. The constraint on v ensures that the nodes are hierarchically organized.
An alternative formulation of the equation p(i(r),j(r)) is to replace p(α)p(i(r)|α) with p(i(r))p(α|i(r)), both of which are equal to p(α,i(r)). Thus, the alternate equation would be:
Because the two formulations are equal, the alternative formulation may be used to achieve the same result as the original equation for p(i(r),j(r)).
To more clearly illustrate the differences between the previous models and the present invention, p(Si) may be derived for the present invention since p(Si)=Πr:i(r)=ip(i(r), j(r)). Therefore, p(Si) may be written as:
The complete data log-likelihood may then be given by:
As can be seen from the derived formula for p(Si), the j objects, for a given α, are not collected in a product as in the case of the HACM and HMLM. Instead, the present invention determines the probability p(Si) such that the product is taken only after mixing over all the classes α. Thus, different j objects may be generated from different vertical paths of an induced hierarchy, that is, the paths in the hierarchy associated with non-null values of Iiα. The constraint in the HACM and HMLM that all j objects have to be generated from the same vertical path in a hierarchy forces Iiα to have binary values. Methods, systems, and articles of manufacture that implement the model represented in
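The difference between mixing over classes before versus after the product over j can be seen numerically. The sketch below uses two hypothetical classes with complementary word preferences (all numbers are illustrative, not taken from the invention): mixing per word lets w1 be explained by one class and w2 by the other, yielding a higher likelihood than a product taken inside each class.

```python
import math

# Hypothetical class weights p(alpha) and word probabilities p(j|alpha)
# for two classes; numbers chosen purely for illustration.
p_alpha = [0.5, 0.5]
p_j_given_alpha = [
    {"w1": 0.9, "w2": 0.1},   # class a1 favors w1
    {"w1": 0.1, "w2": 0.9},   # class a2 favors w2
]
words = ["w1", "w2"]

# Present model: mix over classes for each word j, then take the product.
mix_then_product = 1.0
for j in words:
    mix_then_product *= sum(pa * pj[j]
                            for pa, pj in zip(p_alpha, p_j_given_alpha))

# HACM/HMLM-style: take the product over j inside each class, then mix.
product_then_mix = sum(pa * math.prod(pj[j] for j in words)
                       for pa, pj in zip(p_alpha, p_j_given_alpha))
```

Here mixing before the product gives 0.25, while the product-inside form gives only 0.09, because the latter forces both words through the same class path.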
As mentioned previously, one embodiment of the present invention may perform a modified deterministic annealing EM process to implement the model shown in FIG. 5. In one configuration consistent with certain principles related to the present invention, Θ in the probability p(xr, yr; Θ) is associated with the current set of estimates given by the probability p(i(r),j(r)). Accordingly, the Q function consistent with features and principles of the present invention may be defined as:
Methods, systems, and articles of manufacture consistent with features of the present invention may also implement a modified E and M step of the deterministic annealing EM process to determine the probabilities associated with the model shown in
However, because
A in the equation above may be defined as:
Similar to the determination of A, B may be obtained in the following form:
As described, <Iijα>β and <IijαVijαv>β correspond to the E-step process of the modified deterministic annealing EM process consistent with certain principles related to the present invention. Moreover, <IijαVijαv>β corresponds to the assignment to any ancestor in the induced hierarchy given α.
The modified M-step process performed by one embodiment of the present invention aims at finding the parameter Θ which maximizes Qβ(Θ; Θ(t)). Inherent in such probability distributions is the constrained optimization restriction associated with constraints having the form:
In one configuration consistent with certain principles related to the present invention, Lagrange multipliers may be used to search for the corresponding unconstrained maximum. For example, to derive the probability p(α) implemented in the model shown in
Using the same principle as above, the remaining probabilities implemented in the model shown in
As described, the probabilities p(α; Θ), p(i|α; Θ), p(v|α; Θ), and p(j|v; Θ) define the M-step re-estimation processes used in the modified deterministic annealing EM process implemented by the present invention.
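For illustration, the standard Lagrange-multiplier derivation for one such constrained distribution may be sketched as follows, with c_α standing for the expected counts accumulated in the E-step (a generic sketch of the technique; the exact re-estimation formulas of the modified M-step may differ):

```latex
\text{Maximize } \sum_{\alpha} c_{\alpha} \log p(\alpha)
\quad\text{subject to}\quad \sum_{\alpha} p(\alpha) = 1 .

L = \sum_{\alpha} c_{\alpha} \log p(\alpha)
    + \lambda\Big(1 - \sum_{\alpha} p(\alpha)\Big),
\qquad
\frac{\partial L}{\partial p(\alpha)}
  = \frac{c_{\alpha}}{p(\alpha)} - \lambda = 0
\;\Rightarrow\; p(\alpha) = \frac{c_{\alpha}}{\lambda}.

\sum_{\alpha} p(\alpha) = 1
\;\Rightarrow\;
\lambda = \sum_{\alpha'} c_{\alpha'},
\qquad
p(\alpha) = \frac{c_{\alpha}}{\sum_{\alpha'} c_{\alpha'}} .
```

The same normalized-count pattern yields the remaining conditional distributions, each normalized over its own constraint.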
Methods, systems, and articles of manufacture consistent with certain principles related to the present invention may be configured to implement the model depicted in
A document collection may be located in any of the memories 104, 106, and 116. Also, a document collection may be located remote from the computing environment shown in
Referring to
Referring back to
Next, processor 102 may perform the modified E-step in the modified deterministic annealing EM process consistent with certain principles related to the present invention (Step 625). Accordingly, Qβ(Θ; Θ(t)) may be computed according to the formulas described above consistent with features and principles related to the present invention (i.e., Qβ(Θ; Θ(t)) = A + B), given the class α and the defined value of parameter β.
Processor 102 may also perform the maximization process given the class α and the defined value of parameter β in accordance with certain principles related to the present invention (Step 630). That is, the probability distributions p(α; Θ), p(i|α; Θ), p(v|α; Θ), and p(j|v; Θ) are determined. Once the modified deterministic annealing EM process consistent with certain principles related to the present invention is performed, processor 102 may determine whether the class α has split into two child classes (Step 635).
In one configuration consistent with certain principles related to the present invention, processor 102 may recognize a split of class α based on the probability distribution p(i|α). Initially, when the parameter β is set to a very low value, all documents and words (i and j) included in the document collection have the same probability of being assigned to class α. However, as the value of the parameter β increases, the probabilities associated with different documents, based on the different words included in those documents, begin to diverge from each other. This divergence may result in two classes (or clusters) of documents being realized from an ancestor class, whereby each child class includes documents that have a similar probability p(i|α) value based on different words included in each respective document. For example, suppose the document collection that is initially assigned to class α in Step 615 includes document DOC1, containing words W1, W2, and W3, and document DOC2, containing words W4, W5, and W6. This initial class α including DOC1 and DOC2 may produce the same probability p(i|α) for each document in the collection at an initial value of parameter β based on the words in each respective document. However, at a higher value of β, the same class α may result in a first probability p(i|α) associated with DOC1 based on W1, and a second probability for DOC1 based on W2. Similarly, at the higher value of β, DOC2 may be associated with the first probability based on W4, W5, and W6. It should be noted that, in accordance with certain principles related to the present invention, a single document, such as DOC1, may be assigned to two classes (or clusters) based on the words included within the single document.
In Step 635, processor 102 may be configured to determine whether the probability p(i|α) associated with each document in the collection is the same, or falls into one of two probability values corresponding to the rest of the documents in the collection. In the event processor 102 determines that there has been a split of the class α (Step 635; YES), it may determine whether the conditions defined in Step 605 have been met (Step 640). At this stage in the process, a hierarchy is being induced (i.e., the split of class α into two child classes). Accordingly, if processor 102 determines that a condition (e.g., a maximum number of leaves) has been met (Step 640; YES), the induced hierarchy has been completed, and the documents have been clustered based on the topics associated with the words included in each document, and the clustering process ends (Step 645).
If processor 102 determines that the initial class α has not split at the current value of parameter β (Step 635; NO), the value of the parameter β may be increased (Step 650), and the process returns to Step 625 using the increased value of parameter β. The manner in which the parameter β is increased may be controlled using a step value, which may be predetermined by a user or computed from the initial value of the parameter β and additional parameters provided by the user (i.e., the number of clusters, the depth of the hierarchy, etc.). Furthermore, in the event that the initial class α has split into two child classes (each of which is defined as a separate class α) (Step 635; YES), but the conditions of the hierarchy have not been met (Step 640; NO), processor 102 may set the parameter β for each new child class α to the value that caused the initial class α to split (Step 655). Processor 102 may then perform the same steps for each new child class α (Steps 625-655) until the conditions of the hierarchy have been met (Step 640; YES), and the clustering process ends (Step 645).
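The split-detection and annealing loop of Steps 625-655 can be sketched as follows. Both `detect_split` and `anneal_until_split` are illustrative stand-ins: the tolerance, the largest-gap heuristic, and the function `p_i_fn` (which plays the role of the modified E- and M-steps producing p(i|α)) are assumptions for the sketch, not the implementation described above.

```python
def detect_split(p_i_given_alpha, tol=1e-3):
    # Step 635 (sketch): a class is considered split when the document
    # probabilities p(i|alpha) no longer agree within tol and instead
    # fall into two groups, separated here at the largest gap.
    values = sorted(p_i_given_alpha.values())
    if values[-1] - values[0] < tol:
        return None                       # all documents share one value
    gaps = [(values[k + 1] - values[k], k) for k in range(len(values) - 1)]
    _, k = max(gaps)
    threshold = (values[k] + values[k + 1]) / 2
    low = [i for i, p in p_i_given_alpha.items() if p < threshold]
    high = [i for i, p in p_i_given_alpha.items() if p >= threshold]
    return low, high

def anneal_until_split(p_i_fn, beta0=0.1, step=1.5, beta_max=64.0):
    # Steps 625-650 (sketch): raise beta until a split is detected, and
    # report the beta value that caused the split together with the two
    # child document groups, which would then seed the child classes.
    beta = beta0
    while beta <= beta_max:
        split = detect_split(p_i_fn(beta))
        if split is not None:
            return beta, split
        beta *= step
    return None
```

In the full process, each child group would be treated as a new class α, started at the β value that caused the split, and annealed further until the hierarchy's conditions are met.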
In one configuration consistent with certain principles related to the present invention, the end of the clustering process (Step 645) may be preceded by the creation of a representation associated with the induced hierarchy by computer 100, which may be stored in a memory (i.e., memories 106, 104, and/or 116). The representation may reflect the topics associated with the clustered document collection, and may be created in a variety of forms, such as, but not limited to, one or more tables, lists, charts, graphical representations of the hierarchy and/or clusters, and any other type of representation that reflects the induced hierarchy and the clusters associated with topics of the document collection. Computer 100 may make the stored representation available to a requesting entity, as previously described, in response to a request to perform a clustering operation (i.e., determine topics of a document collection). The representation may be made available to an entity via the network connector 114, or bus 108, and may be sent by computer 100 or retrieved by the entity. Additionally, computer 100 may be configured to send the representation of the hierarchy to a memory (such as a database) for retrieval and/or use by an entity. For example, a server located remotely from computer 100 may access a database that contains one or more representations associated with one or more hierarchies provided by computer 100. The hierarchies may include clusters of topics associated with one or more document collections. For example, the server may access the database to process a search operation on a particular document collection. In another embodiment consistent with certain principles related to the present invention, computer 100 may make the representation available to a user through display 112.
In this configuration, computer 100 may create a graphical representation reflecting the induced hierarchy and the topics reflected by the hierarchy's clusters, and provide the representation to display 112 for viewing by a user.
To further describe certain configurations consistent with the present invention,
As shown, hierarchy 700 includes seven nodes (710-770), and four leaves (740-770). Each node may be associated with the first five words in the collection for which p(j|v) is the highest. During the generation of hierarchy 700 by the present invention, the document collection associated with node 710 (defined within class α1 with parameter β1) may have been separated into two child topic/clusters when a split of class α1 was determined following the increase of the value of parameter β1. In exemplary hierarchy 700, the two child topic/clusters are associated with nodes 720 and 730, defined by classes α11 and α12, respectively, and the split of class α1 may have occurred at a parameter value of β2.
During subsequent generation, each class α11 and α12 may have split into two child topics/clusters when the value of the parameter β was increased from β2 to β3. As shown, node 720, defined by class α11, may have split into nodes 740 and 750, defined by classes α21 and α22, respectively. Node 730, defined by class α12, on the other hand, may have split into nodes 760 and 770, defined by classes α23 and α24, respectively.
As can be seen in
It should be noted that in one embodiment, the “title” of the topics associated with each cluster/node of hierarchy 700 may be provided by a user. For instance, the user may be provided with the N most probable words associated with each cluster/node. From these words, the user may then infer a “title” for the cluster/node which is associated with a topic. Alternatively, the “title” for each cluster/node may be determined automatically by processor 102. In this configuration, processor 102 may extract the most frequent n-grams from the documents associated with a particular cluster/node, and determine a “title” for the cluster/node based on the extracted n-grams.
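One possible reading of the automatic titling step is sketched below: the most frequent n-grams in a cluster's documents are offered as candidate titles. The function name, its parameters, and the whitespace tokenization are hypothetical choices for the sketch; the embodiment above does not specify this interface.

```python
from collections import Counter

def suggest_title(cluster_docs, n=2, top=3):
    # Count every n-gram (default: bigram) across the cluster's documents
    # and return the `top` most frequent as candidate titles.
    counts = Counter()
    for doc in cluster_docs:
        words = doc.lower().split()
        for k in range(len(words) - n + 1):
            counts[" ".join(words[k:k + n])] += 1
    return [gram for gram, _ in counts.most_common(top)]
```

A user could equally be shown the N most probable words per node and infer a title, as described above.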
In one configuration consistent with certain principles related to the present invention, computer 100 may be configured to evaluate the adequacy of a topical hierarchy induced by one embodiment of the present invention. In this configuration, processor 102 may execute instructions or program code that allow the clusters included in a hierarchy induced from a test document collection to be compared to a set of manual labels previously assigned to the test collection. To perform this evaluation, processor 102 may use the average of the Gini function over the labels and clusters included in the induced hierarchy, which may be defined as:
In the above Gini functions, L reflects the number of different labels and Λ reflects the number of different clusters. Additionally, Gl measures the impurity of the obtained clusters α with respect to the labels l, and reciprocally for Gα. Smaller values of the Gini functions Gl and Gα indicate better results because clusters and labels are in closer correspondence. That is, if data clusters and label clusters contain the same documents with the same weights, the Gini index is 0. The Gini functions Gl and Gα each have an upper bound of 1.
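One hedged reading of the cluster-side average Gα is sketched below: for each cluster, the impurity 1 − Σl p(l|α)² is computed from the label weights of the cluster's documents, and the results are averaged over clusters. The data layout (a mapping from cluster to label weights) is an assumption made for illustration.

```python
def gini_impurity(assignments):
    # Average Gini impurity of clusters with respect to labels.
    # `assignments` maps each cluster to a dict of label -> weight of
    # documents carrying that label within the cluster.
    total = 0.0
    for weights in assignments.values():
        mass = sum(weights.values())
        total += 1.0 - sum((w / mass) ** 2 for w in weights.values())
    return total / len(assignments)
```

When clusters and labels correspond exactly, the index is 0, and it is bounded above by 1, matching the properties stated above; the label-side average Gl follows symmetrically with the roles of clusters and labels exchanged.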
Accordingly, when computer system 100 seeks to evaluate the effectiveness of the soft hierarchical clustering operations consistent with certain principles related to the present invention, a test document collection may be accessed and the process shown in
In one configuration consistent with certain principles related to the present invention, the Gini indexes associated with the process shown in
As described, the present invention enables a computing system to produce topic clusters from a collection of documents and words, such that each cluster may be associated with documents that are assigned to other clusters. Accordingly, the hard assignment of objects in an induced hierarchy of clusters is eliminated.
It should be noted that the present invention is not limited to the implementations and configurations described above. One skilled in the art would recognize that a number of different architectures, programming languages, and other software and hardware combinations may be utilized without departing from the scope of the present invention.
Moreover, it should be noted that the sequence of steps illustrated in
Additionally, the present invention may allow a hierarchy of topic clusters associated with a document collection to be updated based on one or more new documents added to the collection. In this configuration, computer 100 may allow a document collection to be updated with the addition of one or more new documents, and may perform a clustering operation consistent with certain principles related to the present invention on the modified collection. Accordingly, the present invention may be implemented to modify a topic hierarchy associated with a document collection each time a new document (or a set of documents) is added to the collection.
Furthermore, the present invention may be employed for clustering users based on the actions they perform on a collection of documents (e.g., write, print, browse). In this configuration, the “i” objects would represent the users and the “j” objects would represent the documents. Additionally, the present invention may be employed for clustering images based on text associated with the images. For example, the associated text may reflect a title of the image or may be text surrounding the image, such as in a web page. In this configuration, the “i” objects would represent the images and the “j” objects would represent the words contained in the title of each image. Also, the present invention may be employed to cluster companies based on their activity domain or consumer relationships. For example, in the latter application, the “i” objects would represent the companies and the “j” objects would represent a relation between the companies and their consumers (e.g., “sells to”). That is, one or more business entities may have a set of customers who purchased different types of products and/or services from the business entities. Accordingly, in accordance with certain aspects of the present invention, the clusters of a hierarchy may represent groups of customers who purchased similar types of products and/or services from the business entities (e.g., buys hardware, buys computer software, buys router parts, etc.). Therefore, in this configuration, “i” may represent the customers and “j” may represent the business entities. Alternatively, another configuration may include a set of customers who purchase various types of products and/or services from particular types of business entities. In this configuration, the clusters of the hierarchy may represent groups of product and/or service types (e.g., sells hardware, sells computer software, sells paper products, etc.). In this configuration, “i” may represent the business entities and “j” may represent the customers. Accordingly, one skilled in the art would realize that the present invention may be applied to the clustering of any type of co-occurring objects.
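Each of the pairings above reduces to the same input: a table of (i, j) co-occurrence counts. As an illustrative sketch (the user/document names and event records are invented), such counts might be assembled as:

```python
from collections import defaultdict

def cooccurrence_counts(events):
    """Build co-occurrence counts n[i][j] from a list of (i, j)
    observation pairs, e.g. (user, document-acted-on) records."""
    counts = defaultdict(lambda: defaultdict(int))
    for i_obj, j_obj in events:
        counts[i_obj][j_obj] += 1
    return counts

# Example: users ("i" objects) acting on documents ("j" objects).
events = [
    ("alice", "doc1"), ("alice", "doc2"),
    ("bob", "doc1"), ("alice", "doc1"),
]
counts = cooccurrence_counts(events)
print(counts["alice"]["doc1"])  # → 2
```

The same routine applies unchanged when the “i” objects are images and the “j” objects are title words, or when they are customers and business entities; only the interpretation of the resulting clusters differs.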
Additionally, although aspects of the present invention are described as being associated with data stored in memory and other storage mediums, one skilled in the art will appreciate that these aspects can also be stored on or read from other types of computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or CD-ROM; a carrier wave from the Internet; or other forms of RAM or ROM. Accordingly, the invention is not limited to the above described aspects of the invention, but instead is defined by the appended claims in light of their full scope of equivalents.
U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
5761418 | Francis et al. | Jun 1998 | A |
5819258 | Vaithyanathan et al. | Oct 1998 | A |
5864855 | Ruocco et al. | Jan 1999 | A |
5983246 | Takano | Nov 1999 | A |
6078913 | Aoki et al. | Jun 2000 | A |
6154213 | Rennison et al. | Nov 2000 | A |
6233575 | Agrawal et al. | May 2001 | B1 |
6460025 | Fohn et al. | Oct 2002 | B1 |
6460036 | Herz | Oct 2002 | B1 |
6556958 | Chickering | Apr 2003 | B1 |
6742003 | Heckerman et al. | May 2004 | B2 |
20020129038 | Cunningham | Sep 2002 | A1 |
20030018637 | Zhang et al. | Jan 2003 | A1 |
Foreign Patent Documents

Number | Date | Country
---|---|---
0 704 810 | Apr 1996 | EP |
Number | Date | Country
---|---|---
20030101187 A1 | May 2003 | US