Thematic clustering

Description

BACKGROUND OF THE INVENTION

Information has traditionally been manually classified to aid potential readers in locating works of interest. For example, books are typically associated with categorization information (e.g., one or more Library of Congress classifications) and academic articles sometimes bear a list of keywords selected by their authors or editors. Unfortunately, while manual classification may be routinely performed for certain types of information such as books and academic papers, it is not performed (and may not be feasibly performed) for other types of information, such as the predominantly unstructured data found on the World Wide Web.

Attempts to automatically classify documents can also be problematic. For example, one technique for document classification is to designate as the topic of a given document the term occurring most frequently in that document. A problem with this approach is that in some cases, the most frequently occurring term in a document is not a meaningful description of the document itself. Another problem with this approach is that terms can have ambiguous meanings. For example, documents about Panthera onca, the British luxury car manufacturer, and the operating system might all be automatically classified using the term “Jaguar.” Unfortunately, a reader interested in one meaning of the term may wind up having to sift through documents pertaining to all of the other meanings as well.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 illustrates an example of an environment in which documents are clustered.

FIG. 2 illustrates an embodiment of a clustering system.

FIG. 3 illustrates an embodiment of a process for clustering documents.

FIG. 4A is a conceptual illustration of baby clusters and singletons.

FIG. 4B is a conceptual illustration of reclustering (after moving the singletons to the baby clusters).

FIG. 5 illustrates a representation of a data set comprising 25 documents that can be clustered into four clusters (with their topics) and five singletons.

FIG. 6 illustrates the data set after an embodiment of initial clustering has been performed.

FIG. 7 illustrates a graphical view of initial clustering of the data set.

FIG. 8 illustrates a graphical view of baby clusters and singletons.

FIG. 9A illustrates a graphical view of clusters after reassigning singletons is performed.

FIG. 9B illustrates documents in the data set as members of renovated clusters or as singletons after reclustering.

FIG. 9C illustrates example themes for each of the four clusters using the top shared terms of a given cluster.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or, a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

FIG. 1 illustrates an example of an environment in which documents are clustered. Corpus processor 106 is configured to collect (or otherwise receive) documents from a variety of data sources 110-114. Examples of such documents include news articles, forum messages, blog posts, and any other text (in formats such as HTML, TXT, PDF, etc.) as applicable. In the example shown, corpus processor 106 is configured to scrape content from external website 112 and to perform searches using an API made available by search engine 114, both of which are accessible via a network 108 (e.g., the Internet). Corpus processor 106 is also configured to receive documents from an internal source, such as repository 110.

In various embodiments, corpus processor 106 collects documents on demand. For example, a user of platform 116 (hereinafter “Benjamin Snyder,” an architect living in California) may initiate a request (via interface 118) for documents that pertain to him. In response to the request, corpus processor 106 obtains documents from one or more of the data sources. Corpus processor 106 can also be configured to store and periodically refresh the documents it collects regarding Benjamin Snyder, such as upon the request of Benjamin Snyder, or programmatically (e.g., once a month).

Corpus processor 106 is configured to process the collected documents and make them available to clustering system 102 as an input data set (104). As one example, in some embodiments corpus processor 106 is configured to convert the documents it receives into plaintext, or otherwise extract text from those documents, as applicable. As will be described in more detail below, clustering system 102 is configured to discover clusters in the input data set and define the theme of each cluster.

FIG. 2 illustrates an embodiment of a clustering system. In the example shown in FIG. 2, system 102 comprises standard commercially available server hardware (e.g., having a multi-core processor 202, 8G+ of RAM 204, gigabit network interface adaptor(s) 206, and one or more hard drives 208) running a typical server-class operating system (e.g., Linux). In various embodiments, system 102 is implemented across a scalable infrastructure comprising multiple such servers, solid state drives, and/or other applicable high-performance hardware. Further, as illustrated in FIG. 1, clustering system 102 can be collocated on a platform 116 with other components, such as corpus processor 106. Clustering system 102 can also be configured to work with one or more third party elements. For example, the functionality of corpus processor 106 and/or interface 118 can be provided by a third party.

Whenever clustering system 102 is described as performing a task, either a single component or a subset of components or all components of system 102 may cooperate to perform the task. Similarly, whenever a component of system 102 is described as performing a task, a subcomponent may perform the task and/or the component may perform the task in conjunction with other components. In various embodiments, portions of system 102 are provided by one or more separate devices. As one example, the functionality of preprocessing engine 210 and clustering engine 212 may be provided by two different devices, rather than the functionality being provided by a single device. Also, in various embodiments, system 102 provides the functionality of corpus processor 106 and a separate corpus processor is omitted, as applicable.

Clustering system 102 is configured to perform a variety of tasks, including preprocessing (via preprocessing engine 210) and document clustering (via clustering engine 212).

Preprocessing

Clustering system 102 receives as input a set of documents 104 (e.g., from corpus processor 106). Preprocessing engine 210 uses the data set to create a term-by-document matrix M and a keyword list (ordered by column number of M).

A text object (i.e., a document or set of documents) is represented in Latent Semantic Indexing (“LSI”) by its term frequency (“TF”). The term frequency representations of two example documents d₁ and d₂, and of a collection C₁ comprising those two documents, are as follows:

d₁=[2,0,3,0,0,5,0,1,4]
d₂=[1,3,0,2,0,3,0,0,3]
C₁=d₁+d₂=[3,3,3,2,0,8,0,1,7]

A concept representation of a text object can also be created, which shows whether the object contains the given term, but does not indicate the number of times the term appears in the document. Using the term frequencies for d₁, d₂, and C₁ above, the concept representations are as follows:

Term Frequency Rep.→Concept Rep.

d₁=[2,0,3,0,0,5,0,1,4]→b₁=[1,0,1,0,0,1,0,1,1]
d₂=[1,3,0,2,0,3,0,0,3]→b₂=[1,1,0,1,0,1,0,0,1]
C₁=[3,3,3,2,0,8,0,1,7]→B₁=[1,1,1,1,0,1,0,1,1]
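By way of illustration only, the conversion from term frequency representation to concept representation can be sketched in Python as follows. The function names `concept_rep` and `collection_tf` are hypothetical and are not part of any described embodiment; the vectors are the example values given above.

```python
def concept_rep(tf):
    """Map a term frequency vector to its binary concept representation."""
    return [1 if count > 0 else 0 for count in tf]


def collection_tf(*docs):
    """Sum term frequency vectors element-wise to represent a collection."""
    return [sum(counts) for counts in zip(*docs)]


d1 = [2, 0, 3, 0, 0, 5, 0, 1, 4]
d2 = [1, 3, 0, 2, 0, 3, 0, 0, 3]
c1 = collection_tf(d1, d2)  # term frequency representation of C1

b1, b2, B1 = concept_rep(d1), concept_rep(d2), concept_rep(c1)
```

Note that the concept representation of the collection is the element-wise logical OR of the concept representations of its documents.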

Suppose the total data set (e.g., received by clustering system 102 as input 104) has n=N_d documents and a vocabulary of m=N_t keywords. Matrix M can be constructed as:

$M = \begin{bmatrix} d_{11} & \dots & d_{1 m} \\ \vdots & \ddots & \vdots \\ d_{n 1} & \dots & d_{nm} \end{bmatrix} .$

The keyword list has N_t elements and the order of the keywords in the list matches the order of columns of the TF representation of the documents.

In some embodiments, Matrix M and the keyword list are created as follows: Suppose a total of 50 documents are received as a set 104 by clustering system 102. The documents are stored in a directory of storage 208 as individual files. Preprocessing engine 210 uses a Java program to construct a single file (“doc-file”) from the documents in the directory, with each line of the doc-file representing the content of one of the 50 documents. Various rules can be employed to construct the single doc-file, such as by removing email addresses in documents having more than 5,000 characters; truncating documents that are longer than 50,000 characters; and removing any HTML tags or other strings that may erroneously remain after the processing performed by corpus processor 106. Next, the doc-file is used to generate a “mat-file” and a “mat-file.clabel” file by using the doc2mat Perl program. The mat-file includes information about Matrix M and the mat-file.clabel file lists the keywords (e.g., after a tokenization using stemming and ignoring numerical terms is performed). Finally, the mat-file and mat-file.clabel files are read to create the Matrix M and the keyword list using an appropriate Java program. In various embodiments, additional processing is performed when constructing the matrix and keyword list from the mat-file and mat-file.clabel. For example, terms that appear in only a single one of the 50 documents can be omitted.
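As a minimal sketch of the result of this preprocessing, and not of the Java and doc2mat tooling itself, the following Python builds a term-by-document matrix and a matching keyword list directly from raw strings. The function name `build_matrix`, the `min_doc_count` parameter, and the sample documents are assumptions made for illustration. The tokenizer here simply lowercases and keeps alphabetic runs (which drops numerical terms); it performs no stemming, unlike the tokenization described above.

```python
import re
from collections import Counter


def build_matrix(documents, min_doc_count=2):
    """Return (M, keywords): M[d][t] is the count of keywords[t] in document d."""
    # Tokenize: lowercase alphabetic runs only, so numerical terms are ignored.
    counts = [Counter(re.findall(r"[a-z]+", text.lower())) for text in documents]

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)

    # Omit terms that appear in fewer than min_doc_count documents.
    keywords = sorted(t for t, df in doc_freq.items() if df >= min_doc_count)

    # One row per document, one column per keyword (in keyword-list order).
    matrix = [[c[t] for t in keywords] for c in counts]
    return matrix, keywords


# Hypothetical three-document data set.
docs = [
    "jaguar car dealer sells jaguar car",
    "jaguar cat hunts in the rainforest",
    "the car dealer opened in 2011",
]
M, keywords = build_matrix(docs)
```

With the default `min_doc_count=2`, terms that occur in only a single document are left out of the keyword list, mirroring the optional additional processing mentioned above.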

Clustering

FIG. 3 illustrates an embodiment of a process for clustering documents. In various embodiments the process shown in FIG. 3 is performed by clustering engine 212. In some embodiments clustering engine 212 makes use of industry standard mathematical software tools such as MATLAB, which are installed on or otherwise in communication with system 102.

Initial Clustering

The process begins at 302 when a data set is clustered into one or more initial clusters using a first term set. A variety of clustering techniques can be used and the process described herein adapted as applicable. In some embodiments the clustering is performed using LSI-SVD, as follows. A Matrix Q_a is derived from Matrix M based on the following equation:

$Q_{a} = \sqrt{c} D_{n}^{- 1} M ,$

where constant c and matrix D_n are defined as follows:

$f_{i} = \sum_{j = 1}^{m} {(M \cdot M^{T})}_{ij}, g_{i} = \frac{f_{i}}{\sum_{j} f_{j}}, c = \frac{1}{\sum_{j} f_{j}}, D_{n} = diag (\sqrt{g}) .$
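As a worked step, and assuming every f_i is positive (so that D_n is invertible, a condition that is not stated explicitly above), substituting the definitions of c and g_i shows that the scaling reduces to a per-row normalization:

$\sqrt{c} D_{n}^{- 1} = diag (\sqrt{\frac{c}{g_{i}}}) = diag (\frac{1}{\sqrt{f_{i}}}), {(Q_{a})}_{ij} = \frac{M_{ij}}{\sqrt{f_{i}}} .$

That is, because c/g_i = 1/f_i, each row i of M is divided by the square root of f_i.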

Next, singular value decomposition ([U,S,V]=svd(Q_a,0)) is used to obtain and plot the coordinates of terms in a reduced two-dimensional space. Specifically, the coordinates can be derived from the m×m Matrix U:

$\vec{t}_{i} = (x_{i}, y_{i}), x_{i} = U (i, 1), y_{i} = U (i, 2), i \in \{1, 2, \dots, m = N_{t}\} .$

The angles between terms (term-term angles) are calculated on a unit sphere:

$θ_{i, j} = \cos^{- 1} (\frac{\vec{t}_{i} \cdot \vec{t}_{j}}{| \vec{t}_{i} | | \vec{t}_{j} |}) .$

For each term t_i, a group of closely related terms is built, where an assumption is made that terms t_i and t_j are closely related if θ_{i,j} is smaller than a threshold value, for example:

$θ_{i, j} < \frac{π}{4} .$

Each group is an m-dimensional concept representation.
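The grouping of closely related terms can be sketched as follows, assuming the two-dimensional term coordinates have already been obtained and that no term sits at the origin. The function names and the sample coordinates are hypothetical; only the angle formula and the π/4 threshold come from the description above.

```python
import math


def term_angle(t_i, t_j):
    """Angle between two 2-D term vectors, in radians."""
    dot = t_i[0] * t_j[0] + t_i[1] * t_j[1]
    norms = math.hypot(*t_i) * math.hypot(*t_j)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def related_groups(coords, threshold=math.pi / 4):
    """For each term i, return a 0/1 vector marking the terms closely related to it."""
    m = len(coords)
    return [
        [1 if term_angle(coords[i], coords[j]) < threshold else 0 for j in range(m)]
        for i in range(m)
    ]


# Hypothetical coordinates for three terms (not taken from the document).
coords = [(1.0, 0.0), (1.0, 0.5), (0.0, 1.0)]
groups = related_groups(coords)
```

Each row of `groups` is the m-dimensional concept representation of one term's group. Under this reading a term always belongs to its own group, since its angle with itself is zero.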

Next, LSI-SVD is used to calculate a document-document distance matrix and plot it on a two-dimensional graph. Specifically, a weight Matrix W is built from Matrix M as follows:

$W_{μ, i} = \frac{M_{μ, i} {idf}_{i}}{\sqrt{W_{μ}}}, W_{μ} = \sum_{i = 1}^{m} {(M_{μ, i} {idf}_{i})}^{2}, {idf}_{i} = \log (\frac{n}{n_{i}}), n_{i} = \sum_{μ = 1}^{n} M_{μ, i} .$

Singular value decomposition ([U,S,V]=svd(W^T,0)) is used to obtain and plot the coordinates of documents in a reduced two-dimensional space. Specifically, the coordinates can be derived from the n×n Matrix V:

$\vec{d}_{μ} = (x_{μ}, y_{μ}), x_{μ} = V (μ, 1), y_{μ} = V (μ, 2), μ \in \{1, 2, \dots, n\} .$

The distances between documents (“ddDist”) are calculated as follows:

$dist (d_{μ}, d_{ν}) = | \vec{d}_{μ} - \vec{d}_{ν} | .$

Finally, cluster membership is determined, using the assumption that a document belongs to a cluster (referred to as an “initial cluster” at this stage of processing) if its distance from at least one document in the cluster is less than a threshold value. In some embodiments the threshold value is an optimized radius. For example, it may be estimated as: ρ=κ×σ(min DistMin, max DistMin), where κ is a parameter based on the size of the dataset (for example, κ=3.5 for entity datasets), and σ(min DistMin, max DistMin) is an empirical function derived from the best fitting for several small datasets using the minimum and maximum of the minimum distances between documents. Documents that do not belong to any clusters are left as singletons.
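One possible reading of this membership rule is that clusters are the connected groups of documents linked by distances below the radius. The sketch below follows that reading. It takes the radius as a given parameter, because the empirical function σ is not specified here; the function name, coordinates, and radius value are hypothetical.

```python
import math


def cluster_by_radius(coords, radius):
    """Group points so that each cluster member lies within `radius` of another member."""
    unvisited = set(range(len(coords)))
    clusters, singletons = [], []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        members, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect neighbors first, then remove them from the unvisited set.
            near = [j for j in unvisited
                    if math.dist(coords[current], coords[j]) < radius]
            for j in near:
                unvisited.remove(j)
            members.extend(near)
            frontier.extend(near)
        if len(members) > 1:
            clusters.append(sorted(members))
        else:
            singletons.append(seed)  # no other document is close enough
    return clusters, singletons


# Hypothetical document coordinates and radius, for illustration only.
coords = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 1.0), (1.05, 1.0), (5.0, 5.0)]
clusters, singletons = cluster_by_radius(coords, radius=0.15)
```

In this example documents 0 and 2 end up in the same cluster even though they are farther apart than the radius, because each is within the radius of document 1.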

At 304, a theme of each initial cluster is determined by a set of keywords. One way of determining themes is as follows. First, a set of core terms is calculated for each of the initial clusters, using the following three cases:

Case 1: Use the original terms shared by all documents in the cluster. (This case will likely apply to sets with hundreds of terms per document, or more.)

Case 2: If no terms are found in Case 1, add connected terms which are closely related to the terms in the documents, e.g., in accordance with the definition of “closely related” above. (This may be required for some small data sets.)

Case 3: If no terms are found in Case 2, use the union of the terms of all documents in the cluster. (This may be required for some very small data sets.)

Next, the concept of the cluster is calculated as the union of the core terms of the cluster and any terms closely related to those terms.
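The three cases and the concept calculation can be sketched as follows. This is one interpretation only: in particular, Case 2 is read here as expanding each document's terms with their closely related terms and then intersecting the expanded sets. The function names and the `related` mapping are hypothetical, and the cluster is assumed to contain at least one document.

```python
def core_terms(doc_terms, related):
    """Pick core terms for one cluster using the three fallback cases.

    doc_terms: list of sets, the terms of each document in the cluster.
    related:   dict mapping a term to the set of terms closely related to it.
    """
    # Case 1: original terms shared by all documents in the cluster.
    shared = set.intersection(*doc_terms)
    if shared:
        return shared

    # Case 2: add closely related terms to each document, then intersect.
    expanded = [
        terms.union(*(related.get(t, set()) for t in terms)) for terms in doc_terms
    ]
    shared = set.intersection(*expanded)
    if shared:
        return shared

    # Case 3: fall back to the union of the terms of all documents.
    return set.union(*doc_terms)


def cluster_concept(core, related):
    """Union of the core terms and any terms closely related to them."""
    return core.union(*(related.get(t, set()) for t in core))


# Hypothetical cluster of two documents that share no original term.
related = {"star": {"sun"}, "sun": {"star"}}
docs = [{"star", "galaxy"}, {"sun", "planet"}]
core = core_terms(docs, related)
concept = cluster_concept(core, related)
```

Here Case 1 finds nothing, so Case 2 applies and yields the connected terms "star" and "sun" as the core terms.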

Finally, a theme representation of the entire data set is determined as follows. A term frequency representation (“clusterTF”) is constructed for each cluster in accordance with rules, such as the following: (1) the term is not shared by all cluster concepts; (2) the term is contained in at least one document in the cluster; and (3) the total term frequency of the term is not less than a threshold value “tfMin.” Two results are then displayed: the concept shared by all clusters (i.e., the “theme” of the data set), and the “theme” of each cluster, which is defined by the “tfMax” most frequent terms in the cluster's term frequency representation (tfMax is a threshold value, set to 12 by default).
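A minimal sketch of the clusterTF rules and of theme selection follows. The function names, the data layout (a dict from term to total frequency), and the sample values are assumptions for illustration; ties in frequency are broken alphabetically here, which is not specified above.

```python
def cluster_tf(cluster_docs, keywords, cluster_concepts, tf_min):
    """Build a term -> total frequency map for one cluster.

    cluster_docs:     term frequency vectors of the documents in the cluster.
    keywords:         keyword list aligned with the vector columns.
    cluster_concepts: list of term sets, one concept per cluster in the data set.
    """
    shared_by_all = set.intersection(*cluster_concepts)
    totals = [sum(column) for column in zip(*cluster_docs)]
    return {
        term: total
        for term, total in zip(keywords, totals)
        if term not in shared_by_all  # rule 1: not shared by all cluster concepts
        and total > 0                 # rule 2: present in at least one document
        and total >= tf_min           # rule 3: total frequency meets tfMin
    }


def theme(tf_map, tf_max=12):
    """Return the tf_max most frequent terms, most frequent first."""
    ranked = sorted(tf_map.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:tf_max]]


# Hypothetical two-document cluster in a data set with two clusters.
keywords = ["star", "movie", "actor", "planet"]
cluster_docs = [[5, 4, 3, 0], [6, 5, 1, 0]]
concepts = [{"star", "movie", "actor"}, {"star", "planet"}]

tf_map = cluster_tf(cluster_docs, keywords, concepts, tf_min=8)
cluster_theme = theme(tf_map)
```

In this example "star" is excluded because it is shared by every cluster concept; it would instead be reported as the theme of the data set.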

For very small data sets, the processing performed at 302-304 of process 300 may be sufficient to achieve satisfactory clusters and theme representations. However, for more typical data sets, the existence of noise terms obscures clusters. As will now be described, renovated clusters, using a reduced term space in which noise terms are excluded, can be used to improve clustering results.

Reduced Term Space

At 306, the term space is reduced. One approach for reducing the term space is as follows. Suppose a term frequency representation for each cluster (“clusterTF”) has already been created, as described above. A reduced term space can be constructed by keeping in the term space only those terms in the clusterTFs.
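Assuming each clusterTF is held as a mapping from term to frequency (a hypothetical layout), the reduction amounts to a column selection on the term-by-document matrix, as sketched below. The function name and sample data are illustrative only.

```python
def reduce_term_space(matrix, keywords, cluster_tfs):
    """Keep only the columns whose keyword appears in at least one clusterTF."""
    kept_terms = set().union(*cluster_tfs)  # iterating a dict yields its keys
    kept_cols = [i for i, term in enumerate(keywords) if term in kept_terms]
    reduced_keywords = [keywords[i] for i in kept_cols]
    reduced_matrix = [[row[i] for i in kept_cols] for row in matrix]
    return reduced_matrix, reduced_keywords


# Hypothetical matrix, keyword list, and clusterTFs.
keywords = ["star", "movie", "actor", "planet"]
matrix = [[5, 4, 3, 0], [1, 0, 0, 7]]
cluster_tfs = [{"movie": 9}, {"planet": 7}]

M2, keywords2 = reduce_term_space(matrix, keywords, cluster_tfs)
```

The reduced keyword list preserves the original column order, so it remains aligned with the columns of the reduced matrix.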

Renovated Clustering

At 308, the data set is reclustered into one or more baby clusters using the reduced term space. In some embodiments, the same clustering techniques employed at 302 are used again at 308. In other embodiments, an alternate clustering technique is used. One example of such an alternate technique will now be described.

First, a new document term frequency Matrix M₂ and keyword list with a reduced term space of dimension m′<m are built. Instead of using singular value decomposition, for the second round of clustering a vector space model (VSM) is used to define relations of documents. The coordinates of documents are found this time using a weight Matrix W built from Matrix M₂ as follows:

$w_{μ, i} = \frac{{tf}_{μ, i} {idf}_{i}}{\sqrt{W_{μ}}}, W_{μ} = \sum_{i = 1}^{m^{'}} {({tf}_{μ, i} \times {idf}_{i})}^{2} .$

A variety of different weight formulae can be used—the preceding is provided merely as an example. Next, a document-document relevance coefficient (“ddRC”) is calculated based on the calculated weights, together with a minimum and maximum:

$RC (d_{μ}, d_{ν}) = \sum_{i = 1}^{m^{'}} w_{μ, i} w_{ν, i} .$

A document is assumed to belong to a cluster if its relevance coefficient (“RC”) with one of the documents in the cluster is greater than the value of “rcMin.” In some embodiments, rcMin is 0.45. Using such a value for rcMin will typically result in small clusters (also referred to herein as “baby clusters”) and several singletons (illustrated in FIG. 4A).
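The VSM reclustering can be sketched as follows. Two assumptions are made and should be read as such: idf is computed from the number of documents containing each term (the conventional definition, which may differ from the column sum used earlier when counts exceed one), and baby clusters are formed by linking every pair of documents whose RC exceeds rcMin. All function names and the sample matrix are hypothetical.

```python
import math


def tfidf_weights(matrix):
    """Row-normalized tf-idf weights for a term-by-document matrix."""
    n = len(matrix)
    # Assumption: n_i is the number of documents that contain term i.
    doc_counts = [sum(1 for tf in col if tf > 0) for col in zip(*matrix)]
    idf = [math.log(n / c) if c > 0 else 0.0 for c in doc_counts]
    weights = []
    for row in matrix:
        raw = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in raw))
        weights.append([x / norm if norm > 0 else 0.0 for x in raw])
    return weights


def relevance(w_a, w_b):
    """Relevance coefficient: dot product of two weight vectors."""
    return sum(x * y for x, y in zip(w_a, w_b))


def baby_clusters(weights, rc_min=0.45):
    """Link documents whose RC exceeds rc_min; unlinked documents are singletons."""
    n = len(weights)
    label = list(range(n))  # each document starts in its own group
    for a in range(n):
        for b in range(a + 1, n):
            if relevance(weights[a], weights[b]) > rc_min and label[a] != label[b]:
                old, new = label[b], label[a]
                label = [new if x == old else x for x in label]  # merge groups
    groups = {}
    for doc, lab in enumerate(label):
        groups.setdefault(lab, []).append(doc)
    clusters = [g for g in groups.values() if len(g) > 1]
    singletons = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, singletons


# Hypothetical reduced matrix: five documents, five terms.
M2 = [
    [3, 1, 0, 0, 0],
    [2, 2, 0, 0, 0],
    [0, 0, 4, 1, 0],
    [0, 0, 1, 3, 0],
    [0, 0, 0, 0, 2],
]
clusters, singletons = baby_clusters(tfidf_weights(M2))
```

Because each weight vector has unit length, the RC of two documents is the cosine of the angle between them, so rcMin acts as a cosine similarity cutoff.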

Reassigning Singletons

At 310, an attempt is made to reassign singletons to the baby clusters. One approach to performing this assignment is as follows: A singleton will be assigned to a given cluster if (1) the RC with that cluster is greater than its RC with any other cluster; and (2) the RC is greater than a minimum threshold value “rcS.” In some embodiments, rcS=0.05 by default.

The RC between a document d_μ and a cluster c_α is calculated as:

$RC (d_{μ}, c_{α}) = \sum_{i = 1}^{m^{'}} w_{μ, i} \frac{{tf}_{α, i}}{W_{α}}, W_{α} = \sqrt{\sum_{i = 1}^{m^{'}} {tf}_{α, i}^{2}} .$
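The two reassignment conditions can be sketched as follows, assuming each singleton's weight vector and each baby cluster's term frequency vector over the reduced term space are already available, and that at least one baby cluster exists. The function names, the dict-based layout, and the sample values are hypothetical.

```python
import math


def cluster_relevance(doc_weights, cluster_tf):
    """RC between a document's weight vector and a cluster's term frequency vector."""
    norm = math.sqrt(sum(tf * tf for tf in cluster_tf))
    if norm == 0:
        return 0.0
    return sum(w * tf for w, tf in zip(doc_weights, cluster_tf)) / norm


def reassign_singletons(singleton_weights, cluster_tfs, rc_s=0.05):
    """Map each singleton id to the index of its best cluster, or None to leave it alone.

    singleton_weights: dict of singleton id -> weight vector.
    cluster_tfs:       list of cluster term frequency vectors.
    """
    assignment = {}
    for doc_id, weights in singleton_weights.items():
        scores = [cluster_relevance(weights, tf) for tf in cluster_tfs]
        best = max(range(len(scores)), key=scores.__getitem__)
        # Condition 1: strictly greater than the RC with any other cluster.
        is_unique = scores.count(scores[best]) == 1
        # Condition 2: greater than the minimum threshold rcS.
        assignment[doc_id] = best if is_unique and scores[best] > rc_s else None
    return assignment


# Hypothetical baby clusters and singletons over a four-term reduced space.
cluster_tfs = [[5, 3, 0, 0], [0, 0, 5, 4]]
singleton_weights = {
    "a": [0.8, 0.6, 0.0, 0.0],
    "b": [0.0, 0.0, 0.6, 0.8],
    "c": [0.0, 0.0, 0.0, 0.0],
}
assignment = reassign_singletons(singleton_weights, cluster_tfs)
```

A singleton mapped to `None`, such as "c" here, remains a singleton in the output.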

The reassignment of singletons is conceptually illustrated in FIGS. 4A and 4B. Specifically, as shown in FIG. 4A, baby cluster 402 and baby cluster 404 each include two documents. Seven singletons (406-418) are present. The relevance (or closeness) of a singleton to each cluster is calculated. Singletons 410 and 408, and singletons 412 and 414 are respectively close enough to baby clusters 402 and 404 that they will be reassigned to those clusters (as indicated by arrows in FIG. 4A).

As illustrated in FIG. 4B, singletons 408, 410, 412, and 414 can be reassigned in compliance with the reassigning rules described above. The resulting clusters (i.e. baby clusters combined with any reassigned singletons) are referred to herein as “renovated clusters” 450 and 452.

At 312, renovated themes are determined for the renovated clusters. In various embodiments, the processing performed at 312 is the same as the processing performed at 304.

Output

Finally, at 314 the renovated themes, renovated clusters (including any reassigned singletons), and/or any remaining singletons are provided as output 214, as applicable. The output of process 300 can be used in a variety of contexts, examples of which are described below.

System 102 can be used to help the Benjamin Snyder mentioned in conjunction with FIG. 1 to differentiate documents that are about him from other documents about other Benjamin Snyders. Documents about the user Benjamin Snyder will be clustered together and have as a theme terms such as “architect” and “California.” Documents about the other Benjamin Snyders will also be in their own respective clusters and have other themes, such as “basketball player”/“Kentucky” or “dentist”/“New York,” which do not describe the user.

As another example, a celebrity that may be commonly mentioned in news articles can use system 102 to receive news alerts about one aspect of the celebrity's life (e.g., film-making or parenting) and selectively ignore information about other aspects (e.g., a failed musical career). In this scenario, the input set could be generated as a result of a search for the celebrity on site 114, with the results being automatically clustered into the various themes.

As yet another example, a user of platform 116 can perform a search for documents residing on data source 110 pertaining to “jaguars,” and receive back results clustered based on the various disambiguations of the term.

In various embodiments, output 214 is provided to other processes, in addition to or instead of being displayed to a human user. For example, in the case of the celebrity described above, sentiment analysis can be performed by another component of platform 116 to assign scores to the distinct aspects of the celebrity's life (as determined from the clusters). In such a scenario, the celebrity might learn that articles mentioning the celebrity's film-making attempts are 70% positive, while articles mentioning the musical career are 25% positive.

EXAMPLE
“Star” Data Set

FIGS. 5-9C and the accompanying text illustrate portions of process 300 as applied to an example set of data. Specifically, suppose a search of data source 114 is performed by corpus processor 106 using the term “star” and that the first 25 results are processed by corpus processor 106 and provided to clustering system 102. The 25 documents collectively include a total of 4215 terms. Four different topics are represented in the data (each having five documents) and five documents are singletons, as illustrated in FIG. 5. When terms that occur in only a single one of the 25 documents are removed, a total of 1074 terms remain as the initial term space.

Initial Clustering

Using singular value decomposition for term-term interaction and LSI-SVD for document-document interaction, with κ=4.5, the initial clusters (and no singletons) depicted in FIG. 6 are obtained. The theme of the data set is determined to be “star,” as it is the shared, connected term with the highest frequency. When tfMin=8 and tfMax=10 are used to build a term frequency representation and generate themes, it can be seen that a total of eleven of the documents are misplaced. Some of the clusters are virtually correct (e.g., cluster 602 compared to 502). Other clusters are overinclusive (604) with no clear theme, and yet others are underinclusive (e.g., cluster 606 compared to 504). Moreover, there are no singletons. FIG. 7 illustrates the star data set, in two dimensions, after initial clustering has been performed.

Reducing Term Space and Re-Clustering

Using the cluster term frequency representations (clusterTFs) obtained in initial clustering, a new matrix and keyword list using the reduced term space are generated. At this point, the dimension of the keyword list is 93. Reclustering is performed using VSM with rcMin=0.45 and, as illustrated in FIG. 8, four baby clusters (with lines connecting their member documents) and 16 singletons are found. The baby clusters are small, but are well defined and usable as cluster seeds when reassigning singletons.

Reassigning Singletons

Finally, singletons are reassigned by calculating each singleton's RC with each cluster (as in VSM) and keeping the document as a singleton if its maximum RC is less than rcS (shown in FIG. 9A). In the final output, only three of the 25 documents are misplaced (indicated with underlining in FIG. 9B). In addition, each cluster has a clear theme, as shown in FIG. 9C.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims

1. A system, comprising: a processor configured to: cluster a data set into one or more initial clusters using a first term space comprising a plurality of keywords; determine an initial theme for each initial cluster, wherein the initial theme for each initial cluster is determined based on at least one keyword in the first term space; reduce the first term space to create a reduced term space, wherein reducing the first term space includes removing from the first term space a keyword term that is determined to be present in a first document clustered into a first initial cluster and is also determined to be present in a second document clustered into a second initial cluster, and wherein a term frequency for the keyword at least meets a predetermined threshold; recluster at least a portion of the data set into one or more baby clusters using the reduced term space, wherein after reclustering, at least one singleton is present, wherein a singleton is an element from the data set that was not assigned to any baby clusters during the reclustering; assign at least one singleton to a baby cluster to form one or more renovated clusters; determine a renovated theme for at least one of the renovated clusters; and provide as output one or more of the renovated clusters and their respective themes; and a memory coupled to the processor and configured to provide the processor with instructions.
2. The system of claim 1 wherein a first renovated theme for a first renovated cluster comprises one or more frequent keywords that are uniquely associated with the first renovated cluster.
3. The system of claim 1 wherein the processor is configured to cluster the data set into the one or more initial clusters at least in part by using singular value decomposition on a keyword-by-document matrix and wherein the processor is configured to recluster at least in part by using a vector space model.
4. The system of claim 1 wherein the data set includes entity data.
5. The system of claim 1 wherein the processor is configured to recluster at least in part by: using a first relevance criterion to generate a set of baby clusters and a set of singletons; and using a second relevance criterion to assign at least one singleton included in the set of singletons to a baby cluster included in the set of baby clusters, wherein the first and second relevance criteria are different.
6. The system of claim 1 wherein the processor is further configured to preprocess the data set at least in part by generating a keyword-by-document matrix.
7. The system of claim 1 wherein the processor is further configured to preprocess the data set at least in part by generating the first term space.
8. The system of claim 7 wherein the data set comprises a plurality of documents and wherein the processor is configured to generate the first term space at least in part by evaluating keywords present in the data set and not include in the first term space a solo keyword that occurs solely in a single document included in the plurality of documents.
9. The system of claim 7 wherein the processor is configured to generate the first term space at least in part by evaluating keywords present in the data set and not include in the first term space one or more keywords associated with numbers.
10. The system of claim 1 wherein at least some of the data included in the data set describes a first individual, wherein at least some of the data included in the data set describes a second individual, and wherein the output is provided to the first individual.
11. A method, comprising: clustering a data set into one or more initial clusters using a first term space comprising a plurality of keywords; determining an initial theme for each initial cluster, wherein the initial theme for each initial cluster is determined based on at least one keyword in the first term space; reducing the first term space to create a reduced term space, wherein reducing the first term space includes removing from the first term space a keyword that is determined to be present in a first document clustered into a first initial cluster and is also determined to be present in a second document clustered into a second initial cluster, and wherein a term frequency for the keyword at least meets a predetermined threshold; reclustering at least a portion of the data set into one or more baby clusters using the reduced term space, wherein after reclustering, at least one singleton is present, wherein a singleton is an element from the data set that was not assigned to any baby clusters during the reclustering; assigning at least one singleton to a baby cluster to form one or more renovated clusters; determining a renovated theme for at least one of the renovated clusters; and providing as output one or more of the renovated clusters with their respective themes.
12. A computer program product embodied in a non-transitory computer readable storage medium and comprising computer instructions for: clustering a data set into one or more initial clusters using a first term space comprising a plurality of keywords; determining an initial theme for each initial cluster, wherein the initial theme for each initial cluster is determined based on at least one keyword in the first term space; reducing the first term space to create a reduced term space, wherein reducing the first term space includes removing from the first term space a keyword term that is determined to be present in a first document clustered into a first initial cluster and is also determined to be present in a second document clustered into a second initial cluster, and wherein a term frequency for the keyword at least meets a predetermined threshold; reclustering at least a portion of the data set into one or more baby clusters using the reduced term space, wherein after reclustering, at least one singleton is present, wherein a singleton is an element from the data set that was not assigned to any baby clusters during the reclustering; assigning at least one singleton to a baby cluster to form one or more renovated clusters; determining a renovated theme for at least one of the renovated clusters; and providing as output one or more of the renovated clusters and their respective themes.
13. The system of claim 1 wherein the processor is further configured to: determine a theme for at least one baby cluster or singleton that was not included in any renovated clusters during the assignment; and provide as output one or more of the baby clusters or singletons for which a theme was determined, and the determined theme.
14. The method of claim 11 wherein a first renovated theme for a first renovated cluster comprises one or more frequent keywords that are uniquely associated with the first renovated cluster.
15. The method of claim 11 wherein clustering the data set into the one or more initial clusters includes using singular value decomposition on a keyword-by-document matrix and wherein reclustering includes using a vector space model.
16. The method of claim 11 wherein reclustering includes: using a first relevance criterion to generate a set of baby clusters and a set of singletons; and using a second relevance criterion to assign at least one singleton included in the set of singletons to a baby cluster included in the set of baby clusters, wherein the first and second relevance criteria are different.
17. The method of claim 11 further comprising preprocessing the data set at least in part by generating a keyword-by-document matrix.
18. The method of claim 11 further comprising preprocessing the data set at least in part by generating the first term space.
19. The method of claim 18 wherein the data set comprises a plurality of documents and wherein generating the first term space includes evaluating keyword terms present in the data set and not including in the first term space a solo term keyword that occurs solely in a single document included in the plurality of documents.
20. The method of claim 18 wherein generating the first term space includes evaluating keywords present in the data set and not including in the first term space one or more keywords associated with numbers.
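
By way of illustration only, the term-space steps recited in the claims above can be sketched in Python. This is a minimal sketch under stated assumptions and is not the claimed implementation: the function names (`build_term_space`, `reduce_term_space`, `assign_singletons`), the use of a keyword's total occurrence count as its "term frequency", the use of cosine similarity as the second relevance criterion, and the sample documents and thresholds are all hypothetical choices made for the example.

```python
from collections import Counter
import math


def build_term_space(docs):
    """First term space: skip numeric keywords and keywords found in only one document."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t for t, n in doc_freq.items() if n > 1 and not t.isdigit()}


def reduce_term_space(docs, labels, term_space, threshold):
    """Remove keywords that appear in documents of two or more initial clusters
    and whose total frequency (an assumed measure) at least meets the threshold."""
    total = Counter(t for doc in docs for t in doc if t in term_space)
    clusters_with = {}
    for doc, label in zip(docs, labels):
        for t in set(doc) & term_space:
            clusters_with.setdefault(t, set()).add(label)
    return {t for t in term_space
            if not (len(clusters_with.get(t, ())) > 1 and total[t] >= threshold)}


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def assign_singletons(docs, baby_clusters, singletons, term_space, min_sim):
    """Attach each singleton to its most similar baby cluster when the
    similarity reaches min_sim (an assumed second relevance criterion)."""
    def vec(indices):
        return Counter(t for i in indices for t in docs[i] if t in term_space)

    renovated = {name: list(members) for name, members in baby_clusters.items()}
    for s in singletons:
        scores = {name: cosine(vec([s]), vec(members))
                  for name, members in baby_clusters.items()}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] >= min_sim:
            renovated[best].append(s)
    return renovated


# Hypothetical data: each document is a list of already-extracted keywords.
docs = [
    ["jaguar", "cat", "jungle", "jaguar"],
    ["jaguar", "cat", "jungle", "prey"],
    ["jaguar", "car", "engine", "2011"],
    ["jaguar", "car", "engine", "2011"],
    ["cat", "habitat"],
]
initial_labels = [0, 0, 1, 1, 0]  # assumed output of the initial clustering

first_space = build_term_space(docs)
reduced_space = reduce_term_space(docs, initial_labels, first_space, threshold=3)
renovated = assign_singletons(
    docs,
    baby_clusters={"animal": [0, 1], "auto": [2, 3]},  # assumed reclustering result
    singletons=[4],
    term_space=reduced_space,
    min_sim=0.3,
)
```

In this sketch the ambiguous keyword "jaguar" occurs in documents of both initial clusters and is frequent, so it is dropped from the reduced term space, and the remaining keywords separate the two senses. The clustering and reclustering steps themselves (for example, singular value decomposition on a keyword-by-document matrix) are not shown; their results are supplied as assumed inputs.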
