Managing large numbers of electronic documents in a data storage system can present several challenges. A typical data storage system may store many thousands of documents, many of which may be related in some way. For example, in some cases, a document may serve as a template that various people within the enterprise adapt to fit their needs. In other cases, a document may be updated over time as new information is acquired or the current state of knowledge about a subject evolves. In some cases, several documents may relate to a common subject and may borrow text from common files. It may sometimes be useful to be able to trace the evolution of a stored document. For example, it may be useful to identify a more recent, or “fresh,” document that represents more up-to-date information regarding a particular concept. However, it will often be the case that the documents in the data storage system have been duplicated and edited over time without keeping any record of the version history of the documents.
Certain exemplary embodiments are described in the following detailed description and in reference to the drawings.
As used herein, the term “exemplary” merely denotes an example that may be useful for clarification of the present invention. The examples are not intended to limit the scope, as other techniques may be used while remaining within the scope of the present claims. Exemplary embodiments of the present invention provide techniques for identifying an electronic file, or “document,” that provides more recent, or even the most recent, information regarding a particular subject matter. The identified document may be referred to herein as a “fresh” document in a collection of documents. To identify the fresh document in the collection, a user may select a query document from among a plurality of documents in a document set and initiate a freshness query to identify derivative documents in the document set based on the textual similarity of the documents.
Derivative documents may be versions of the query document, documents that include subject matter from the query document, or documents that discuss the same concepts as the query document. Furthermore, derivative documents may be identified even if a record of the evolution of the documents has not been maintained. The fresh document may be one of the more recent derivative documents, for example, the most recent derivative document. In some exemplary embodiments, derivative documents may be identified using a data mining technique known as “clustering.” Furthermore, to reduce the processing resources used to identify the derivative documents, a two-stage clustering algorithm may be used. As used herein, the term “automatically” is used to denote an automated process performed, for example, by a machine such as the client system 102. It will be appreciated that various processing steps may be performed automatically even if not specifically referred to herein as such.
The client system 102 can have other units operatively coupled to the processor 112 through the bus 113. These units can include tangible, machine-readable storage media, such as a storage system 122 for the long-term storage of operating programs and data, including the programs and data used in exemplary embodiments of the present techniques. The storage system 122 may include, for example, a hard drive, an array of hard drives, an optical drive, an array of optical drives, a flash drive, or any other tangible storage device. Further, the client system 102 can have one or more other types of tangible, machine-readable storage media, such as a memory 124, for example, which may comprise read-only memory (ROM) and/or random access memory (RAM). In exemplary embodiments, the client system 102 will generally include a network interface adapter 126, for connecting the client system 102 to a network 128, such as a local area network (LAN), a wide-area network (WAN), or another network configuration. The network 128 can include routers, switches, modems, or any other kind of interface device used for interconnection.
Through the network interface adapter 126, the client system 102 can connect to a server 130. The server 130 may enable the client system 102 to connect to the Internet 132. For example, the client system 102 can access a search engine 134 connected to the Internet 132. In exemplary embodiments of the present invention, the search engine 134 can include generic search engines, such as GOOGLE™, YAHOO®, BING™, and the like. In other embodiments, the search engine 134 may be a specialized search engine that enables the client system 102 to access a specific database of documents provided by a specific on-line entity. For example, the search engine 134 may provide access to documents provided by a professional organization, governmental body, business entity, public library, and the like.
The server 130 can also have a storage array 136 for storing enterprise data. The enterprise data may provide a document resource to the client system 102 by including a plurality of stored documents, for example, Adobe® Portable Document Format (PDF) documents, spreadsheets, presentation documents, word processing documents, database files, MICROSOFT® Office documents, Web pages, Hypertext Markup Language (HTML) documents, eXtensible Markup Language (XML) documents, plain text documents, electronic mail files, optical character recognition (OCR) transcriptions of scanned physical documents, and the like. Furthermore, the documents may be structured or unstructured. As used herein, a set of “structured” documents refers to documents that have been related to one another by a tracking system that records the evolution of the documents from prior versions. However, in embodiments in which the documents are structured, the recorded relationship between documents may be ignored.
Those of ordinary skill in the art will appreciate that business networks can be far more complex and can include numerous servers 130, client systems 102, storage arrays 136, and other storage devices, among other units. Moreover, the business network discussed above should not be considered limiting as any number of other configurations may be used. Any system that allows the client system 102 to access a document resource, such as the storage array 136 or an external document storage, among others, should be considered to be within the scope of the present techniques.
In exemplary embodiments of the present invention, the memory 124 of the client system 102 may hold a document analysis tool 138 for analyzing electronic documents, for example, documents stored on the storage system 122 or storage array 136, documents available through the search engine site 134, or any other document resource accessible to the client system 102. Through the document analysis tool 138, the user may select a document, referred to herein as a “query document,” and initiate a freshness query. Pursuant to the freshness query, the document analysis tool identifies documents that are derivatives of the query document. As used herein, a derivative document is a document that is textually similar to the query document, for example, a revision of the query document, a document that incorporates textual subject matter from the query document, and the like. The most recent document among the derivative documents may be identified as the fresh document with respect to the query document.
As discussed further below with regard to
In some embodiments, the document set may include files that are co-located with the query document, for example, in the same file directory, disk drive, disk drive partition, and the like. In some embodiments, the user may define the document set, for example, by selecting a particular file directory or disk drive. Furthermore, the user may define the document set as including files with a common file characteristic, for example, the same file type, the same file extension, a specified string of characters in the file name, files created after a specified date, and the like. In some embodiments, the document set may be defined automatically based on the location of the query document, the type of query document, and the like. For example, upon selecting a PDF document in a particular directory, the document set may be automatically defined as including all PDF documents in the same directory.
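By way of illustration, the following is a minimal Python sketch of defining the document set automatically from the query document's location and file type. The function name and the use of the file extension as the shared characteristic are illustrative assumptions and are not part of the description above.

```python
import os


def collect_document_set(query_path, extension=None):
    """Gather the document set from the directory containing the query document.

    By default, the set is every file in the query document's directory that
    shares the query document's file extension (e.g., all PDF files alongside
    a PDF query document), including the query document itself.
    """
    directory = os.path.dirname(os.path.abspath(query_path))
    if extension is None:
        extension = os.path.splitext(query_path)[1]
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if name.lower().endswith(extension.lower())
    ]
```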
At block 204, a feature vector may be generated for each document in the document set, including the query document. The feature vector may be used to compare the textual content of the documents and identify similarities or dissimilarities between documents. The feature vector may be generated by scanning the document and identifying the individual terms or phrases, referred to herein as “tokens,” occurring in the document. Each time a token is identified in the document, an element in the feature vector corresponding to the token may be incremented. Each element in the feature vector may be referred to herein as a “token frequency.” Each feature vector may include a token frequency element for each token represented in the document set. The feature vector of a document may be represented by the following formula:
V_D^{tf} := (tf_1, tf_2, \ldots, tf_T)
In the above formula, V_D^{tf} is the feature vector of the document D, tf_t is the frequency with which the tth token in the document set occurs in the document, and T equals the total number of tokens represented in the document set.
In some exemplary embodiments, each token frequency of the feature vector may be multiplied by a global weighting factor that corresponds with a characteristic of the entire document set. The same global weighting factor may be applied to the feature vector of each document in the document set. In some embodiments, the global weighting factor may be an inverse document frequency (idf), which is the inverse of the fraction of documents in the document set that contain a given token. In such embodiments, the resulting weighted feature vector may be represented by the following formula:

V_D^{tf\text{-}idf} := \left( tf_1 \cdot \frac{|U|}{df_1},\ tf_2 \cdot \frac{|U|}{df_2},\ \ldots,\ tf_T \cdot \frac{|U|}{df_T} \right)

In the above formula, V_D^{tf-idf} is the feature vector multiplied by the inverse document frequency, |U| equals the number of documents in the document set, and df_t is the number of documents in the document set that contain the tth token. Additionally, the weighted feature vector may be normalized to unit magnitude, so that each of the weighted token frequencies has a value between 0 and 1.
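The following Python sketch builds the weighted, normalized feature vectors described above for a document set. It is a minimal sketch rather than a definitive implementation: the tokenizer, the use of sparse dictionaries in place of length-T vectors, and the function names are illustrative assumptions, and the idf weight is taken literally as |U|/df_t as defined above (many implementations use a logarithmic variant instead).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a document's text into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_feature_vectors(texts):
    """Build tf-idf weighted, unit-normalized feature vectors.

    texts: the textual content of every document in the document set,
    including the query document. Returns one sparse vector (a dict mapping
    token -> weighted frequency) per document.
    """
    token_counts = [Counter(tokenize(text)) for text in texts]

    # Document frequency df_t: the number of documents containing each token.
    df = Counter()
    for counts in token_counts:
        df.update(counts.keys())

    num_docs = len(texts)  # |U|, the number of documents in the document set
    vectors = []
    for counts in token_counts:
        # Weight each token frequency by the inverse document frequency |U| / df_t.
        weighted = {tok: tf * (num_docs / df[tok]) for tok, tf in counts.items()}
        # Normalize to unit magnitude so every element falls between 0 and 1.
        norm = math.sqrt(sum(v * v for v in weighted.values())) or 1.0
        vectors.append({tok: v / norm for tok, v in weighted.items()})
    return vectors
```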
At block 206, the documents in the document set may be grouped into coarse clusters based on a degree of textual similarity between the documents. To determine the degree of textual similarity between the documents, a similarity value may be computed for each pair of feature vectors generated for the documents in the document set. To group the documents into coarse clusters, the feature vectors corresponding to the documents may be processed by a clustering algorithm that segments the documents in the document set into a plurality of coarse clusters based on the similarity value. In some exemplary embodiments, the similarity value may be a Cosine similarity computed according to the following formula:

s(D_i, D_j) = \frac{V_{D_i} \cdot V_{D_j}}{\|V_{D_i}\| \, \|V_{D_j}\|}

In the above formula, s(D_i, D_j) represents the similarity value for the documents D_i and D_j, and V_{D_i} and V_{D_j} represent the feature vectors generated for the documents D_i and D_j, respectively. Because the feature vectors are normalized to unit magnitude, the Cosine similarity equals the dot product of the two feature vectors.
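Continuing the sketch above, the Cosine similarity between two of the sparse feature vectors can be computed as follows. Because the vectors were normalized to unit magnitude, the denominator of the formula is one and the dot product alone suffices; the function name is an illustrative assumption.

```python
def cosine_similarity(vec_i, vec_j):
    """Cosine similarity s(Di, Dj) between two sparse feature vectors.

    vec_i and vec_j are dicts mapping token -> weighted frequency, as produced
    by build_feature_vectors above. Because those vectors are already
    unit-normalized, the dot product alone gives the Cosine similarity.
    """
    # Iterate over the smaller vector; tokens absent from either count as zero.
    if len(vec_i) > len(vec_j):
        vec_i, vec_j = vec_j, vec_i
    return sum(weight * vec_j.get(token, 0.0) for token, weight in vec_i.items())
```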
Any suitable clustering algorithm may be used to group the selected documents into coarse clusters, for example, a k-means algorithm, a repeated-bisection algorithm, a spectral clustering algorithm, an agglomerative clustering algorithm, and the like. These techniques may be considered either additive or subtractive. The k-means algorithm is an example of an additive algorithm, while the repeated-bisection algorithm is an example of a subtractive algorithm.
In a k-means algorithm, a number, k, of the documents may be randomly selected by the clustering algorithm. Each of the k documents may be used as a seed for creating a cluster and serve as a representative document, or “cluster head,” of the cluster until a new document is added to the cluster. Each of the remaining documents may be sequentially analyzed and added to one of the clusters based on a similarity between the document and the cluster head. Each time a new document is added to a cluster, the cluster head may be updated by averaging the feature vector of the cluster head with the feature vector of the newly added document.
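A simplified, single-pass sketch of this additive, k-means-style grouping is shown below; it reuses the feature vectors and the cosine_similarity function from the earlier sketches. Production k-means implementations typically iterate the assignments until convergence, which is omitted here for brevity, and the averaging of the cluster head with each newly added document follows the description above.

```python
import random


def kmeans_style_clusters(vectors, k, seed=0):
    """Single-pass, k-means-style grouping using running cluster heads.

    vectors: the per-document feature vectors for the document set.
    k: the number of clusters to create. Returns a list of clusters, each
    cluster being a list of document indices.
    """
    rng = random.Random(seed)
    indices = list(range(len(vectors)))
    rng.shuffle(indices)

    seeds, rest = indices[:k], indices[k:]
    clusters = [[i] for i in seeds]            # document indices per cluster
    heads = [dict(vectors[i]) for i in seeds]  # representative "cluster heads"

    for i in rest:
        # Assign the document to the cluster whose head it most resembles.
        best = max(range(len(heads)),
                   key=lambda c: cosine_similarity(vectors[i], heads[c]))
        clusters[best].append(i)
        # Update the cluster head by averaging it with the new document's vector.
        head = heads[best]
        for token in set(head) | set(vectors[i]):
            head[token] = (head.get(token, 0.0) + vectors[i].get(token, 0.0)) / 2.0
    return clusters
```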
In a repeated-bisection algorithm, the documents may be initially divided into two clusters based on dissimilarities between the documents, as determined by the similarity value. Each of the resulting clusters may be further divided into two clusters based on dissimilarities between the documents in each cluster. The process may be repeated until a final set of clusters is generated.
Furthermore, to generate the coarse clusters, a coarse granularity, N, may be determined. The coarse granularity, N, represents an average cluster size, in other words, an average number of documents that may be grouped into the same coarse cluster by the clustering algorithm. The coarse granularity may be determined based on the number of documents in the document set and the expected processing time that may be used to generate the fine clusters during the second clustering stage, which is discussed below in reference to block 210. For example, if the document set includes 15,000 documents, the coarse granularity, N, may be set to a value of 1000. In this hypothetical example, the clustering algorithm will generate 15 coarse clusters, and each coarse cluster may include an average of approximately 1000 documents. In some embodiments, the coarse granularity may be specified by a user. In other embodiments, the coarse granularity may be automatically determined by the clustering algorithm as a fraction of the number of documents in the document set and depending on the processing resources available to the client system 102.
At block 208, a target coarse cluster may be identified. The target coarse cluster is the coarse cluster generated in block 206 that includes the query document. In some embodiments, the size of the target coarse cluster may be evaluated to determine whether the size of the target coarse cluster is approximately equal to the coarse granularity, N. Depending on the available processing resources of the client system 102, a target coarse cluster that is too large may result in a long processing time during the generation of the fine clusters at block 210. Thus, if the target coarse cluster includes a number of documents that is approximately three to five times greater than the specified coarse granularity, N, then block 206 may be repeated with a smaller granularity to reduce the size of the target coarse cluster. Blocks 206 and 208 may be iterated until the size of the target coarse cluster is approximately equal to or smaller than the originally specified coarse granularity, N. After obtaining the target coarse cluster and verifying the size of the target coarse cluster, the process flow may advance to block 210.
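The following sketch illustrates identifying the cluster that contains the query document and repeating the clustering with a smaller granularity when that cluster is oversized. The function names, the choice of three as the oversize factor (from the "three to five times" range above), and the halving of the granularity on each retry are illustrative assumptions; the same helper can be reused for the fine clusters at blocks 210 and 212.

```python
import math


def cluster_containing(clusters, query_index):
    """Return the cluster (a list of document indices) that includes the query document."""
    for cluster in clusters:
        if query_index in cluster:
            return cluster
    raise ValueError("query document not found in any cluster")


def target_cluster(vectors, query_index, granularity, oversize_factor=3):
    """Cluster the documents and return an acceptably sized target cluster.

    granularity is the desired average cluster size (N for the coarse stage,
    n for the fine stage). If the cluster containing the query document is
    more than oversize_factor times the granularity, the clustering is
    repeated with a smaller granularity.
    """
    while True:
        k = max(1, math.ceil(len(vectors) / granularity))
        clusters = kmeans_style_clusters(vectors, k)
        target = cluster_containing(clusters, query_index)
        if len(target) <= oversize_factor * granularity or granularity <= 1:
            return target
        granularity = max(1, granularity // 2)  # retry at a finer granularity
```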
At block 210, the documents included in the target coarse cluster may be grouped into fine clusters based on the degree of textual similarity between the documents. The generation of the fine clusters may be accomplished using the same techniques described above in relation to block 206, using a fine granularity, n. The fine granularity, n, represents an average size of the fine clusters, in other words, an average number of documents that may be grouped into each fine cluster by the clustering algorithm. The fine granularity, n, may be specified based on an estimated number of documents that are expected to be derivatives of the query document. For example, the fine granularity, n, may be specified based on an estimated number of revisions of the query document or an estimated number of documents that incorporate subject matter from the query document. For example, if the query document is a research paper, it may be estimated that the number of derivative documents is less than 50. Thus, in this hypothetical example, the fine granularity, n, may be specified as 50. In another hypothetical example, the query document may be a financial statement. In this case, a greater number of derivative documents may be expected, for example, 100 to 150. In other embodiments, the fine granularity may be as small as five to ten documents. In some embodiments, the fine granularity may be specified by a user. In other embodiments, the fine granularity may be automatically determined by the clustering algorithm using a set of heuristic rules based on document type.
The resulting fine clusters may include documents that have a high degree of similarity with each other. The high degree of similarity of the documents in each fine cluster may indicate a high likelihood that the newer documents in the cluster were derived from the older documents. After generating the fine clusters, the process flow may advance to block 212.
At block 212, a target fine cluster may be identified. The target fine cluster is the fine cluster generated in block 210 that includes the query document. Thus, the target fine cluster may include most or all of the documents that are similar enough to the query document to be considered derivative documents. In some embodiments, the size of the target fine cluster may be evaluated to determine whether the size of the target fine cluster is approximately equal to the fine granularity, n. If the target fine cluster is too large, this may indicate that a number of documents in the fine cluster are not derivative documents. Thus, if the target fine cluster includes a number of documents that is approximately three to five times greater than the specified fine granularity, n, block 210 may be repeated with a smaller granularity to reduce the size of the target fine cluster. Blocks 210 and 212 may be iterated until the size of the target fine cluster is approximately equal to or smaller than the originally specified fine granularity, n. After obtaining the target fine cluster and verifying the size of the target fine cluster, the process flow may advance to block 214.
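Putting the two stages together, a coarse pass over the whole document set is followed by a fine pass restricted to the target coarse cluster, as sketched below. The default granularities of 1000 and 50 mirror the hypothetical examples above and are not fixed values; the function name is an illustrative assumption.

```python
def two_stage_clustering(vectors, query_index, coarse_n=1000, fine_n=50):
    """Two-stage clustering: a coarse pass over the full document set, then a
    fine pass over the target coarse cluster only.

    Returns the indices (into the original document set) of the target fine
    cluster, i.e. the candidate derivative documents, including the query.
    """
    # Stage one: coarse clusters over the full document set (blocks 206-208).
    coarse = target_cluster(vectors, query_index, coarse_n)

    # Stage two: fine clusters over the target coarse cluster only (blocks 210-212).
    sub_vectors = [vectors[i] for i in coarse]
    sub_query = coarse.index(query_index)
    fine = target_cluster(sub_vectors, sub_query, fine_n)

    # Map the local indices back to indices into the original document set.
    return [coarse[i] for i in fine]
```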
At block 214, the documents in the target fine cluster may be ordered according to time. The document order may be used to identify derivative documents that were created or modified at a later time than the query document. The time associated with a document may be determined from date and time information included in metadata associated with the document. For example, the time may include a date and time that the document was created, last modified, and the like. In some embodiments, documents that precede the query document may be ignored. The most recent document in the target fine cluster may be identified by the document analysis tool as the fresh document. In some exemplary embodiments, the documents in the target fine cluster may be further analyzed to reduce the number of documents in the target fine cluster that are considered to be derivative documents. For example, a similarity metric, such as the Cosine similarity discussed above, may be generated for each pair of documents in the target fine cluster, based on the feature vectors associated with each document. The similarity metric may be used to further limit the number of documents that are considered to be derivative documents. For example, a specified number or percentage of the most similar documents may be identified as derivative documents, while the remaining documents may be ignored.
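The sketch below orders the target fine cluster by time and returns the most recent document as the fresh document. It uses the file system's last-modified timestamp in place of document metadata, ignores documents that precede the query document as described above, and its function name and parameters are illustrative assumptions.

```python
import os


def fresh_document(paths, target_fine_cluster, query_index):
    """Return the path of the most recent ("fresh") document in the target fine cluster.

    paths: file paths for the full document set; target_fine_cluster: indices
    into paths; query_index: index of the query document.
    """
    query_time = os.path.getmtime(paths[query_index])
    # Ignore documents that precede the query document.
    candidates = [i for i in target_fine_cluster
                  if os.path.getmtime(paths[i]) >= query_time]
    # Order the remaining documents by their last-modified time.
    ordered = sorted(candidates, key=lambda i: os.path.getmtime(paths[i]))
    return paths[ordered[-1]] if ordered else paths[query_index]
```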
In some exemplary embodiments, the process described in blocks 202 to 214 may be repeated with one of the documents in the target fine cluster used as a new query document. Upon selecting the new query document and initiating a new freshness query, the documents of the target coarse cluster previously identified at block 208 may be re-grouped into new fine clusters using the new query document. In this way, the new target fine cluster may include a new sub-set of documents, and the fresh document may be identified as one of the documents in the new target fine cluster. Furthermore, to increase the likelihood that the new target fine cluster will include documents highly related to the new query document, the feature vectors for each document in the target coarse cluster may be re-computed. For example, the token frequencies of each feature vector may be weighted more heavily for those tokens that occur frequently in the new query document. In this way, the clustering algorithm may be more likely to treat the new query document as a cluster head, which may result in a new grouping of documents around the new query document. In some embodiments, the document used as the new query document may be selected by the user. In other embodiments, the process described in blocks 202 to 214 may be iteratively repeated for each one of the documents in the target fine cluster to generate a chain of related documents. For example, one or more documents in the target fine cluster may be identified as corresponding with the same fresh document, which may indicate that the documents were merged into a single document.
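One possible way to re-compute the feature vectors so that tokens prominent in the new query document count more heavily is sketched below. The specific scaling, 1 + boost × query weight, and the boost parameter are illustrative assumptions rather than values from the description above; the re-weighted vectors could also be re-normalized before the new fine clustering.

```python
def reweight_toward_query(vectors, new_query_vector, boost=2.0):
    """Scale up the weight of tokens that occur frequently in the new query document.

    vectors: the feature vectors of the documents in the target coarse cluster.
    new_query_vector: the feature vector of the new query document.
    Returns re-weighted copies of the vectors; tokens absent from the new
    query document keep their original weight.
    """
    reweighted = []
    for vec in vectors:
        reweighted.append({
            token: weight * (1.0 + boost * new_query_vector.get(token, 0.0))
            for token, weight in vec.items()
        })
    return reweighted
```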
At block 216, the document analysis tool may generate a query response that includes the fresh document. The query response may identify the fresh document as well as other documents in the target fine cluster. The query response may be used to generate a visual display viewable by the user, for example, a graphical user interface (GUI) generated on the display 114.
The visual display may also enable the user to select a specific one of the derivative documents to, for example, initiate another freshness query using the selected document, view the contents of the selected document in a document viewer, and the like. In some embodiments, the visual display may represent the derivative documents with file icons that are spatially organized based on the identified relationships between the documents. For example, arrows between the file icons may be used to identify document evolution, document mergers, and the like.
As shown in
Although shown as contiguous blocks, the modules can be stored in any order or configuration. For example, if the tangible, machine-readable medium 300 is a hard drive, the software components can be stored in non-contiguous, or even overlapping, sectors. Additionally, one or more modules may be combined in any suitable manner depending on design considerations of a particular implementation. Furthermore, modules may be implemented in hardware, software, or firmware.