The present invention relates generally to the comparison of documents, and in particular, to the comparison of documents for identifying documents which are similar to a source document.
Document comparison and identification is commonly used for electronic discovery purposes to identify documents relevant to a particular issue, and to trace the movements of these documents. Due to the often large data sets involved, it is impossible to manually compare and identify each of the documents of the data set. Automated data culling techniques have therefore been developed to create a smaller sub-set of the large data set of documents, which sub-set can then be manually reviewed. Among the known data culling techniques are deduplication, near-deduplication, keyword searching, and file extension searching.
Deduplication identifies and groups files that are identical to each other. Deduplication techniques involve the use of hashing to create hash values for each document in the data set. The mathematical algorithms used in hashing ensure, with a large probability, that each hash value will be unique to a document. Two or more documents having the same hash value can hence be determined to be identical copies of each other. Deduplication techniques may, for example, employ MD5 hashes. An MD5 hash is calculated for each document in a data set, and the MD5 hashes of each document are compared to locate identical documents.
Near-deduplication attempts to identify similar documents by searching the contents of documents for documents containing similar words, and/or similar placement of words.
Keyword searching involves searching the contents of documents for the existence or absence of predetermined keywords. Advance keyword searching techniques allow for the collocation of words, wildcards, and the like, to be considered.
File extension searching involves searching for files of a certain extension, assuming that the extensions are representative of the file format.
The above methods suffer from a number of deficiencies however. Deduplication, for example, only locates identical documents. Documents of the same literary content but saved in different formats, for example, would not be found by a deduplication method. Different versions of a document, such as draft versions, revisions, final versions, and so forth, would also not be found by a deduplication search.
Near-deduplication, on the other hand, whilst able to some extent to identify documents of similar content, is limited to text documents. Non-text documents such as MPEG or Audio files, TIFF and non-searchable PDF versions of text files hence cannot be identified.
Keyword searching tends to return a large number of irrelevant documents, or too few documents if the keywords used are too restrictive. Keyword searching further determines the similarity of documents based predominantly on the number of keywords matched, which is not always the best indication of similarity, particularly if searching documents in the same subject area, industry, from the same organisation, and the like. The effectiveness of keyword searching is also very much dependent on the skill of the searcher.
File extension searching returns files of the same extension, the number of which is often still prohibitively large. Furthermore, file extension searching is based on the unreliable assumption that a file's extension is indicative of the format of the file and the general content of the file (e.g. text, graphic, video, etc). Moreover, some file systems do not require files to have extensions.
None of the above techniques offer a sufficient measure of confidence to a user that substantially all relevant documents have been found, without at the same time returning a large number of documents that each have to be manually reviewed. A technique that could identify not just identical documents, but also similar and relevant documents such as various revisions of the same document, different formats of the same document, and the like, would be particularly advantageous.
According to an aspect of the present invention, there is provided a document comparison and identification method. The method comprises the steps of: identifying, in a source document, words of a predetermined number of characters or greater; generating a list containing the identified words, and excluding identified words occurring with a predetermined frequency or greater throughout a set of documents to be searched; searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; for each of the plurality of documents, determining how many identified words from the list occur in the document; and calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
According to another aspect of the present invention, there is provided a document comparison and identification method that comprises the steps of: performing a first search to identify documents identical to a source document; performing a second search to identify documents having an identical or a similar document name to the source document; performing a third search to identify documents of similar content to the source document; determining a ranking for the results of each of the first, second, and third searches; and presenting results of the first, second, and third searches in accordance with the determined ranking.
According to another aspect of the present invention, there is provided a document comparison and identification apparatus comprising: a memory unit for storing data and program instructions; and a processing unit coupled to the memory unit. The processing unit is programmed to: identify, in a source document, words of a predetermined number of characters or greater; generate a list containing the identified words, and exclude identified words from the list that occur with a predetermined frequency or greater in a set of documents to be searched; search each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; determine, for each of the plurality of documents, how many identified words from the list occur in the document; and calculate a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches
According to another aspect of the present invention, there is provided a document comparison and identification apparatus, comprising: a memory unit for storing data and program instructions; and a processing unit coupled to the memory unit. The processing unit is programmed to: perform a first search to identify documents identical to a source document; perform a second search to identify documents having an identical or a similar document name to the source document; perform a third search to identify documents of similar content to the source document; determine a ranking for the results of each of the first, second, and third searches; and present results of the first, second, and third searches in accordance with the determined ranking.
According to another aspect of the present invention, there is provided a computer program product comprising a computer readable medium comprising a computer program recorded therein for document comparison and identification. The computer program product comprises: computer program code means for identifying, in a source document, words of a predetermined number of characters or greater; computer program code means for generating a list containing the identified words, and excluding identified words from the list that occur with a predetermined frequency or greater in a set of documents to be searched; computer program code means for searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; computer program code means for, for each of the plurality of documents, determining how many identified words from the list occur in the document; and computer program code means for calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
According to another aspect of the present invention, there is provided a computer program product comprising a computer readable medium comprising a computer program recorded therein for document comparison and identification. The computer program product comprises: computer program code means for performing a first search to identify documents identical to a source document; computer program code means for performing a second search to identify documents having an identical or a similar document name to the source document; computer program code means for performing a third search to identify documents of similar content to the source document; computer program code means for determining a ranking for the results of each of the first, second, and third searches; and presenting results of the first, second, and third searches in accordance with the determined ranking.
Aspects the present disclosure are described with reference to the following drawings:
Disclosed herein is a document comparison method and apparatus for identifying documents matching search criteria, and ranking documents based on their similarity to the search criteria. The search criteria may, for example, comprise one or more of a user inputted item of information such as a keyword, date, name, and the like, or may be another document. As used herein, the term document refers to computer readable files in general and include, for example, text documents, graphic files, video files, emails, music files, binary files in general, and the like.
According to an embodiment in the present disclosure, one or more documents are provided as an input. Typically, this input is an archive file or set containing a plurality of documents therein. Examples of such archive files include, but are not limited to, Microsoft™ Outlook PST files, Microsoft™ Exchange Server EDB files, Lotus™ Notes NSF files, and the like. The archive file is processed, and a database or other index comprising an organized representation of the whole or partial contents of the archive file, characteristics and other relevant information of the contents of the archive file, and the like, is created. The database is used to effect comparison and identification of the documents contained in the archive file, and searching of the contents of the archive file in general.
A first aspect of the present disclosure is described with reference to
At step S110, a first search performs an identicality matching search on the archive file or database for documents matching the source document. This search utilizes techniques such as MD5 hashing techniques to identify documents that are bit wise identical to the source document. Documents that may have different file names, but are otherwise identical in content, will be identified as identical by the identicality matching search.
At step S120, a second search is performed on the archive file or database to identify documents that have the same or a similar document name as that of the source document.
At step S130, documents identified by either or both of the searches performed in steps S110 and S120 are considered to be similar to the source document and are assigned a similarity ranking of ‘High’.
At step S140, a third search function performs a similarity search to locate documents in the archive file which are similar in content to the source document. The similarity search is based on the contents of the documents in the archive file. The similarity search is described in greater detail hereinafter with reference to
Referring to
At step S220, of the identified words having 6 or more characters, words that appear with a predetermined frequency or greater throughout the archive file are disregarded/excluded. The remaining list of identified words forms a Relevant Word List. The total number of words in the Relevant Word List is denoted by T. The predetermined frequency may be determined according to a tf-idf (term frequency—inverse document frequency) weight, for example.
At step S230, the relevant words contained in the Relevant Word List are searched for in each document in the archive file. The number of relevant words appearing in a particular document is denoted by Y.
Whether a document is similar, and/or how similar the document is, is determined at step S240 in accordance with a number of matching relevant words Y found in the document, a minimum required number of matches M, a similarity ranking X, and a constant coefficient N. The minimum required number of matches M for a given similarity X is determined as follows:
where T≦N: M=T
For a source document M=Floor (((T−N)*X)+N)
where T>N:
where:
The inventors have found that a value of N=5 is preferable.
The document has:
Steps S230 to S240 are repeated, at step S250, until all documents in the archive file have been considered or processed.
It should be noted that for an archive file for which a database or index representative of the archive file has been created, the iteration of steps S230 to S250 may be replaced by a single step of querying the database/index for documents containing M relevant words. In this case, steps S230 to S250 of
When all the documents in the archive file have been considered, at step S250, processing returns to step S150 of
Returning to
Each identified document is denoted on the event map by an indicia 330, for example a dot or rectangle. Preferably, the indicia 330 are colour coded to facilitate interpretation of the event map. For example, identified documents having an exact MD5 match and file name match may be displayed by red indicia, while identified documents having an exact MD5 match but with a different file name may be displayed by pink indicia. A further colour may be used to identify documents of the same content but of different format, while yet a further set of colours may be used to identify documents of a certain similarity (e.g., blue for high similarity, purple for medium similarity, etc.).
The event map 300 is preferably interactive such that a user may perform a drill down action on the event map 300 to obtain more detailed information. For example, an indicia may be double clicked (e.g., using a computer pointing device) to display the document represented by the indicia, the document's chain of custody, attachments, metadata, and the like. Additionally, a user may also click an indicia of a certain colour to perform a process on all indicia of the same colour, such as to list all documents of the same similarity, export such documents, and the like.
A selection box A140 may be generated (e.g., by a user) on the event map 300 to obtain detailed information on the documents represented by the indicia within the selection box A140, or to perform processes thereon. Such processes may, for example, include an export process, review process, listing, and the like.
The event map 300 is not limited to a 2-dimensional graphical representation as shown in
An embodiment of the present invention provides a document comparison and identification method comprising the steps of: identifying, in a source document, words of a predetermined number of characters or greater; generating a list containing the identified words, and excluding identified words from the list that occur with a predetermined frequency or greater in a set of documents to be searched; searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; for each of the plurality of documents, determining how many identified words from the list occur in the document; and calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
The predetermined number of characters may be 6. The predetermined minimum required number of matches may be calculated according to the formula:
M=Floor (((T−N)*X)+N)
A document may be determined to have high similarity with the source document if the number of identified words in the list occurring in the document is greater than, or equal to, the predetermined minimum required number of matches when X=0.9. Furthermore, a document may be determined to have medium similarity with the source document if the number of identified words in the list occurring in the document is greater than, or equal to, the predetermined minimum required number of matches when X=0.7. Furthermore, a document may be determined to have low similarity with the source document if the number of identified words in the list occurring in the document is greater than, or equal to, the predetermined minimum required number of matches when X=0.5. Furthermore, a document may be determined not to be similar to the source document if the number of identified words in the list occurring in the document is less than the predetermined minimum required number of matches when X=0.5. The predetermined minimum required number of matches may be determined to be equal to the number of identified words in the list.
An embodiment of the present invention provides a document comparison and identification method comprising the steps of: performing a first search to identify documents identical to a source document; performing a second search to identify documents having an identical or a similar document name to the source document; performing a third search to identify documents of similar content to the source document; determining a ranking for the results of each of the first, second, and third searches; and presenting results of the first, second, and third searches in accordance with the determined ranking. The documents identified by the first and second searches may be deemed to have a high similarity ranking. The third search may be performed in accordance with a document comparison and identification method described hereinbefore and specifically with the embodiment of the document comparison and identification method described immediately hereinbefore.
The document comparison methods described hereinbefore may be implemented using a computer system, such as the computer system described hereinafter with reference to
As shown in
The computer module D110 typically includes at least one processor unit D115, and a memory unit D190, for example formed from semiconductor random access memory (RAM) and read only memory (ROM). The module D110 also includes a number of input/output (I/O) interfaces including an audio-video interface D200 that couples to the video display D150, an I/O interface D260 for the keyboard D120 and mouse D130, and an interface D210 for the external modem D160 and printer D140. The computer module D110 may also have a local network interface D240 which, via a connection D330, permits coupling of the computer system D100 to a local computer network D320. As also illustrated, the local network D320 may also couple to the wide network D170 via a connection D340. The interface D240 may be formed by an Ethernet™ circuit card, a wireless Bluetooth™ or an IEEE 802.11 wireless arrangement, and the like.
Storage devices D220 are provided and typically include a hard disk drive D230 and an optical disk drive D250.
The steps of the methods described hereinbefore may be implemented as software, such as one or more application programs executable within the computer system D100. In particular, the steps of the methods described hereinbefore with reference to
In executing the software instructing the computer system D100 to perform one or more of the steps illustrated in
According to one or more aspects of the present disclosure, a number of different search methods are employed in combination. In employing a number of different search methods in combination, a more comprehensive search may be performed. For example, similar documents may be identified by having identical or similar document names, or identical MD5 hash values. This is particularly effective when searching non-text documents. When searching text documents, the hereinbefore described similarity search may also be employed to identify similar documents. In contrast, searches employing only near-deduplication or keyword searching, for example, are able to search only text documents, while searches employing only deduplication searches such as those involving hashing techniques are unable to identify documents of similar literary content.
Moreover, conventional search techniques such a deduplication and near-deduplication are generally utilized to exclude documents. In contrast, the document comparison methods of the present disclosure may be used to identify documents similar to a given relevant document.
Additionally, by ranking identified documents, for example with High, Medium, and Low rankings, confidence that substantially all relevant documents have been located/identified in a search can be instilled in a user. Further, by graphically representing the similarity of documents, relevant documents can be easily identified and selected for review.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
Number | Date | Country | Kind |
---|---|---|---|
2008900543 | Feb 2008 | AU | national |
The present application claims priority from U.S. Provisional Patent Application No. 61/063,757 filed on 5 Feb. 2008 and Australian Provisional Patent Application No. 2008900543 filed on 5 Feb. 2008. The entire disclosure of U.S. Provisional Patent Application No. 61/063,757 and Australian Provisional Patent Application No. 2008900543 are incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
61063757 | Feb 2008 | US |