1. Field of the Invention
The present invention relates to document management and, more particularly, to locating text or information in a document.
2. Brief Description of Related Developments
The problem addressed is the matching of text that is expected to be found at a certain location within a document. The text must remain matchable even after it has been moved or altered. Prior to the invention, text could be matched character-by-character against each section of text in the document. Character-by-character matching, however, does not allow for the text to be altered.
The present invention is directed to searching for text in a document. In one embodiment, the method includes comparing a signature of text to be located with a signature of each section of text in the document. A distance from an expected location of the text to be matched is computed and compared to a location of each section of text in the document. An exact match of the signature of text to be located that is nearest to the expected location of the text to be located is sought. If an exact match of the signature is not found at the expected location, a close match to the signature, that is nearest to the expected location, is sought. If the exact match is found, the location of the exact match is identified as the location of the text being searched for. If the exact match is not found, and a close match is identified, the close match is identified as the location of the text being searched for. If a close match is not identified, the search is unsuccessful and the text can be considered as an orphan by the application using the invention.
In another aspect, the present invention is directed to a method of matching a section of text to be located to the existing text in a document. In one embodiment, a signature is created for the section of text to be located. A signature is then created for each section of existing text in the document. The signature can include a number of elements in a pre-determined order. A first element position or set of positions can be assigned for each letter of an alphabet of a language of the text and the numeric value of the element can identify a number of occurrences of the letter in the section for which the signature is being created. Another element position or set of positions can be used to identify a number of occurrences of any numeric in the section for which the signature is being created. A further element position or set of positions can be used to identify a number of occurrences of any separator in the section for which the signature is being created. A part score is calculated for each signature by summing the value of the element positions. A part score for the text to be matched is compared, in turn, with the part score for each section of text in the document. It is determined whether or not there is an exact match of part scores. A distance from an expected location of the text to be matched in the document is compared with the location of each section of text in the document. This can include providing each segment of the document with a sequence number, with the initial value starting at the beginning of the document. The distance between the two segments is generally the distance between the sequence numbers. Any exact match of the part score of the text to be matched to the part score of any section of text in the document is identified. If the location of the exact match is at the expected location of the text being sought, the text sought to be matched is identified as being matched. 
If an exact match of locations is not found, but an exact match of part scores is found, the location of the section of text in the document that has a matching part score and is nearest in distance to the expected location of the text to be matched is identified as the location of the text sought to be matched. If an exact match is not identified, a close match is sought. At least one close match of part scores is identified, and the close match that is nearest in distance to the expected location of the text to be matched is then identified as the location of the text sought to be matched. If a close match cannot be identified, the search is considered unsuccessful. A segment can thus be considered orphaned if a close match, based on a threshold defined by the implementor, is not found.
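The part-score comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names part_score and locate_by_part_score, the use of list indices as sequence numbers, the tie-breaking rule, and the default threshold value are all hypothetical choices made for the example.

```python
from typing import Optional, Sequence


def part_score(signature: Sequence[int]) -> int:
    """Sum the element values of a signature to obtain its part score."""
    return sum(signature)


def locate_by_part_score(
    target: int,
    expected_seq: int,
    section_scores: Sequence[int],
    threshold: int = 2,  # assumed, implementor-defined tolerance for a close match
) -> Optional[int]:
    """Return the sequence number of the best section, or None if orphaned.

    section_scores[i] is the part score of the section with sequence number i.
    """

    def nearest(candidates):
        # Smallest sequence-number distance wins; ties go to the earlier
        # section (an assumption, the source does not specify tie-breaking).
        return min(candidates, key=lambda i: (abs(i - expected_seq), i), default=None)

    # Exact part-score matches are preferred.
    exact = [i for i, s in enumerate(section_scores) if s == target]
    found = nearest(exact)
    if found is not None:
        return found

    # Otherwise fall back to close matches within the threshold.
    close = [i for i, s in enumerate(section_scores) if abs(s - target) <= threshold]
    return nearest(close)
```

A return value of None corresponds to the unsuccessful search, in which the calling application may treat the segment as orphaned.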
In a further aspect, the present invention is directed to a method for locating data in a document. In one embodiment, the method includes calculating a signature for the data corresponding to a marker in a first version of the document. In a second version of the document, a signature is calculated for each block of data in the second version. The signature of the data from the first version is compared with each signature calculated in the second version. Any exact match of signatures is identified. In the second version of the document, a distance is computed from an expected location of the signature for the data corresponding to the marker in the second version of the document to any matching signature identified. A marker is posted in the second version of the document at a location corresponding to the location of any matching signature that is nearest to the expected location.
The foregoing aspects and other features of the present invention are explained in the following description, taken in connection with the accompanying drawings, wherein:
Referring to
As shown in
The present invention allows text in a document to be located even if the document is modified from its original form, for example if text is added to or deleted from the document. In one embodiment, this includes creating a signature for a section of text that needs to be matched and a signature for each section of text in the document being searched. In one embodiment, a signature can be made up of, for example, 28 elements: one for each of the 26 letters of the alphabet, one for any numeric character and one for any separator (e.g. space, tab). In alternate embodiments, the signature can be made up of any suitable number of elements.
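A 28-element signature of this kind might be computed as in the sketch below. The function name signature, the case-insensitive treatment of letters, the restriction of separators to space and tab, and the choice to ignore every other character are assumptions made for illustration only.

```python
import string

# Element positions 0-25: one per letter of the English alphabet.
LETTER_INDEX = {c: i for i, c in enumerate(string.ascii_lowercase)}
DIGIT_POS = 26        # element for any numeric character
SEPARATOR_POS = 27    # element for any separator
SEPARATORS = " \t"    # assumed separator set (space, tab)


def signature(text: str) -> list[int]:
    """Count letters, digits and separators of a section into 28 elements."""
    sig = [0] * 28
    for ch in text:
        idx = LETTER_INDEX.get(ch.lower())  # upper and lower case share a counter
        if idx is not None:
            sig[idx] += 1
        elif ch in string.digits:
            sig[DIGIT_POS] += 1
        elif ch in SEPARATORS:
            sig[SEPARATOR_POS] += 1
        # Any other character has no element assigned and is skipped here.
    return sig
```

Because the signature only counts occurrences, two sections containing the same characters in a different order produce the same signature, which is what lets a section be recognized after minor rearrangement.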
One embodiment of a method for calculating a signature for text is illustrated in
The process then moves to count any letters in the section. If the character is the letter A 212, the letter A count is incremented 213. A similar process and count can be performed for each letter of the alphabet being used, up to and including the last letter 214 of the particular alphabet and a corresponding counter 215. For purposes of explanation of the present invention, the English language alphabet is illustrated; however, in alternative embodiments, any suitable alphabet can be used with a corresponding number of elements and element counters. Similarly, counters can be set up for any desired characters, such as, for example, punctuation, brackets and symbols. If the character is not one that has been assigned an element space and counter as described with reference to
For example, referring to
Referring to
A signature is calculated 402 for each section of text in the document. The signature of the text to be located or matched is also calculated 404. The expected location of the text to be matched is calculated 406 and the position or location of each section of text in the document is determined 408. A comparison 410 is then made between the signature of the text to be matched and the signature of the section at the expected location of the text to be matched. If an exact match is found 412, the text is found 418. If an exact match is not found, a distance is computed 414 between an expected location of the text and the location of each section of text in the document. It is determined 416 whether a close match can be identified (in comparison scoring) that is nearest to the expected location of the text to be matched. A close match can be a factor of the correspondence in signatures and the proximity in distance of the close match to the expected location of the text. If a close match is determined 416, that location is identified 418 as the location of the text to be matched. A close match might be a section of text that has an identical signature to the text to be matched and that is nearest to the expected location of the text. A close match might also include a section of text that has a signature that is comparatively similar to the signature of the text to be matched and is nearest to the expected location of the text. Generally, any suitable pre-defined parameters can be used to define a close match, and could include allowing for certain variances in the number of each of the elements that make up the signature or the total score of the signature, for example. The present invention is not intended to be limited by the scope of the definition of a close match.
The tolerance level for determining an acceptable close match can be factored into the algorithm that compares two signatures.
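One way such a tolerance could be expressed is sketched below. The function name is_close_match, the two tolerance parameters and their default values are hypothetical, and requiring both the per-element and the total-score variance to be within tolerance is only one of several possible definitions of a close match.

```python
from typing import Sequence


def is_close_match(
    a: Sequence[int],
    b: Sequence[int],
    per_element: int = 1,  # assumed allowed variance in each element count
    total: int = 2,        # assumed allowed variance in the total score
) -> bool:
    """Decide whether two signatures are close enough to be treated as a match."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same number of elements")
    # Variance in the count of each individual element.
    elements_ok = all(abs(x - y) <= per_element for x, y in zip(a, b))
    # Variance in the total score (the sum of all elements).
    total_ok = abs(sum(a) - sum(b)) <= total
    return elements_ok and total_ok
```

Tightening either parameter causes more altered sections to be reported as unmatched, while loosening them increases the chance of anchoring to the wrong section.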
If a close match is not found 416, the search is rendered unsuccessful 420. This can be an appropriate state for text that has been altered beyond recognition. This text might be considered orphaned by the application, or not matchable.
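The overall search flow described above, from the comparison at the expected location through to the unsuccessful state, can be sketched as follows. The names find_section and default_is_close, the use of list indices as section locations, the preference for an identical signature over a merely similar one, and the default closeness rule are assumptions made for this example; the reference numerals in the comments are those used in the description.

```python
from typing import Callable, Optional, Sequence

Signature = Sequence[int]


def default_is_close(a: Signature, b: Signature) -> bool:
    # Assumed rule: total element-wise difference of at most 2.
    return sum(abs(x - y) for x, y in zip(a, b)) <= 2


def find_section(
    target: Signature,
    expected: int,
    sections: Sequence[Signature],
    is_close: Callable[[Signature, Signature], bool] = default_is_close,
) -> Optional[int]:
    """Return the location of the matched section, or None if unsuccessful."""
    target = list(target)

    # 410/412: exact match at the expected location.
    if 0 <= expected < len(sections) and list(sections[expected]) == target:
        return expected

    # 414: order all sections by distance from the expected location.
    by_distance = sorted(range(len(sections)), key=lambda i: (abs(i - expected), i))

    # 416: an identical signature elsewhere, nearest first.
    for i in by_distance:
        if list(sections[i]) == target:
            return i

    # 416: otherwise a comparatively similar signature, nearest first.
    for i in by_distance:
        if is_close(target, sections[i]):
            return i

    return None  # 420: no match; the text may be treated as orphaned
```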
With reference to
One example of this formula or algorithm may be described as a pseudo code as follows in Table 1:
num2 is the larger of the two;
Referring again to
However, if the change to the text of the document is too substantial, for example if the entire sentence or section has been rewritten, then a match will not be found 420.
The present invention can be useful when annotations are associated with document sections. The annotations need to be able to associate themselves with the section of text to which they belong, even if the text changes somewhat or moves. One embodiment of the use of annotations in a document is illustrated with respect to
In one embodiment, referring to
If an annotation is moved, for example in a “cut and paste” operation, the old signature is discarded 808. The signature of the section to which the annotation is moved is applied 810, or anchored. Anchoring, as that term is used herein, generally refers to fixing the annotation in the general area of the section of text to which the note applies, using the signature of the section to which the annotation is anchored and the location at which that section is expected to be found.
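As a rough sketch of what an anchor might hold, the example below stores a signature and an expected location with each annotation and replaces both when the annotation is moved. The Annotation class, its field names and the move_annotation helper are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Annotation:
    note: str
    anchor_signature: Tuple[int, ...]  # signature of the section anchored to
    expected_location: int             # where that section is expected to be


def move_annotation(
    annotation: Annotation,
    new_location: int,
    section_signatures: Sequence[Sequence[int]],
) -> Annotation:
    """Re-anchor an annotation after a move such as a cut and paste."""
    # 808: the old signature is discarded by overwriting it.
    # 810: the signature of the destination section is applied.
    annotation.anchor_signature = tuple(section_signatures[new_location])
    annotation.expected_location = new_location
    return annotation
```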
Referring to
If an exact match in signatures cannot be identified, it is determined whether or not there is a close match 908 of signatures (in comparison scoring). Each close match is paired 914 with a calculated section location distance 904. The close match whose section location is nearest to the expected location is acceptable and can be considered the location of the section of text to be matched. The match is identified 916 and the annotation is anchored at that location.
If a close match is not identified, the search can be rendered unsuccessful 912, which is an appropriate state for text that has been altered beyond recognition.
The present invention may also include software and computer programs incorporating the process steps and instructions described above that are executed in different computers. In the preferred embodiment, the computers are connected to the Internet.
Computer systems 502 and 504 may also include a microprocessor for executing stored programs. Computer 502 may include a data storage device 508 on its program storage device for the storage of information and data. The computer program or software incorporating the processes and method steps incorporating features of the present invention may be stored in one or more computers 502 and 504 on an otherwise conventional program storage device. In one embodiment, computers 502 and 504 may include a user interface 510, and a display interface 512 from which features of the present invention can be accessed. The display interface 512 and user interface 510 could be a single interface or comprise separate components and systems. The user interface 510 and the display interface 512 can be adapted to allow the input of queries and commands to the system, as well as present the results of the commands and queries.
The present invention enables text matching functionality for a documentation review server which would increase productivity of teams of users engaged in review of documents.
Without such a solution, a section of text in a document becomes lost as soon as it is moved or altered in any way. The advantage of the solution is that it allows the section to be moved and/or altered while retaining the matchability of the section of text.
It should be understood that the foregoing description is only illustrative of the invention. Various alternatives and modifications can be devised by those skilled in the art without departing from the invention. Accordingly, the present invention is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.
The present invention is related to U.S. patent application Ser. No. ______, filed on ______, 2005 as Express Mail No. EV 327711492 US, entitled COLLABORATIVE DOCUMENT REVIEW, by David Lane Diamond, Michael S. Rubino, and Jeremy Lizt, (Attorney Docket Number 835-010955-US(PAR); OID-2004-080-01) and assigned to the assignee of the instant application, the disclosure of which is incorporated herein by reference in its entirety.