The disclosure of Japanese Patent Application No. JP2003-430185, filed Dec. 25, 2003, entitled “Information Partitioning Apparatus, Information Partitioning Method and Information Partitioning Program,” is incorporated herein by reference in its entirety.
The present invention relates to an information partitioning apparatus, an information partitioning method and an information partitioning program used to partition an electronic document containing a plurality of blocks of information, which may be adopted to partition and sort information such as patent publications, court rulings and newsletters provided as electronic documents.
With the popularization of advanced network technologies such as the Internet in recent years, network users are able to access great volumes of electronic documents, and technologies whereby such large volumes of document information are automatically sorted have come to constitute a vital part of electronic communication. Information provided as electronic documents includes, for instance, patent publications. A patent publication is a document containing a plurality of blocks of information including the title of the invention, claims and the effect of the invention. It is necessary to partition the document in correspondence to the individual blocks of information in order to sort the different blocks of information in the document.
Japanese Patent Laid Open Publication No. 2000-285140 (Patent Literature 1) discloses an apparatus that sorts the contents of a document by partitioning it into document portions. The apparatus includes a partitioning means that partitions document data based upon structure information (HTML tags and character font information) with regard to the document data to assist the process of information sorting.
In addition, Japanese Patent Laid Open Publication No. 2001-109772 (Patent Literature 2) discloses an apparatus that extracts article portions containing keywords preregistered by a user in a document containing a plurality of articles with different contents such as an electronically distributed newsletter and sorts the document in units of the individual keywords.
However, the apparatus disclosed in Patent Literature 1 cannot be utilized effectively in conjunction with documents which, unlike patent publications, do not have distinct structure information.
The apparatus disclosed in Patent Literature 2, on the other hand, is capable of extracting a portion of a document such as a newsletter without distinct structure information as a unit article. However, newsletters include those containing articles and “advertorials” together and those in which articles are presented in units of different fields of interest such as politics, economics and sports, and there are also documents such as patent publications containing information provided under different entries, e.g., the title, the claims and the embodiments. When handling any of such documents, the apparatus disclosed in Patent Literature 2 cannot sort the document into unit articles in correspondence to the individual article categories, i.e., “article” and “advertorial” or cannot sort the document into unit articles in correspondence to the individual topics or the individual entries.
Furthermore, aside from the patent publications and newsletters mentioned above, there are other diverse types of electronic documents that contain a plurality of blocks of information. It would be a complicated and time-consuming process to manually prepare a means or a program for partitioning each of such diverse types of documents in a desirable manner.
Accordingly, the arrival of an information partitioning apparatus, an information partitioning method and an information partitioning program that allow an electronic document with no distinct structure information to be partitioned into individual blocks of information in a desirable manner has been eagerly awaited.
In order to achieve the object described above, a first aspect of the present invention provides an information partitioning apparatus that partitions an electronic document input thereto, comprising a means for reference source document storage in which a reference source document describing in the form of an electronic document only superficial characteristics common among a plurality of electronic documents to undergo processing is stored and a means for document comparison that compares the input electronic document with the reference source document stored in the means for reference source document storage and partitions the input electronic document into document portions each constituted of a portion of the input electronic document which is not included in the reference source document and is only present in the input electronic document or a portion of the input electronic document which is an alteration of a portion in the reference source document.
A second aspect of the present invention provides an information partitioning method for partitioning an input electronic document by using a reference source document prepared in advance which describes in the form of an electronic document only superficial characteristics common among a plurality of electronic documents to undergo processing, having a document comparison step in which the input electronic document is compared with the reference source document and the input electronic document is partitioned into document portions each constituted of a portion of the input electronic document which is not included in the reference source document and is only present in the input electronic document or a portion of the input electronic document which is an alteration of a portion of the reference source document.
The information partitioning program achieved in a third aspect of the present invention is characterized in that the step executed in the information partitioning method according to the present invention achieved in the second aspect and the data that need to be prepared in advance when adopting the information partitioning method are described by using codes that can be processed on a computer.
By adopting the present invention, a reference source document is prepared in advance and an input electronic document is partitioned through comparison of the input electronic document with the reference source document. As a result, even an electronic document without distinct structure information can be partitioned into blocks of information (document portions) in a desirable manner.
The following is a detailed explanation of the information partitioning apparatus, the information partitioning method and the information partitioning program achieved in the first embodiment of the present invention, given in reference to the drawings.
An information partitioning apparatus 100 in the first embodiment shown in
The document comparison unit 101 compares an input document with a reference source document which is to be described later and detects an edit status indicating an increase/decrease or an alteration manifesting between data in the reference source document and data in the input document and the corresponding data areas (both in the reference source document and in the input document). The document comparison unit 101 may be achieved by adopting, for instance, the method disclosed in the reference literature: E. Myers, “An O(ND) Difference Algorithm and Its Variations”, Algorithmica 1, 2 (1986), pp. 251-266.
The edit status indicates the comparison results obtained at the document comparison unit 101 as described above, which are classified as “match”, “alter”, “insert” or “delete”. The document comparison unit 101 indicates “match” when it detects identical expressions at a given position i in the reference source document and at a given position j in the input document. The document comparison unit 101 indicates “alter” when it detects a given area (a range from a given position i to another position i+n (n≧0)) in the reference source document replaced with a given area (ranging from a given position j to another position j+m (m≧0)) in the input document. “insert” is indicated when the document comparison unit 101 detects that the input document includes a character string inserted between a given position i and a given position i+1 in the reference source document. “delete” is indicated when the document comparison unit 101 detects that a given area (ranging from a given position i to another position i+n (n≧0)) in the reference source document is deleted from the input document.
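By way of a non-limiting illustration, the four edit statuses described above can be sketched in Python. The sketch assumes that Python's standard `difflib.SequenceMatcher` is used as a stand-in comparison routine; it does not implement the method of the reference literature, and the names `STATUS_BY_TAG` and `edit_statuses` are hypothetical. Note also that `difflib` reports a run of consecutive identical lines as a single "match" entry rather than one entry per position.

```python
from difflib import SequenceMatcher

# Mapping from difflib opcode tags to the four edit statuses named in the text.
STATUS_BY_TAG = {
    "equal": "match",
    "replace": "alter",
    "insert": "insert",
    "delete": "delete",
}


def edit_statuses(ref_lines, input_lines):
    """Return (status, ref_span, input_span) tuples for two lists of lines.

    An "insert" has an empty ref_span and a "delete" has an empty input_span.
    """
    # autojunk=False disables difflib's heuristic that ignores very frequent lines.
    matcher = SequenceMatcher(None, ref_lines, input_lines, autojunk=False)
    return [
        (STATUS_BY_TAG[tag], ref_lines[i1:i2], input_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


if __name__ == "__main__":
    for row in edit_statuses(["a", "b", "c"], ["a", "x", "c", "d"]):
        print(row)
```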
The comparison result storage unit 102 stores in memory the results of the comparison executed by the document comparison unit 101. The comparison result storage unit 102 stores in memory data indicating a reference source document edit start position, an input document edit start position and an input document edit end position in correspondence to each detected edit status, as shown in
The labeling unit 103 assigns sorting labels to individual areas in the input document by using the data stored in the comparison result storage unit 102 and data contained in the reference source document/label correspondence data 105, which are to be detailed later.
The labeling result storage unit 106 stores in memory the results of the processing (the labeling results) executed by the labeling unit 103. The labeling result data recorded in the labeling result storage unit 106 may be data indicating input document start positions, input document end positions and labels, which are stored separately from the input document, such as those shown in
The reference source document data 104 constitute a reference source document (reference source document data) input to the document comparison unit 101. It is to be noted that the term “reference source document data” may be used to refer to the data themselves or to the storage area where the data are stored in this specification. The reference source document, which is used to extract portions of the input document to be sorted (hereafter referred to as document portions), contains character strings in lines constituting, for instance, break points between document portions in units of individual lines by maintaining the original arrangement of the lines.
The reference source document/label correspondence data 105 indicates positions in the reference source document, edit statuses ascertained as the comparison results and labels, as shown in
Next, the operation executed (the information partitioning method adopted) in the information partitioning apparatus 100 in the first embodiment having the structure described above is explained. It is to be noted that the following explanation is given on a specific example in which the document (data) shown in
It is to be noted that the document may be input through a document input unit (not shown) by adopting any input method. For instance, document data downloaded via a network from a provider, either free of charge or for a fee, may be input. Alternatively, document data may be read out from a recording medium such as a flexible disk or a CD-ROM and the document data thus read out may be input. In addition, a document may be entered through a keyboard or a paper document may be converted to an electronic document through OCR (optical character reader) and then may be input. Moreover, an e-mail may be directly input or an e-mail taken in from a mail server may be input. In such a case, the main text portion alone may be input by first slicing out the main text portion.
The document input through the document input unit is then transferred to the document comparison unit 101 as character string data. The document comparison unit 101 executes a comparison of the input document with the reference source document and detects differences between the two documents. The document comparison unit 101 adopting, for instance, the document comparison method disclosed in the reference literature mentioned above detects the differences between the two documents by extracting in sequence the document data in the reference source document and the input document in units of the individual lines, comparing the individual lines to ascertain whether or not they contain identical character strings and looking for matching lines so as to minimize the number of unmatched lines.
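The line-by-line comparison described above may be sketched as follows. This is a minimal illustration that finds a longest common subsequence of lines by dynamic programming, which minimizes the number of unmatched lines but is not the O(ND) method of the reference literature. The function name `match_lines` and the sample documents are hypothetical; the sample merely mirrors the shape of a four-line reference source document and a thirteen-line input document.

```python
def match_lines(ref_lines, input_lines):
    """Return 1-based (ref_position, input_position) pairs of matching lines."""
    n, m = len(ref_lines), len(input_lines)
    # lcs[i][j] = number of matched lines achievable using ref_lines[i:] and input_lines[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if ref_lines[i] == input_lines[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the top to recover one optimal set of matching lines.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if ref_lines[i] == input_lines[j]:
            pairs.append((i + 1, j + 1))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


if __name__ == "__main__":
    ref = ["A", "T", "B", "C"]  # hypothetical reference source document
    inp = ["x", "y", "T", "1", "2", "3", "4", "5", "6", "B", "C", "7", "8"]
    print(match_lines(ref, inp))
```

The table requires memory proportional to the product of the two line counts, so this form is suited only to short documents.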
In
By minimizing the number of lines that are left unmatched, the document comparison unit 101 detects the line at position 2 in the reference source document REF and the line at position 3′ in the input document IN, the line at position 3 in the reference source document REF and the line at position 10′ in the input document IN, and the line at position 4 in the reference source document REF and the line at position 11′ in the input document IN as sets of matching lines. It is to be noted that the line at position 0 immediately preceding the first line in the reference source document REF and the line at position 0′ immediately preceding the first line in the input document IN (a hypothetical combination that does not exist) and the line at position 5 immediately following the last line in the reference source document REF and the line at position 14′ immediately following the last line in the input document IN (a hypothetical combination that does not exist) are both regarded as sets of matching lines.
After detecting the matching lines in the reference source document REF and the input document IN as described above, the document comparison unit 101 generates (data indicating) the comparison results to be stored into the comparison result storage unit 102. The comparison result data in
It is to be noted that the result data stored in the comparison result storage unit 102 may indicate all the types of edit statuses, i.e., “match”, “alter”, “insert” and “delete”, may indicate three different types of edit statuses, i.e., “alter”, “insert” and “delete” or may indicate two different types of edit statuses, “alter” and “insert”. Namely, while document portions can be sorted and extracted as long as at least the two edit statuses, i.e., “alter” and “insert”, can be recognized, faster processing may be achieved depending upon the specific structure of the comparison result storage unit 102 if “match”, “alter”, “insert” and “delete” or “alter”, “insert” and “delete” output from the document comparison unit are directly stored without first sifting the output.
Between the two successive matched lines in the reference source document REF, i.e., between the line at position 0 and the line at position 2, a line at position 1 is present, whereas there are two lines present between the corresponding pair of matched lines at position 0′ and position 3′ in the input document IN. These two lines do not match the line at position 1 in the reference source document and, accordingly, the edit status “alter”, the reference source document edit start position “1”, the input document edit start position “1′” and the input document edit end position “2′” are stored as the first set of records in the comparison result data.
There is no line between the next two matched lines at position 2 and position 3 in the reference source document REF, whereas there are six lines present between the corresponding matched lines in the input document IN, i.e., between the line at position 3′ and the line at position 10′. Accordingly, the edit status “insert”, the reference source document edit start position “2”, the input document edit start position “4′” and the input document edit end position “9′” are stored as the next set of records in the comparison result data.
In addition, there is no line present between the next two matched lines at positions 3 and 4 in the reference source document REF, and there is likewise no line present between the corresponding matched lines in the input document IN, i.e., between the line at position 10′ and the line at position 11′. Since the edit status is neither “insert” nor “alter”, the data corresponding to the results of this particular comparison are not stored into the comparison result storage unit 102.
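The derivation of comparison result records from the sets of matching lines may be illustrated by the following sketch. It assumes the matched pairs are already available as 1-based positions and that only “alter” and “insert” records are stored; the function name `build_records` and the tuple layout (status, reference start, input start, input end) are hypothetical conventions adopted for the illustration.

```python
def build_records(pairs, ref_len, input_len):
    """Turn matched (ref_pos, input_pos) pairs into "alter"/"insert" records."""
    # Sentinel pairs stand for the hypothetical lines immediately before the
    # first line and immediately after the last line of both documents.
    bounded = [(0, 0)] + list(pairs) + [(ref_len + 1, input_len + 1)]
    records = []
    for (ref_prev, in_prev), (ref_next, in_next) in zip(bounded, bounded[1:]):
        ref_gap = ref_next - ref_prev - 1   # unmatched reference lines in between
        in_gap = in_next - in_prev - 1      # unmatched input lines in between
        if in_gap == 0:
            continue  # nothing in the input: a match or a deletion, not stored here
        if ref_gap > 0:
            # Reference lines were replaced; the record starts at the first replaced line.
            records.append(("alter", ref_prev + 1, in_prev + 1, in_next - 1))
        else:
            # Lines were inserted after the preceding matched reference line.
            records.append(("insert", ref_prev, in_prev + 1, in_next - 1))
    return records


if __name__ == "__main__":
    # Matching lines 2-3', 3-10' and 4-11' in a 4-line reference and 13-line input.
    print(build_records([(2, 3), (3, 10), (4, 11)], ref_len=4, input_len=13))
```

With the matching lines given in the example, the sketch yields an “alter” record at reference position 1 covering input lines 1′ to 2′, an “insert” record at reference position 2 covering input lines 4′ to 9′, and an “insert” record at reference position 4 covering the remaining input lines 12′ to 13′.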
The third set of records in
Next, the labeling unit 103 assigns labels by using the reference source document/label correspondence data 105 and the data at the comparison result storage unit 102.
The labeling unit 103 extracts a set of the result data (a set of records) in the comparison result storage unit 102 (S701) and makes a decision as to whether or not the edit status in the extracted result data indicates “alter” or “insert” (S702, S703).
If it is judged that the edit status in the extracted result data does not indicate either “alter” or “insert” (in other words, if the edit status is “delete” or “match”), the labeling unit 103 makes a decision as to whether or not there are result data yet to be processed (S710), and the operation returns to step S701 to extract another set of result data if it is judged that there are still unprocessed results data, whereas the sequence of processing in
If the edit status is judged to indicate “insert” or “alter”, the reference source document start position in the same set of result data is ascertained (S704). Then, by using the combination of the edit status and the reference source document start position as a key, the reference source document/label correspondence data 105 are searched to find the corresponding set of records (S705, S706). In other words, a set of records indicating a position matching the reference source start position and an edit status matching the detected edit status is found in the reference source document/label correspondence data 105.
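The keyed search described above may be sketched as a dictionary lookup. The sketch assumes that the reference source document/label correspondence data are held as a mapping from (position, edit status) to label; the function name `find_labels`, the record layout and the label values shown are all hypothetical.

```python
def find_labels(records, label_table):
    """Pair each "alter"/"insert" record with its label, if one is registered.

    records: tuples of (status, ref_start, input_start, input_end)
    label_table: dict mapping (ref_start, status) to a label string
    """
    found = []
    for record in records:
        status, ref_start = record[0], record[1]
        if status not in ("alter", "insert"):
            continue  # "match" and "delete" records carry no document portion
        label = label_table.get((ref_start, status))
        if label is not None:  # assumption: records without an entry are skipped
            found.append((record, label))
    return found


if __name__ == "__main__":
    table = {(1, "alter"): "title", (2, "insert"): "body"}  # hypothetical labels
    records = [("alter", 1, 1, 2), ("insert", 2, 4, 9), ("delete", 3, None, None)]
    for record, label in find_labels(records, table):
        print(label, record)
```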
Once the search is executed successfully, the labeling unit 103 extracts the corresponding character string area (document portion) in the input document (S707) based upon the input document edit start position and the input document edit end position in the result data, ascertains the value (label) stored in the label field in the records found in the reference source document/label correspondence data 105 (S708), attaches the label thus obtained to the extracted character string area (document portion) and stores the labeled document portion into the labeling result storage unit 106 (S709). The data stored into the labeling result storage unit 106 may be the type of data such as that shown in
The processing described above (in steps S701 through S709) is repeatedly executed until there are no more comparison result data that can be processed (S710), and once the comparison result data have all been processed, the sequence of processing in
For instance, if the first set of comparison result data in
Since there are other sets of result data yet to be processed at this point, the second set of result data in
Since there is another set of result data yet to be processed at this point, the third set of result data in
When data are stored in the labeling result storage unit 106 in the data format shown in
For instance, based upon the first set of data in
The group of labeled document portions, such as that shown in
It is to be noted that no restriction is imposed with regard to the method of output, and the user may be allowed to specify a given label to output the document portion corresponding to the specified label alone, instead of having all the document portions output.
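As one possible illustration of such selective output, the following sketch assumes that the labeling results are stored separately from the input document as (input start position, input end position, label) entries with 1-based inclusive positions. The function name `output_portions` and the sample labels are hypothetical.

```python
def output_portions(input_lines, labeling_results, wanted_label=None):
    """Return (label, text) for every labeled portion, or only for wanted_label."""
    output = []
    for start, end, label in labeling_results:
        if wanted_label is None or label == wanted_label:
            # Positions are 1-based and inclusive, hence the start - 1 offset.
            output.append((label, "\n".join(input_lines[start - 1:end])))
    return output


if __name__ == "__main__":
    lines = ["Widget", "Mk II", "----", "A widget that", "does things."]
    results = [(1, 2, "title"), (4, 5, "body")]  # hypothetical labeling results
    print(output_portions(lines, results, wanted_label="body"))
```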
As described above, the first embodiment achieves advantages in that a character string area (document portion) corresponding to a specific type of information can be recognized and extracted from a processing target document which may not always have a distinct structure in compliance with XML, HTML or SGML, simply by preparing a reference source document describing superficial characteristics (character strings or horizontal lines indicating various entries, character strings or horizontal lines present at break points of different entries etc.) that often appear in documents to be sorted.
It also achieves an advantage in that by using the labeling data prepared in correspondence to the reference source document, a label can be assigned to a character string area (document portion) that has been recognized or extracted.
Next, the information partitioning apparatus, the information partitioning method and the information partitioning program achieved in the second embodiment of the present invention are described in detail in reference to drawings.
In addition to the components of the information partitioning apparatus 100 in the first embodiment, the information partitioning apparatus 10A achieved in the second embodiment includes a reference source document data generation unit 107 and a reference source document/label correspondence data generation unit 108. Since components other than these have functions identical to those in the first embodiment, their explanation is omitted.
The reference source document data generation unit 107 generates a reference source document 104 based upon two documents (document data) input thereto and stores the generated reference source document in its storage unit. The specific method adopted to generate the reference source document 104 is to be explained later in reference to the operation of the information partitioning apparatus.
The reference source document/label correspondence data generation unit 108 generates the reference source document/label correspondence data 105 to be used at the labeling unit 103 and stores the generated reference source document/label correspondence data in its storage unit. The specific method adopted to generate the reference source document/label correspondence data 105 is to be described later in reference to the operation of the information partitioning apparatus.
The individual operations executed at the reference source document data generation unit 107 and the reference source document/label correspondence data generation unit 108 differentiate the information partitioning apparatus in the second embodiment from the information partitioning apparatus in the first embodiment, and accordingly, the following explanation focuses on the operations executed at the reference source document data generation unit 107 and the reference source document/label correspondence data generation unit 108.
Two different documents (document data) having similar superficial characteristics are input to the reference source document data generation unit 107 through a data resource document input unit (no reference numeral assigned). For instance, the document shown in
At the reference source document data generation unit 107, the two documents having been input are first compared with each other. The documents may be compared through a method similar to that adopted in the document comparison unit 101 explained in reference to the first embodiment. If the document comparison function is mainly constituted in software, its processing routine may be shared by the document comparison unit 101 and the reference source document data generation unit 107.
Once the processing executed by the reference source document data generation unit 107 is completed, processing by the reference source document/label correspondence data generation unit 108 starts. The reference source document/label correspondence data generation unit 108 works in collaboration with the user to generate the reference source document/label correspondence data.
The reference source document/label correspondence data generation unit 108 first correlates portions of the reference source document generated by the reference source document data generation unit 107 to portions of a document (preferably a document used as a resource when generating the reference source document) used for the generation of the reference source document/label correspondence data. Namely, it recognizes the lines in the resource document corresponding to the specific lines in the reference source document.
Next, the reference source document/label correspondence data generation unit 108 recognizes portions with edit statuses that can be judged to indicate “insert” or “alter” on the premise that the corresponding relationship described above indicates matching lines (through processing similar to the processing executed by the document comparison unit 101), and determines values to indicate the “reference source document start positions” and “edit statuses” in the reference source document/label correspondence data. At this point, the data do not include any values corresponding to the labels in
In order to determine the value (label name) to indicate the label for the first set of records in
The reference source document/label correspondence data generation unit 108 subsequently outputs the complete reference source document/label correspondence data generated as described above as the reference source document/label correspondence data 105 and stores (registers) them in its storage unit.
In addition to the advantages of the first embodiment, the second embodiment achieves an advantage in that a reference source document can be automatically generated. Once a given reference source document and reference source document/label correspondence data are prepared, a document subsequently input can be sorted by using these data.
While two documents are compared with each other by the document comparison unit 101 and the reference source document data generation unit 107 in units of individual lines in the embodiments described above, the two documents may instead be compared in units of individual characters, or in units of individual words after executing morphological analysis processing. As a further alternative, the two documents may be compared through a combination of character-based comparison and word-based comparison.
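The choice of comparison unit may be illustrated by a small helper that converts a document into the sequence of units to be compared. The sketch is an assumption-laden simplification: the function name `to_units` is hypothetical, and whitespace splitting is used in place of morphological analysis, which would be required for languages written without spaces between words.

```python
import re


def to_units(text, unit="line"):
    """Split a document into the units compared: lines, characters or words."""
    if unit == "line":
        return text.splitlines()
    if unit == "char":
        return list(text)
    if unit == "word":
        # Stand-in for morphological analysis: split on runs of whitespace.
        return re.findall(r"\S+", text)
    raise ValueError("unit must be 'line', 'char' or 'word'")


if __name__ == "__main__":
    print(to_units("a b\nc", "word"))
```

Whichever unit is chosen, the resulting sequences can be handed to the same comparison routine, since the routine only needs to test units for equality.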
In addition, while an input document is first partitioned into document portions and then labels are assigned to the individual document portions in the embodiments explained above, the information partitioning apparatus may simply partition the input document into document portions instead.
Furthermore, while an explanation is given above on the embodiments in reference to a single reference source document, a plurality of reference source documents of different types such as a reference source document to be used in conjunction with patent specifications, a reference source document to be used in conjunction with patent applications, a reference source document to be used in conjunction with newsletters and a reference source document to be used in conjunction with court rulings may be provided and, in such a case, a plurality of sets of reference source document/label correspondence data should be provided in correspondence. For instance, before inputting the document to be sorted, the user may specify the reference source document to be used to the apparatus, or the input document may be compared with all the reference source documents and then the subsequent processing may be executed by using the reference source document with the greatest number of matching lines as a valid reference source document. Alternatively, the reference source document may be automatically selected by ascertaining whether or not a given document contains character strings or character string patterns (e.g., a newsletter title) inherent to a specific type of document (patent specification, newsletter or court ruling).
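Selection of the reference source document with the greatest number of matching lines may be sketched as follows. The sketch assumes Python's standard `difflib.SequenceMatcher` as the comparison routine; the count is whatever that matcher finds and is not guaranteed to be the maximum possible. The function names and the sample reference source documents are hypothetical.

```python
from difflib import SequenceMatcher


def matching_line_count(ref_lines, input_lines):
    """Count the lines that difflib pairs up between the two documents."""
    matcher = SequenceMatcher(None, ref_lines, input_lines, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def choose_reference(references, input_lines):
    """Return the name of the reference source document with the most matches.

    references: dict mapping a document-type name to its list of lines.
    On a tie, the entry listed first in the dict is returned.
    """
    return max(
        references,
        key=lambda name: matching_line_count(references[name], input_lines),
    )


if __name__ == "__main__":
    references = {
        "patent": ["[Title]", "[Claims]"],
        "newsletter": ["== News ==", "== Ads =="],
    }
    print(choose_reference(references, ["== News ==", "story", "== Ads ==", "buy"]))
```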
While two documents are input to the reference source document data generation unit 107 in the second embodiment, three or more different documents may instead be input, and in such a case, the reference source document may be created by including the lines that are commonly present in all the documents, or by including matching lines found in a predetermined number of documents (e.g., in the majority of documents).
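Generation of a reference source document from three or more documents may be illustrated by the following simplified sketch. It counts in how many documents each line occurs and keeps the lines reaching a threshold, in order of first appearance; it does not align the documents line by line as a full comparison would, so it is an approximation only. The function name `build_reference` and the parameter `min_docs` are hypothetical.

```python
from collections import Counter


def build_reference(documents, min_docs=None):
    """Keep lines that occur in at least min_docs of the documents.

    documents: list of documents, each a list of lines.
    min_docs: threshold; defaults to all documents (lines common to every one).
    """
    if min_docs is None:
        min_docs = len(documents)
    # Count each line at most once per document.
    counts = Counter(line for doc in documents for line in set(doc))
    reference, seen = [], set()
    for doc in documents:
        for line in doc:
            if line not in seen and counts[line] >= min_docs:
                reference.append(line)
                seen.add(line)
    return reference


if __name__ == "__main__":
    docs = [
        ["[Title]", "foo", "[Claims]", "x"],
        ["[Title]", "bar", "[Claims]", "y"],
        ["[Title]", "baz", "other", "x"],
    ]
    print(build_reference(docs))               # lines present in all documents
    print(build_reference(docs, min_docs=2))   # lines present in a majority
```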
In addition, while the apparatus automatically determines the “positions” and the “edit statuses” in the reference source document/label correspondence data and “labels” are entered by the user in the second embodiment, reference source document/label correspondence data may be generated by adopting another method. For instance, the “positions”, the “edit statuses” and the “labels” may all be entered by the user or the “positions”, the “edit statuses” and the “labels” may all be automatically determined by the apparatus. The label values may each be constituted with the entire character string in the first line of the document portion corresponding to a given edit status in the resource document or a character string enclosed within parentheses in the first line.
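Automatic determination of a label value from the first line of a document portion may be sketched as follows. The sketch assumes that either ASCII or full-width parentheses may enclose the label text; the function name `label_from_portion` and its `use_parentheses` switch are hypothetical.

```python
import re

# Text enclosed in ASCII or full-width parentheses (full-width support is an assumption).
_PARENTHESIZED = re.compile(r"[(（]([^()（）]+)[)）]")


def label_from_portion(portion_lines, use_parentheses=True):
    """Derive a label from the first line of a document portion.

    Returns the parenthesized string in the first line when one is present and
    use_parentheses is True; otherwise returns the entire first line.
    """
    first_line = portion_lines[0].strip() if portion_lines else ""
    if use_parentheses:
        found = _PARENTHESIZED.search(first_line)
        if found:
            return found.group(1).strip()
    return first_line


if __name__ == "__main__":
    print(label_from_portion(["(Title of the Invention) Widget", "more text"]))
    print(label_from_portion(["Claims", "1. A widget"]))
```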
Number | Date | Country | Kind |
---|---|---|---|
JP2003-430185 | Dec 2003 | JP | national |