As used herein, the term “table of contents” is intended to encompass any table listing locations of chapters, sections, or other divisions of a document. As used herein, the term “table of objects” is intended to encompass any table listing locations of objects in a document. For example, a table of objects may be: a table of images listing locations of images in a document; a table of figures listing locations of figures in a document; a table of tables listing locations of substantive tables in a document (the term “substantive table” being used here to denote a table containing content of the document as opposed to an organizational table that provides organization to the document); a table of panels listing locations of panels containing special information offset from the general flow of text; a table of textboxes listing locations of textboxes offset from the general flow of text; or so forth. The term “organizational table” is intended to encompass both tables of contents and tables of objects, but not substantive tables that contain content of the document.
A document typically includes a table of contents and optionally one or more tables of objects. However, it is contemplated for the document to include one or more tables of objects without a table of contents, or for the document to include a table of contents without any tables of objects. It is contemplated for the document to include more than one table of contents (for example, an overall table of contents for a multiple-chapter document along with a table of contents for each chapter). It is contemplated for the document to include more than one table of objects of the same object type (for example, a table of figures for each chapter of a multiple-chapter document) either alone or in addition to a table of contents or table of objects of another type.
The locations of document divisions or objects are typically specified in an organizational table by page numbers; however, other location identifiers are also contemplated, such as specifying location by volume, section, chapter, or so forth, or by some combination of location specifiers such as a volume number and a page number within that volume. In addition to listing the locations of document divisions or objects in the document, an organizational table optionally lists summary or capsule information about the document divisions or objects, such as section heading text, caption text, or so forth. The organizational table may also list document division or object enumerators, such as, for example, “FIG. 1”, “FIG. 2”, and so forth in the case of a table of figures.
In an unstructured or shallowly structured document, the organizational table or tables are typically stored integrally with the content of the document. Various techniques can be used for extracting such organizational tables from the document. The output of the organizational tables extractor is one or more organizational tables each including a set of text fragments (possibly represented by pointers to text fragments within the document) corresponding to entries of the organizational table, in which each entry has an associated linked text fragment (again, possibly represented by a document pointer) such as a corresponding chapter heading, section heading, object caption, or so forth. The association or linkage between entries and corresponding linked text fragments (e.g., headings or captions) can be recognized and quantified or ranked based on various criteria, such as use of distinctive heading font size and/or font style, arrangement of text fragments on a page, common textual content or textual similarity, or so forth.
With reference to
With reference to
The resulting ordered sequence of text fragments 14 is processed by a textual similarity links identifier 20 that identifies links 22. Each link is defined by a pair of textually similar text fragments. The text fragments of the pair defining the link are identified herein as source and target text fragments. The source text fragment is a candidate for being an entry of an organizational table, while the target text fragment is a candidate linked text fragment.
There are various ways of defining or identifying such pairs of text fragments. In general, for N fragments, the computation of links is of order O(N²). Additionally, the possible presence of noise in the text should be accounted for. Noise can come from various sources, such as incorrect PDF-to-text conversion, or organizational table-specific problems such as a page number that appears in the organizational table contents but not in the document body, or a series of ellipses ( . . . ) that relates the page number to descriptive text in the organizational table. In some embodiments, each text fragment is tokenized into a series of alphanumeric tokens with non-alphanumeric separators such as tabs, spaces, or punctuation signs. In some embodiments, a Jaccard measure is used to measure textual similarity. The Jaccard measure is computed as the cardinality of the intersection of the two token sets defined by candidate source and target text fragments divided by the cardinality of the union of these two token sets. A link is defined for those pairs in which the Jaccard measure is above a selected matching threshold. In other embodiments an edit distance or other suitable measure is used as the textual similarity comparison. For an edit distance measure, the threshold is a maximum: those pairs having an edit distance less than an edit distance threshold are designated as textually similar pairs.
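The tokenization and Jaccard-based link identification described above can be sketched as follows. This is a minimal illustration, not the implementation of the textual similarity links identifier 20: the function names, the lower-casing of tokens, the default threshold of 0.5, and the restriction to pairs in which the source precedes the target are all assumptions made for the example.

```python
import re
from itertools import combinations

def tokenize(fragment):
    """Return the set of alphanumeric tokens in a fragment.
    Lower-casing is an assumption; it makes the comparison case-insensitive."""
    return {tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", fragment)}

def jaccard(a, b):
    """Cardinality of the token-set intersection divided by that of the union."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def find_links(fragments, threshold=0.5):
    """Compare every pair of fragments (order N^2) and keep the similar ones.
    Each link is (source_index, target_index, similarity); treating the
    earlier fragment as the source is an assumption of this sketch."""
    links = []
    for i, j in combinations(range(len(fragments)), 2):
        similarity = jaccard(fragments[i], fragments[j])
        if similarity > threshold:
            links.append((i, j, similarity))
    return links
```

In this sketch, noise such as a trailing page number or a run of ellipsis dots lowers the similarity without eliminating the link: an entry such as "Chapter 1 Introduction ..... 3" shares three of its four tokens with the heading "Chapter 1 Introduction".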
With brief reference to
With reference to
Although most of the text fragments of the organizational table 110 are entries 112, a small portion of the text fragments in the contiguous sub-sequence of text fragments defining the organizational table 110 may be holes, rather than entries 112. The holes do not have associated links 114, and do not represent an entry of the organizational table linking to another portion of the document. An example hole 116 is shown in
The second criterion is textual similarity. Each link 114 should connect an entry 112 to a document division heading, object caption, or other linked text fragment having text that is similar to the text of the entry. The textual similarity is suitably measured by the Jaccard or other text similarity measure employed by the textual similarity links identifier 20. The target or linked text fragment is typically a heading of a chapter, section, or other document division in the case of a table of contents, or a caption or heading in the case of a table of objects. For example, in the case of a table of figures the target or linked text fragment may be a figure caption. In the case of a table of tables the target or linked text fragment may be a heading or caption of a substantive table. In general, the heading or caption of an object may be above, below, to the side of, or otherwise positioned respective to the corresponding figure, table, or other object.
The third criterion is ordering. The target or linked text fragments of the links 114 should have an ascending ordering corresponding to the ascending ordering of the entries 112. That is, for a set of entries {#i1, #i2, #i3, . . . } having a set of links {(#i1,#j1), (#i2,#j2), (#i3,#j3), . . . } where the set of entries {#i1, #i2, #i3, . . . } have an ascending ordering, it should follow that the ordering of the corresponding set of target fragments {#j1, #j2, #j3, . . . } is also ascending.
The fourth criterion is lack of self-reference. All of the links 114 should initiate from within the organizational table 110, and none of the links 114 should terminate within the organizational table 110. The set of entries {#i1, #i2, #i3, . . . } and the corresponding set of target text fragments {#j1, #j2, #j3, . . . } should have an empty intersection, and moreover none of the target text fragments {#j1, #j2, #j3, . . . } should correspond to a hole text fragment in the organizational table 110.
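The ordering and lack-of-self-reference criteria can be expressed as a short check over one selected link per entry. The sketch below is illustrative only; the function name, the representation of a table as an inclusive index range, and the use of strictly ascending order are assumptions.

```python
def satisfies_ordering_and_no_self_reference(table_start, table_end, links):
    """Check a candidate table spanning fragments table_start..table_end (inclusive).

    links is a list of (entry_index, target_index) pairs, one per entry,
    listed in the order of the entries. Strict ascent is an assumption.
    """
    entries = [i for i, _ in links]
    targets = [j for _, j in links]

    def inside(k):
        return table_start <= k <= table_end

    # Every link must initiate from within the table.
    entries_in_table = all(inside(i) for i in entries)
    # Ordering: ascending entries must map to ascending targets.
    entries_ascend = all(a < b for a, b in zip(entries, entries[1:]))
    targets_ascend = all(a < b for a, b in zip(targets, targets[1:]))
    # No target may fall inside the table, whether on an entry or on a hole.
    no_self_reference = not any(inside(j) for j in targets)
    return entries_in_table and entries_ascend and targets_ascend and no_self_reference
```

Because the self-reference test is made against the whole index range of the table, a target landing on a hole is rejected in the same way as a target landing on an entry.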
With reference to
With reference to
As another example, the regular expression may set forth that the text fragment contain at least one keyword typically indicative of a chapter heading, section heading, or so forth. For example, the keyword may be “part”, “section”, “chapter”, “book”, “Fig.”, “Table”, or so forth, or various combinations thereof. Text fragments which do not satisfy the regular expression because they contain none of the keywords indicative of being a heading are excluded. In some such regular expressions, the location of the keyword may be incorporated into the regular expression. For example, the regular expression may be something such as: “Chapter *” which indicates that the text fragment must begin with the capitalized word “Chapter” followed by a space and any other text (as indicated by the trailing asterisk). In other such regular expressions, the expression may be satisfied if the keyword appears anywhere in the text fragment.
Other regular expressions can be used, alone or in combination. As yet another example, the regular expression may require that the text fragment be in all-caps, so that text fragments containing lower-case letters (or more than one or two lower-case letters, or some other similar pattern) are excluded from further consideration by the textual similarity links identifier 20. While the term “regular expression” is used herein, it is to be appreciated that the comparison with the regular expression may be computationally implemented in various ways, such as using a text search algorithm (for finding a keyword in a text fragment), a finite state network-based automaton (for performing comparisons with simple or complex character string patterns), or so forth. The one or more reduction criteria 28 may also include other criteria such as restrictions on the page position of the linked text fragments, restrictions on the font, font size, font type (e.g., italic, boldface, etc.) or so forth.
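A few of the reduction criteria discussed above can be sketched with ordinary regular expressions. The names below are hypothetical, the case-insensitive keyword match is an assumption, and combining several criteria with a logical OR is only one of the possible ways of using them "in combination".

```python
import re

# Keywords listed in the text; case-insensitive matching is an assumption.
KEYWORD_RE = re.compile(r"\b(?:part|section|chapter|book|fig|table)\b", re.IGNORECASE)
# Regular-expression equivalent of the pattern written as "Chapter *" in the text.
CHAPTER_PREFIX_RE = re.compile(r"^Chapter .*")

def has_heading_keyword(fragment):
    """True if a heading keyword appears anywhere in the fragment."""
    return KEYWORD_RE.search(fragment) is not None

def starts_with_chapter(fragment):
    """True if the fragment begins with 'Chapter' followed by a space."""
    return CHAPTER_PREFIX_RE.match(fragment) is not None

def is_all_caps(fragment):
    """True if the fragment has cased letters and none of them is lower-case."""
    return fragment.isupper()

def reduce_candidates(fragments, criteria):
    """Keep fragments that satisfy at least one of the given criteria."""
    return [f for f in fragments if any(check(f) for check in criteria)]
```

For example, `reduce_candidates(fragments, [has_heading_keyword, is_all_caps])` would exclude a body sentence such as "the quick brown fox" from further consideration while retaining "APPENDIX A" and "See Fig. 2 for details".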
With reference to
To enforce the non-self-referencing constraint, a second pass is suitably performed once the extent of an organizational table is tentatively determined with respect to the ordering constraint. Using a second pass accounts for indeterminacy as to the end of the organizational table, as the end of the organizational table is unknown while it is being extended from its start point. The second pass starts at the original starting text fragment at the top of the organizational table. Each subsequent text fragment is tested. If a subsequent text fragment includes links only to text fragments within the organizational table, then it violates the non-self-referencing criterion; accordingly, the second pass would terminate the organizational table just before that violating text fragment. Again, however, it may be advantageous to allow a certain number of holes. This is suitably achieved in the second pass by allowing one or a few text fragments of the organizational table to be self-referencing. These text fragments that violate the non-self-referencing criterion are assumed to be holes, rather than entries, in the organizational table.
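The second pass can be sketched as a single scan over the tentative table. This is a simplified illustration under stated assumptions: the function name, the `links_by_fragment` mapping from a fragment index to its list of target indices, and the `max_holes` parameter are hypothetical, and the sketch tests targets against the tentative extent only, without re-testing after the table is shortened.

```python
def second_pass(start, end, links_by_fragment, max_holes=1):
    """Trim a tentative table spanning fragments start..end (inclusive).

    A fragment whose links all point inside the tentative table violates the
    non-self-referencing criterion. Up to max_holes such fragments are kept
    as holes; the next violation ends the table just before it.
    Returns (new_end, hole_indices).
    """
    holes = []
    for i in range(start, end + 1):
        targets = links_by_fragment.get(i, [])
        self_referencing = bool(targets) and all(start <= t <= end for t in targets)
        if not self_referencing:
            continue
        if len(holes) < max_holes:
            holes.append(i)          # tolerated: treated as a hole
        else:
            return i - 1, holes      # terminate just before the violator
    return end, holes
```

Raising `max_holes` makes the extraction more tolerant of noisy entries at the cost of admitting longer, less reliable tables.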
This processing is repeated for each of the N possible starting text fragments. The output of the organizational tables selector 30 is a set of one or more organizational tables, each formed of a contiguous list of text fragments corresponding to text entries. Each text entry has one or more candidate linked text fragments.
Because the organizational tables selector 30 constructed each organizational table in a way that ensures that the ordering and non-self-reference constraints can be obeyed (while optionally allowing for a limited number of holes), it follows that a links optimizer 34 can select for each entry of each organizational table one link from its list of acceptable links so that the ordering and non-self-reference constraints are respected. In the case of a document which includes several organizational tables, it is expected that the organizational tables selector 30 will output a plurality of organizational tables. The links optimizer 34 optimizes the links for each organizational table. The selection of the best link for each of the entries of an organizational table involves finding a global optimum for the organizational table while respecting the four table of contents constraints: contiguity, text similarity, ordering, and non-self-referencing. In some embodiments, a weight is associated with each candidate link, which is proportional to its level of matching. In some embodiments, a Viterbi shortest path algorithm is employed in selecting the optimized links. Other algorithms can also be employed for selecting the optimized links. The output of the links optimizer 34 is a set of one or more organizational tables 40, each including a set of substantially contiguous text fragments defining the entries and associated linked text fragments that are expected to correspond with section headings, figure captions, image captions, table headings or captions, or so forth.
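One way to picture the global selection is a small dynamic program in the spirit of a Viterbi search, shown below. It is a sketch rather than the algorithm of the links optimizer 34: the function name and data layout are hypothetical, the weights are assumed to be similarity values where larger is better, entries are allowed to go unlinked (as holes) without limit, and candidate targets inside the table are assumed to have been removed beforehand.

```python
def optimize_links(candidates):
    """Choose at most one target per entry so that the chosen targets ascend.

    candidates[k] is a list of (target_index, weight) pairs for the k-th entry.
    Returns a list with the chosen target index per entry, or None for an
    entry left unlinked, maximizing the total weight.
    """
    # best[last_target] = (total_weight, choices_so_far); -1 means none chosen yet.
    best = {-1: (0.0, [])}
    for entry_candidates in candidates:
        following = {}
        for last, (score, path) in best.items():
            # Option 1: leave this entry unlinked (treat it as a hole).
            options = [(last, score, path + [None])]
            # Option 2: take any candidate located after the last chosen target.
            options += [(t, score + w, path + [t])
                        for t, w in entry_candidates if t > last]
            for key, total, choices in options:
                if key not in following or total > following[key][0]:
                    following[key] = (total, choices)
        best = following
    return max(best.values(), key=lambda pair: pair[0])[1]
```

With candidates `[[(10, 1.0)], [(5, 0.2)], [(20, 1.0)]]`, the middle entry is left unlinked because its only candidate would break the ascending order of the stronger links on either side.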
The foregoing organizational tables extractor employing the textual similarity links identifier 20, the organizational tables selector 30, and the links optimizer 34 is an illustrative example. Other tables extraction algorithms and systems can be employed that output the one or more organizational tables 40 in which each organizational table includes a substantially contiguous sub-set of text fragments 14 of the document 10 identified as entries of the organizational table and associated linked text fragments that are expected to correspond with section headings, figure captions, image captions, table headings or captions, or so forth.
With continuing reference to
In one scoring approach, the organizational tables scorer 42 assigns a score to each organizational table based on a count of occurrences of a keyword or key phrase in entries of that organizational table, or in the linked text fragments associated with the entries of that organizational table, or in both the entries and linked text fragments of that organizational table. This scoring approach leverages the situation for certain organizational tables in which there may be a common keyword or key phrase that is used in most or all of the entries and/or in most or all of the linked text fragments. For example, in a table of tables, it is typically the case that each caption (that is, each linked text fragment) will include the keyword “Table”. This keyword may also be included in each entry of the table of tables. Similarly, each caption (that is, each linked text fragment) of a table of figures will typically include a keyword such as “Fig.” or “Figure”. Such a keyword may also be included in the entries of the table of figures. Yet again, each caption (that is, each linked text fragment) of a table of panels will typically include a keyword such as “Panel”. This keyword may also be included in each entry of the table of panels. In one keyword or key phrase based scoring approach in which scoring is based only on the linked text fragments, a score is computed as a count of the linked text fragments containing the keyword or key phrase divided by a count of the linked text fragments. This score should be close to unity for any organizational table in which the captions include the keyword or key phrase, while it should be substantially less than unity for other organizational tables.
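The keyword-based score computed over linked text fragments only can be sketched in a few lines. The function name is hypothetical, and the case-insensitive match anchored at the start of a word is an assumption made so that, for instance, "Notable" is not counted as containing "Table".

```python
import re

def keyword_score(linked_fragments, keywords):
    """Fraction of linked text fragments containing any of the keywords."""
    if not linked_fragments or not keywords:
        return 0.0
    alternatives = "|".join(re.escape(k) for k in keywords)
    # \b anchors the keyword at the start of a word (an assumption).
    pattern = re.compile(r"\b(?:" + alternatives + ")", re.IGNORECASE)
    hits = sum(1 for fragment in linked_fragments if pattern.search(fragment))
    return hits / len(linked_fragments)
```

Scoring the same organizational table against several keyword lists, for example `["Table"]` and `["Fig.", "Figure"]`, yields one score per candidate table type.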
With reference to
A sum of the distances of each of Caption 1.1, 1.2, and 1.3 from the closest respective image will produce a small value (that is, a low score) for the selected organizational table indicating that the selected organizational table is a table of images. On the other hand, a sum of the distances of each of Caption 1.1, 1.2, and 1.3 from the closest respective substantive table (namely “Table I” for all three captions) will produce a larger value (that is, a higher score), indicating that the selected organizational table is not a table of tables.
In such a proximity-based scoring approach, objects of the selected type, including positional information in the document, are inputs to the organizational tables scorer 42. In some embodiments, this object information may be partly or completely provided as part of the document conversion process performed by the text fragmenter 12. For example, if the text fragmenter 12 performs conversion to XML, images or certain other objects in the document may be tagged by object type. In some such XML conversion processes, a bounding box may be defined for each image, thus also providing position information.
In some embodiments, the positional information on objects in the document is provided by a suitably configured objects detector 48. For example, an images detector component 50 of the objects detector 48 detects images, a substantive tables detector 52 of the objects detector 48 detects substantive tables, and a textboxes detector 54 of the objects detector 48 detects textboxes. The images detector component 50 outputs a list of images 60 with positions, for example denoted as bounding boxes. In some embodiments, the images detector component 50 is configured to distinguish between images and icons, logos, or other specialized graphics which are not likely to be indexed in a table of images, and only images that are not icons, logos, or the like are added to the list of images 60. The substantive tables detector component 52 outputs a list of substantive tables 62 with positions, for example denoted as bounding boxes. The textboxes detector component 54 outputs a list of textboxes 64 with positions, for example denoted as bounding boxes. Any of the lists of objects 60, 62, 64 may be an empty list, if there are no objects of the corresponding object type in the document 10. Specifying the positional information using a bounding box advantageously identifies the extent of the object; however, the positional information can also be provided in another format, such as by providing coordinates of a centroid of the object, coordinates of a single corner of the object, or so forth. Moreover, while image, substantive table, and textbox detector components 50, 52, 54 are illustrated, it will be appreciated that fewer, additional, or other object detector components can be included in the objects detector 48.
Each of the detector components 50, 52, 54 suitably locates image, table, or textbox objects, respectively, by analysis of the original document 10 or by analysis of a converted or partially converted document (such as a shallow XML document) produced by the text fragmenter 12. For example, if the text fragmenter 12 includes an XML converter component that produces a shallow XML file in which objects of a certain object type are labeled, then the corresponding object detector component suitably makes use of that information. On the other hand, if the text fragmenter 12 does not provide such information, then the original document 10 is suitably analyzed in its native format to detect the objects. With the objects of the selected object type known, including positional information, the organizational tables scorer 42 suitably computes a score based on a proximity measure of the linked text fragments respective to the objects in the document.
With reference to
where the coordinates h, w indicate the vertical and horizontal distances, respectively, between the linked text fragment T and the nearest object O on the page, H, W indicate the vertical and horizontal dimensions, respectively, of the page, and Llink is the proximity measure for the linked text fragment T. Note that the proximity measure of Equation (1) ranges between Llink=0 and Llink=1, with Llink=0 corresponding to a largest distance away on the page and Llink=1 corresponding to a zero distance (e.g., an overlap or contacting adjacency) between the linked text fragment T and the nearest object O. The score for a selected organizational table respective to a selected object type is then given by combining the proximity measures of the linked text fragments (given in Equation (1)), for example using a weighted sum:
\[ (\mathrm{Score})_t = \frac{1}{N} \sum_{n=1}^{N} \left( L_{\mathrm{link}} \right)_{n,t} \qquad (2) \]
where N is the number of linked text fragments associated with entries of the organizational table (or, correspondingly, N is the number of entries in the organizational table), the index n={1, . . . ,N} ranges over all of the linked text fragments, t indexes the selected object type, (Llink)n,t denotes the proximity measure Llink for the nth linked text fragment respective to the nearest object of selected object type t, and (Score)t denotes the score for the organizational table respective to the selected object type t. Since Llink ranges between 0 and 1 and Equation (2) is normalized by the (1/N) factor, it follows that (Score)t given in Equation (2) also ranges between 0 and 1, with higher values indicating closer proximity between objects of the selected object type t and the linked text fragments associated with the entries of the organizational table.
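The proximity-based scoring can be sketched as follows. The averaging in `table_score` follows the description of Equation (2) as a sum over the N linked text fragments normalized by 1/N. The per-link formula in `link_proximity`, namely 1 − (h/H + w/W)/2, is only an assumed form chosen because it has the stated properties (1 at zero distance, 0 at the largest distance on the page); it is not presented as Equation (1) itself. The `Box` type, the function names, and the convention that each linked fragment is paired with the objects on its own page are likewise assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Bounding box in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float

def interval_gap(a_lo, a_hi, b_lo, b_hi):
    """Distance between two intervals; zero when they overlap or touch."""
    return max(0.0, b_lo - a_hi, a_lo - b_hi)

def link_proximity(text, objects, page_w, page_h):
    """Assumed proximity measure in [0, 1] for one linked text fragment.
    Returns 0.0 when the page holds no object of the selected type."""
    best = 0.0
    for obj in objects:
        h = interval_gap(text.top, text.bottom, obj.top, obj.bottom)  # vertical
        w = interval_gap(text.left, text.right, obj.left, obj.right)  # horizontal
        best = max(best, 1.0 - 0.5 * (h / page_h + w / page_w))
    return best

def table_score(linked, page_w, page_h):
    """Average proximity over the linked fragments of one organizational table.
    linked is a list of (text_box, objects_on_same_page) pairs."""
    if not linked:
        return 0.0
    total = sum(link_proximity(t, objs, page_w, page_h) for t, objs in linked)
    return total / len(linked)
```

Under these assumptions, a caption sitting 10 units below an image on a page 200 units tall and horizontally aligned with it receives a proximity of 0.975, and a table whose captions all sit that close to images scores near unity for the images object type.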
The positionally- or proximity-based scoring of Equations (1) and (2) is an illustrative example. Other measures of proximity of linked text fragments with respective nearest objects of the selected object type can be employed in positionally- or proximity-based scoring. In some contemplated positionally- or proximity-based scoring approaches, the score is adjusted based on whether there are any intervening elements or objects between the linked text fragment and the closest object of the selected object type. The rationale for such a scoring approach is that it is expected that there will be no intervening elements or objects between, for example, an image and its caption.
In some embodiments of the organizational tables scorer 42, different scoring approaches may be used for different object types. For example, if it is expected that most or all table captions will include the word “Table”, then a keyword-based scoring approach may be appropriate for scoring organizational tables respective to the substantive tables object type. On the other hand, if no keyword or key phrase is expected to be common to most or all image captions, then a positional- or proximity-based scoring approach such as that of Equations (1) and (2) may be more appropriate for scoring organizational tables respective to the images object type.
With reference to
If it is known that there is no more than one organizational table corresponding to each object type (e.g., at most a single table of figures, at most a single table of tables, and so forth), then this information can be incorporated into the labeling process performed by the organizational tables labeler 44. In one approach, the scores of the organizational tables for each object type are ranked from highest to lowest, and the highest-ranked organizational table for each object type is labeled with the corresponding table type. If the highest score is below a selection threshold for a particular object type, then it may be assumed that none of the organizational tables in the document correspond to that object type.
The linked text fragments of a table of contents will generally not be closely associated with objects of any object type. Accordingly, an organizational table that is a table of contents will typically have positionally- or proximity-based scores for the various object types that do not satisfy the selection criterion for any object type. One suitable approach for identifying a table of contents when using positionally- or proximity-based scoring is to assign the table of contents table type to any organizational table that does not satisfy the selection criterion for any object type. In another approach for labeling tables of contents when using positionally- or proximity-based scoring, the selection process is first applied for assigning table types corresponding to object types until all object types have been processed. Any left-over organizational tables (that is, organizational tables that have not been assigned a table type corresponding to any object type) are assigned the table of contents table type by default.
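The labeling logic described above, with at most one organizational table per object type and the table of contents type assigned by default, can be sketched as follows. The function name, the nested-dictionary layout of the scores, the threshold value, and the greedy processing of object types in alphabetical order are assumptions; scores are taken to be of the kind in which a higher value indicates a more likely match.

```python
def label_tables(scores, threshold=0.5):
    """Assign a table type to each organizational table.

    scores[table_id][object_type] is a score in [0, 1]. For each object type,
    the highest-scoring table not yet labeled receives that type, provided its
    score meets the threshold. Left-over tables default to table of contents.
    """
    labels = {}
    object_types = sorted({t for per_table in scores.values() for t in per_table})
    for object_type in object_types:
        ranked = sorted(
            ((per_table[object_type], table_id)
             for table_id, per_table in scores.items()
             if object_type in per_table and table_id not in labels),
            reverse=True,
        )
        if ranked and ranked[0][0] >= threshold:
            labels[ranked[0][1]] = "table of " + object_type
    for table_id in scores:
        labels.setdefault(table_id, "table of contents")
    return labels
```

If a document may contain several tables of the same object type, the ranking step would instead label every table whose score meets the threshold.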
Alternatively or additionally, the organizational tables can be scored respective to the table of contents table type using a keyword- or key phrase-based scoring approach. For example, if the document is known to be organized by chapters, then a keyword-based scoring approach in which a count of the linked text fragments associated with an organizational table that contain the keyword “Chapter” is divided by a count of the total number of linked text fragments associated with the organizational table should provide accurate scoring for the organizational tables respective to the table of contents table type.
With reference to
The disclosed approaches for labeling organizational tables have been applied to PDF documents that contain organizational tables including tables of contents and tables of images. The PDF documents were first converted to XML with a converter that extracted the images and inserted tags indicating bounding boxes for each image on the page. In test runs using five different documents and a positional- or proximity-based scoring approach, the table of images was correctly labeled each time.
The organizational tables scorer 42 and organizational tables labeler 44 should be configured to be sufficiently sensitive to accurately label organizational tables without producing an excessive number of “false positives” in which an organizational table is improperly labeled. For example, in one test run on a document that contained images but no table of images, the method employing a positional- or proximity-based scoring approach nonetheless labeled a table of images. Such false positives can be reduced by optimizing parameters such as the threshold or other selection criterion with respect to a collection of training documents having expected “average” characteristics. In general, making the selection criterion more rigorous (e.g., increasing the threshold for a scoring approach in which a higher score indicates more likely labeling) will reduce false positives. However, if the selection criterion is too rigorous, then the algorithm may fail to properly label existing organizational tables. Incorporation of a scoring component that reduces the score (or otherwise modifies the score away from satisfying the selection criterion) when there is an element or object intervening between the linked text fragment and the nearest object of the object type being scored is also expected to reduce false positives.
The disclosed techniques for labeling organizational tables are expected to be robust against the relatively common situation in which the number of objects of a selected type is different from the number of entries in the corresponding organizational table. Such a situation may arise due to inclusion in the document of additional objects of a particular object type that are not indexed in the corresponding organizational table, or may arise due to spatial overlap of objects, or so forth. Errors in the text fragmentation performed by the text fragmenter 12 can also produce such differences.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. It will also be appreciated that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
The following U.S. patent applications are commonly owned with the present application and are each incorporated herein by reference. Meunier et al., “Rapid Similarity Links Computation For Table of Contents Determination” (Xerox ID 20051557-US-NP, Ser. No. 11/360,951 filed Feb. 23, 2006) is incorporated herein by reference in its entirety. This application relates at least to table of contents extraction with improved robustness. Meunier et al., “Table of Contents Extraction with Improved Robustness” (Xerox ID 20051557-US-NP, Ser. No. 11/360,963 filed Feb. 23, 2006) is incorporated herein by reference in its entirety. This application relates at least to table of contents extraction with improved robustness. Dejean et al., “Structuring Document based on Table of Contents,” (Xerox ID 20040970-US-NP, Ser. No. 11/116,100 filed Apr. 27, 2005) is incorporated herein by reference in its entirety. This application relates at least to organizing a document as a plurality of nodes associated with a table of contents. Dejean et al., “Method and Apparatus for Detecting a Table of Contents and Reference Determination,” Ser. No. 11/032,814 filed Jan. 10, 2005 and published on Jul. 13, 2006 as U.S. Publ. Appl. 2006/0155703 A1 is incorporated herein by reference in its entirety. This application relates at least to a method for identifying a table of contents in a document. An ordered sequence of text fragments is derived from the document. A table of contents is selected as a contiguous sub-sequence of the ordered sequence of text fragments satisfying the criteria: (i) entries defined by text fragments of the table of contents each have a link to a target text fragment having textual similarity with the entry; (ii) no target text fragment lies within the table of contents; and (iii) the target text fragments have an ascending ordering corresponding to an ascending ordering of the entries defining the target text fragments.