Table of contents extraction with improved robustness

Information

  • Patent Application
  • 20070196015
  • Publication Number
    20070196015
  • Date Filed
    February 23, 2006
  • Date Published
    August 23, 2007
Abstract
In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.
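As an illustration of the reduction step, the following is a minimal sketch of one possible reduction criterion, assuming the regular-expression approach recited in claims 2 through 7. The patterns, the function name `reduce_candidates`, and the sample fragments are hypothetical; the application names the kinds of cues (leading numeric, alphabetic, or roman numeral index; keywords such as "part", "section", "chapter", and "book") but does not give concrete expressions.

```python
import re

# Hypothetical patterns, not taken from the application.
# Leading index: "1.", "2.1", "A)", "IV." and similar, followed by text.
LEADING_INDEX = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*[.)]?|[A-Z][.)]|[IVXLCDM]+[.)])\s+\S"
)
# Keyword restricted to the start of the fragment (a selected position).
KEYWORD = re.compile(r"^\s*(?:part|section|chapter|book)\b", re.IGNORECASE)


def reduce_candidates(fragments):
    """Keep only fragments that match at least one heading-like pattern.

    Fragments that match neither pattern are excluded as candidates
    for identification as linked text fragments.
    """
    return [f for f in fragments if LEADING_INDEX.match(f) or KEYWORD.match(f)]


fragments = [
    "Contents",
    "1. Introduction",
    "This paragraph is ordinary body text.",
    "Chapter 2 Methods",
    "IV. Results",
    "the results are shown in section 3",
]
kept = reduce_candidates(fragments)
print(kept)
```

With these assumed patterns, only the three heading-like fragments survive, so any later pairing of table of content entries with linked text fragments has fewer candidates to consider. A real document would likely need patterns tuned to its own heading style.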
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 diagrammatically shows an apparatus for identifying a table of contents.



FIG. 2 shows a similarity matrix for a document consisting of fifteen text fragments.



FIG. 3 diagrammatically shows an identified table of contents.



FIG. 4 shows the similarity matrix of FIG. 2 with a portion blocked out due to range restrictions placed on the table of contents.



FIG. 5 diagrammatically shows a situation in which a substantially contiguous group of table of content entries is erroneously associated by a table of contents extractor with linked text fragments in a copy of the table of contents.



FIG. 6 diagrammatically shows a situation in which a substantially contiguous group of table of content entries is partially correctly associated by a table of contents extractor with linked text fragments that are section headings, and is partially erroneously associated by the table of contents extractor with linked text fragments in a copy of the table of contents.


Claims
  • 1. A method for identifying a table of contents in a document, the method comprising: extracting text fragments from the document; identifying (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries; and during the identifying, reducing a number of text fragments that are candidates for identification as linked text fragments based on at least one reduction criterion.
  • 2. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression, text fragments that one of (i) match, or (ii) do not match, the regular expression being excluded as candidates for identification as linked text fragments.
  • 3. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression setting forth an indexing text fragment portion, text fragments being selectively excluded as candidates for identification as linked text fragments based on the comparing.
  • 4. The method as set forth in claim 3, wherein the initial index-identifying text fragment portion is one of (i) a leading numeric index, (ii) a leading alphabetic index, and (iii) a leading roman numeral index.
  • 5. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression setting forth that the text fragment is capitalized, text fragments being selectively excluded as candidates for identification as linked text fragments based on the comparing.
  • 6. The method as set forth in claim 1, wherein the reducing comprises: comparing text fragments with a regular expression setting forth that the text fragment contain at least one keyword selected from a group of keywords consisting of at least one of: “part”, “section”, “chapter”, and “book”, text fragments being selectively excluded as candidates for identification as linked text fragments based on the comparing.
  • 7. The method as set forth in claim 6, wherein the regular expression further sets forth that the at least one contained keyword be located in a selected position or range of positions within the text fragment.
  • 8. The method as set forth in claim 1, wherein the extracting of text fragments includes associating page positions with the text fragments, and the reducing comprises: limiting the candidates for identification as linked text fragments based on the associated page positions.
  • 9. The method as set forth in claim 8, wherein the associated page positions include at least associated vertical page positions, and the limiting comprises: limiting the candidates for identification as linked text fragments to fragments whose associated vertical page position is within a selected distance from a top of the page.
  • 10. The method as set forth in claim 8, wherein the associated page positions include at least column indices, and the limiting comprises: limiting the candidates for identification as linked text fragments to fragments whose associated column index corresponds with a leftmost column.
  • 11. The method as set forth in claim 1, further comprising: structuring the document based on the identified table of content entries and linked text fragments.
  • 12. A method for identifying a table of contents in a document, the method comprising: extracting text fragments from the document; identifying (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries; and validating the identified table of contents entries and linked text fragments based on at least one validation criterion related to distribution of the linked text fragments.
  • 13. The method as set forth in claim 12, wherein the at least one validation criterion comprises: validate conditional upon a span of the linked text fragments being greater than a validation fraction threshold of the total span of the extracted text fragments.
  • 14. The method as set forth in claim 13, wherein the validation threshold fraction is between about 10% and about 20%.
  • 15. The method as set forth in claim 12, wherein the at least one validation criterion comprises: validate conditional upon there being no group of contiguous linked text fragments numbering greater than a threshold having one-to-one correspondence with a group of contiguous table of content entries.
  • 16. The method as set forth in claim 12, wherein the at least one validation criterion comprises: validate conditional upon no linked text fragments being within a substantial copy of the substantially contiguous group of text fragments identified as table of content entries.
  • 17. The method as set forth in claim 16, wherein the validating further comprises: searching for the substantial copy of the substantially contiguous group of text fragments identified as table of content entries by comparing the table of content entries with a candidate copy using a figure of merit including at least (i) a measure of textual similarity between the table of content entries and text fragments of the candidate copy and (ii) a longest common string contained in both the table of content entries and the candidate copy.
  • 18. The method as set forth in claim 16, wherein the substantial copy is a partial copy of the substantially contiguous group of text fragments identified as table of content entries.
  • 19. The method as set forth in claim 16, further comprising: conditional upon the validating finding one or more linked text fragments within a substantial copy of the substantially contiguous group of text fragments identified as table of content entries, (i) removing linked text fragments located within the substantial copy from the group of text fragments identified as linked text fragments and (ii) updating the identifying with text fragments within the substantial copy excluded as candidates for identification as linked text fragments.
  • 20. The method as set forth in claim 12, further comprising: structuring the document based on the identified table of content entries and linked text fragments.
  • 21. An apparatus for identifying a table of contents in a document, the apparatus comprising: a text fragmenter that extracts text fragments from the document; a table of contents region identifier that identifies a contiguous sub-set of the text fragments as a table of contents region; and a table of content extractor that identifies (i) a substantially contiguous group of text fragments within the table of contents region as table of content entries, and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries.
  • 22. The apparatus as set forth in claim 21, wherein the table of contents extractor selectively excludes text fragments as candidates for identification as linked text fragments based on comparison with a regular expression.
  • 23. The apparatus as set forth in claim 21, wherein the table of contents extractor selectively excludes text fragments as candidates for identification as table of content entries based on comparison with a regular expression.
  • 24. The apparatus as set forth in claim 21, wherein the text fragmenter associates page positions with the extracted text fragments, and the table of contents extractor limits the candidates for identification as linked text fragments based on the associated page positions.
  • 25. The apparatus as set forth in claim 24, wherein the associated page positions include at least one of (i) vertical page positions and (ii) column indices.
  • 26. The apparatus as set forth in claim 21, wherein the table of contents region identifier comprises: a user interface configured to receive a user identification of the table of contents region.
  • 27. The apparatus as set forth in claim 21, further comprising: a document organizer that structures the document based on the identified table of content entries and linked text fragments.
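
The distribution-based validation criteria of claims 13 through 15 can be illustrated with a minimal sketch. It assumes each link is a pair of fragment indices (table of content entry, linked text fragment) in document order. The function names, the 15% span threshold (chosen from within the 10% to 20% range of claim 14), and the run threshold of 3 are hypothetical choices and are not specified by the claims.

```python
def span_fraction(linked_positions, total_fragments):
    """Fraction of the document spanned from the first to the last linked fragment."""
    if not linked_positions or total_fragments <= 0:
        return 0.0
    return (max(linked_positions) - min(linked_positions) + 1) / total_fragments


def longest_parallel_run(links):
    """Length of the longest run in which consecutive entries map to consecutive fragments.

    `links` is a list of (entry_index, fragment_index) pairs sorted by entry_index.
    A long run suggests the entries were linked to a copy of the table of contents.
    """
    best = run = 1 if links else 0
    for (e0, f0), (e1, f1) in zip(links, links[1:]):
        run = run + 1 if (e1 == e0 + 1 and f1 == f0 + 1) else 1
        best = max(best, run)
    return best


def validate(links, total_fragments, min_span=0.15, max_run=3):
    """Accept only if links are spread out and no long contiguous one-to-one run exists."""
    positions = [fragment for _, fragment in links]
    return (span_fraction(positions, total_fragments) > min_span
            and longest_parallel_run(links) <= max_run)


# Links to section headings scattered through a 200-fragment document.
good_links = [(2, 20), (3, 55), (4, 90), (5, 140)]
# Links that land on a contiguous copy of the table of contents.
copy_links = [(2, 180), (3, 181), (4, 182), (5, 183)]

print(validate(good_links, 200))  # True
print(validate(copy_links, 200))  # False
```

In the second case the linked fragments span only 2% of the document and form a contiguous run of four, so both assumed criteria reject the result, which corresponds to the erroneous association with a copy of the table of contents shown in FIG. 5.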