The present invention relates, in general, to electronic documents and, more specifically, to the preservation of document constructs in text-readable documents converted from graphic-represented documents.
Computers and electronics have infiltrated most aspects of life in the modern world. Word processors, scanners, and faxes have led to a proliferation of electronic documents that may be shared with multiple different persons in various different locations. Some electronic documents may be in a text-readable format, such as Hypertext Markup Language (HTML), MICROSOFT CORPORATION's WORD™ DOC format, Rich Text Format (RTF), plain text (TXT) format, or the like. These documents are text-readable, such that word searches or text insertions, modifications, and/or deletions may be made directly in the document. Other electronic documents may be in a graphic-represented format, such as ADOBE SYSTEMS INCORPORATED's Portable Document Format (PDF), MACROMEDIA INC's FLASHPAPER™, Tagged Image File Format (TIFF), and the like. Graphic-represented documents may also present text and graphics when displayed. However, the displayed content is represented by text, graphics, patterns, glyphs, or the like, which may be printed and displayed on the monitor. While humans may recognize the displayed text as a particular construct, such as a table or list, the rendering system does not identify such blocks as particular constructs.
Graphic-represented documents have increased in popularity as their mechanisms and formats have become more advanced and more platform neutral. For example, PDF files have become a de facto standard in electronic document publishing. Many electronic documents are now made available on the Internet or other data networks in PDF format because it allows the document to be displayed consistently across many different platforms running a PDF reader and also allows that document to be printed out with the same or similar fidelity of the original document. Moreover, computer-based faxes are typically rendered in TIFF format to be transferred to and from faxing parties, again, because of the consistency and fidelity of the display of the faxed documents on various electronic platforms and the subsequent printing onto hard media. Additionally, FLASHPAPER™ documents may be displayed consistently in MACROMEDIA INC.'s MACROMEDIA FLASH™ player available on most computer platforms.
With the increase in these graphic-represented documents, it sometimes becomes important to be able to convert the graphic-represented document into a text-readable document. For example, a party who receives an electronic fax in TIFF format may desire to convert the TIFF file into an actual text-readable document that he or she may edit in a word processing application. Similarly, if a company is designing an interactive Internet application, such as an on-line help application, it may be desirable to convert electronic support documents, that are in a format such as PDF, into HTML, in order to easily build the Internet application. Such conversions from graphic-represented documents into text-readable documents are typically performed by some kind of Optical Character Recognition (OCR) application. Some PDF documents may include the text in addition to the graphics. However, PDF documents that are created using a scanner typically result in a purely graphical document which would use an OCR function to obtain the underlying represented text.
In OCR, certain algorithms and heuristics may be employed to analyze the graphic illustrating the text character and then make an educated guess at what character is represented. The resulting group of characters is typically saved in a text-readable format. While this process converts the graphic to text, it generally does not interpret the different document constructs of the graphic-represented document. Document constructs may be such elements or styles as tables, lists, columns, and the like. Within the text-readable document, such document constructs are defined with additional style coding. Therefore, text that is simply arranged in a paragraph will be coded or tagged differently from text that is formatted into a document construct, such as a list, table, column, or the like.
Additional conversion applications exist that convert graphic-represented documents to text-readable documents. Some such conversion applications allow a user to physically mark the graphic-represented document to indicate blocks of graphics that represent a particular type of document construct. In order to mark the graphic-represented document, the user would typically draw a bounding box around the specific set of graphics that were the specific document construct and then enter which type of document construct applied to the bounded area. Other specific processes may exist for the user to mark the graphic-represented document, but each such process requires the user to manually inspect the entire document. When the conversion application begins the conversion process, it applies the document construct mark entered by the user to format the converted text according to the particular formatting style or element marked by the user.
While this method provides an accurate way to preserve document constructs in the conversion of a graphic-represented document to a text-readable document, it takes considerable time from a user to go through the entire graphic-represented document to manually mark each separate document construct. Moreover, this method is static, such that any subsequent changes to the document may or may not cause the content to move outside of the annotation, causing a need for the author to re-annotate the entire document.
The present invention is directed to a system and method for automatically analyzing a graphic-represented document to determine various document constructs to preserve in a conversion to a text-readable document that may be freely edited as if the construct was originally native to the resulting application. During the conversion process, the graphic-represented document is rendered in memory as it would be rendered on the visual display or printed. The system establishes a series of horizontal lines either virtually or physically across the document only within the whitespace of the document. Document whitespace is the area of the document that is not covered by graphic-represented text or other graphics. After the document is covered with these horizontal lines, vertical lines are then established either virtually or physically within the document whitespace. As with the horizontal lines, the vertical lines are established, where possible, across the entire document.
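The whitespace lines described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the rendered page is held in memory as a list of equal-purpose text rows in which a space character stands for whitespace and any other character stands for graphic-represented content. The names `blank_runs`, `horizontal_lines`, and `vertical_lines` are hypothetical and do not come from the described embodiments.

```python
from typing import List, Tuple

Page = List[str]  # one string per rendered row; ' ' marks whitespace


def blank_runs(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each run of True flags."""
    runs = []
    start = None
    for i, blank in enumerate(flags):
        if blank and start is None:
            start = i                      # a whitespace band begins here
        elif not blank and start is not None:
            runs.append((start, i))        # the band ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(flags)))   # band reaches the page edge
    return runs


def horizontal_lines(page: Page) -> List[Tuple[int, int]]:
    """Bands of rows that contain only whitespace (varying-width horizontal lines)."""
    return blank_runs([all(ch == " " for ch in row) for row in page])


def vertical_lines(page: Page) -> List[Tuple[int, int]]:
    """Bands of columns that contain only whitespace in every row."""
    width = max((len(row) for row in page), default=0)
    padded = [row.ljust(width) for row in page]  # short rows count as whitespace
    return blank_runs(
        [all(row[x] == " " for row in padded) for x in range(width)]
    )
```

Each returned pair gives the extent of one whitespace band, so a wide band corresponds to a wide line. A band never overlaps content, which matches the requirement that the lines stay within the document whitespace.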
When the horizontal and vertical lines have all been applied to the graphic-represented document, the system analyzes the sections of the document defined by the line intersections. These areas are examined for any indicia of particular document constructs. The process of establishing the horizontal and vertical lines is continued within each of these sections until the resulting sub-sections are small enough that the conversion application may determine that they are no longer of interest with regard to detecting document constructs.
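The repeated subdivision can be sketched as a recursive routine. This is only an illustration under stated assumptions: the content is assumed to be available as bounding boxes, a whitespace line is assumed to exist wherever the gap between boxes is at least `min_gap`, and a section is treated as "small enough" when both its width and height are at most `min_size`. The function names and both thresholds are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def split_on_gaps(boxes: List[Box], axis: int, min_gap: int) -> List[List[Box]]:
    """Group boxes separated along one axis by a gap of at least min_gap.

    axis=1 splits top to bottom (horizontal whitespace lines);
    axis=0 splits left to right (vertical whitespace lines).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]           # furthest extent covered so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])           # whitespace line fits here
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def subdivide(boxes: List[Box], min_gap: int = 2, min_size: int = 3) -> list:
    """Return nested lists of sections; a leaf is a plain list of boxes."""
    if not boxes:
        return []
    width = max(b[2] for b in boxes) - min(b[0] for b in boxes)
    height = max(b[3] for b in boxes) - min(b[1] for b in boxes)
    if width <= min_size and height <= min_size:
        return boxes                       # too small to be of further interest
    for axis in (1, 0):                    # horizontal lines first, then vertical
        groups = split_on_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            return [subdivide(g, min_gap, min_size) for g in groups]
    return boxes                           # no whitespace line divides it further
```

For a page with a full-width header above two columns of text, the sketch first separates the header from the body with a horizontal line and then separates the two columns with a vertical line, yielding a nested structure of sections.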
For example, if the vertical lines traverse the entire length of the document within the available canvas area of the document, this may be indicia of columns. If the area defined by the intersections results in a series of similarly sized boxes across some area of the available canvas area, this may be indicia of a table. Moreover, if the area defined by the intersections results in a first column of boxes that are relatively small, and contain bullet glyphs or numbers which are adjacent to a series of other larger boxes in an adjacent column, this may be indicia of a bulleted or numbered list. Depending on the particular indicia recognized by the system, data that indicates such a particular document construct will be placed with the graphic-represented document. When the graphic-represented document is converted into the text-readable document, the conversion system uses the document construct notation to create the text-readable portion of the document according to the particular document construct. Thus, a column notation will result in the text-readable document being coded for columns. Similarly, a table or list notation will result in the text being tagged or coded as a table or list, respectively. Therefore, the document constructs are preserved in the text-readable document converted from the graphic-represented document without requiring manual notation by the user. These preserved constructs are actually constructed in a manner consistent with the native creation of a similar construct in the format of the host application for the text-readable document. For example, a table converted into a WORD™ document will be a WORD™ table. Similarly, a list converted into an HTML document will be created as an HTML list.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized that such equivalent constructions do not depart from the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.
For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
It should be noted that in various embodiments of the present invention, when purely graphical formatted documents are used, an OCR function could be applied in order to obtain the actual text of the underlying document.
A text-readable document, such as HTML, DOC format, RTF format, and the like, includes additional tagging or coding that identifies any particular block of text, such as the text in box 101, as a table. The displaying application, such as a Web browser for HTML documents, or a word processor for DOC and/or RTF format documents, uses that coding to arrange the text into the related special construct, such as a table. Thus, when a conversion tool configured according to one embodiment of the present invention is used to convert electronic document 10 into a text-readable document, additional descriptive data is automatically added to the document information stream describing electronic document 10 in order to signal the conversion application that block 102 contains a table.
Conversion application 204 establishes varying width horizontal lines within the whitespace of the PDF document rendered in computer memory 206. Vertical lines are then established within the whitespace. The horizontal and vertical lines only cover or correspond to the whitespace and do not cross into the text, graphics, or glyphs. Once the lines are established, conversion application 204 analyzes the portions of PDF document within the intersection of the horizontal and vertical lines. Depending on the pattern or shape of the rectangles defined by the intersecting lines, conversion application 204 determines what type of document constructs, if any, are displayed on the PDF document and places descriptive data indicating which document construct is present. Through converting the PDF document into text, conversion application 204 creates a document information stream that will be used to generate the resulting HTML document. A document information stream is a stream of data and instructions that may typically be used when sending a particular document to a printer, or for rendering on the screen. The information stream instructs how to actually render, display, or create the image of the document.
When used in conjunction with the various embodiments of the present invention, document construct data is added to the document information stream to identify which portions of the PDF document are a paragraph, a table, a list, a column, a graphical annotation, an annotated graphic (which is a graphic containing annotations, as opposed to a graphical annotation), an article, or the like. Thus, when generating the HTML document, document construct tags are placed around the text corresponding to the document construct that was graphically rendered in the PDF document. The HTML document may then be displayed on computer display 207, which will show the corresponding document constructs.
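As a minimal sketch of how construct data in an information stream might drive tag generation, the following assumes a hypothetical stream made of `(construct, payload)` pairs. The stream layout, the function name `render_html`, and the choice of `<h2>` for headings are illustrative assumptions; the actual format of the document information stream is not specified here.

```python
from html import escape
from typing import List, Tuple, Union

# Hypothetical stream entry: a construct name plus its text (or list of items).
Entry = Tuple[str, Union[str, List[str]]]


def render_html(stream: List[Entry]) -> str:
    """Wrap each entry's text in the HTML tags matching its construct."""
    out = []
    for construct, payload in stream:
        if construct == "heading":
            out.append(f"<h2>{escape(payload)}</h2>")
        elif construct == "list":
            items = "".join(f"<li>{escape(item)}</li>" for item in payload)
            out.append(f"<ul>{items}</ul>")
        else:
            # Paragraphs, and any construct this sketch does not know about,
            # fall back to plain paragraph tags.
            out.append(f"<p>{escape(str(payload))}</p>")
    return "\n".join(out)
```

The text is escaped before it is wrapped, so characters such as `&` in the recognized text cannot be mistaken for markup in the generated HTML.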
Once vertical lines 307 and 308 have been established in memory, the conversion application begins analyzing the rectangles that are created by the intersections. Based on the patterns of the various intersections, the conversion application will divide electronic document 10 into a number of discrete divisions or rectangles. The major divisions or rectangles identified by the conversion application for electronic document 10 are rectangles 309-311. Rectangle 309 incorporates the header information of electronic document 10. Rectangle 310 incorporates the body text, and rectangle 311 incorporates the footer information of electronic document 10. The conversion application will continue making passes establishing horizontal and vertical lines within each of the defined rectangles, such as rectangles 309-311, until the size of the division or rectangle becomes small enough that the conversion application can determine it is of no further interest.
As the conversion application analyzes the intersections created in rectangle 310 by vertical line 313, it recognizes the consistent widths of the horizontal lines separating rectangles 317-321 and determines that each of those should be separate rectangles instead of a single rectangle having multiple horizontal divisions. Similarly, the conversion application recognizes both the consistent widths and the pattern of the horizontal lines spanning rectangle 322 and creates rectangle 322 instead of multiple rectangles.
It should be noted that the conversion application creates a data structure of the rectangles created in the line establishment process, associating each rectangle that is contained within another rectangle with its parent rectangle.
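One plausible shape for such a data structure is a tree in which every rectangle keeps a reference to its parent. The sketch below is an assumption for illustration only; the class name `Rect`, its fields, and the labels used in the usage note are hypothetical and are not reference numerals from the figures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare rectangles by identity, not by field values
class Rect:
    label: str
    x0: int
    y0: int
    x1: int
    y1: int
    parent: Optional["Rect"] = field(default=None, repr=False)
    children: List["Rect"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Rect") -> "Rect":
        """Record child as contained in this rectangle and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestors(self) -> List["Rect"]:
        """Enclosing rectangles, from the immediate parent outward."""
        chain = []
        node = self.parent
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain
```

For example, after `body = page.add_child(Rect("body", 0, 20, 100, 120))` and `para = body.add_child(Rect("para", 0, 20, 100, 40))`, calling `para.ancestors()` walks from the body rectangle out to the page rectangle, which lets later analysis consider a rectangle in the context of everything that contains it.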
The conversion application analyzes the intersections and determines that rectangle 315 defines a heading construct, rectangles 317-320 define normal paragraphs, and rectangles 329 and 330 combine to define a two-column table. Rectangle 328 is determined to be another heading of some sort. The conversion application uses a knowledge base to make the determinations of what type of document constructs are defined by the rectangles or divisions created by the multiple passes of horizontal and vertical lines.
For example, computer program logic within the conversion application may determine that repetitive horizontal lines that are as wide as the typical character height may be defining a normal paragraph construct. However, if those lines are bisected by a vertical line with some kind of width, depending on the arrangements of the divisions or rectangles within the document, the conversion application may determine the rectangles to define a multi-column document or a table or a list. Through running of the computer program logic, the conversion application may determine that the dividing vertical lines create a smaller division or rectangle that contains a bullet glyph or number in front of a larger rectangle, such as rectangle 329. The smaller relation of the division or rectangle containing the bullet or number may indicate that the rectangles combine to define a bulleted or numbered list. Thus, the computer program logic of the conversion application analyzes the relationships between graphics/glyphs and text, as well as text formatting to perform its pattern recognition for determining the various document constructs.
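The kind of reasoning described above might be sketched as a small rule set. Everything here is an illustrative assumption rather than the actual program logic: the `Block` type, the set of bullet glyphs, the numbering pattern, and the `narrow_ratio` threshold are all hypothetical. The input is taken to be the side-by-side rectangles produced when vertical lines divide one region.

```python
import re
from dataclasses import dataclass
from typing import List

BULLETS = {"•", "◦", "-", "*"}          # assumed set of bullet glyphs
NUMBERED = re.compile(r"^\d+[.)]$")     # matches markers such as "1." or "2)"


@dataclass
class Block:
    width: int
    lines: List[str]


def is_marker(text: str) -> bool:
    """True if the text is only a bullet glyph or a list number."""
    stripped = text.strip()
    return stripped in BULLETS or bool(NUMBERED.match(stripped))


def classify(blocks: List[Block], narrow_ratio: float = 0.25) -> str:
    """Guess the construct formed by rectangles that sit side by side."""
    if not blocks:
        raise ValueError("at least one block is required")
    if len(blocks) == 1:
        return "paragraph"               # no vertical line divides the text
    first, second = blocks[0], blocks[1]
    if (
        len(blocks) == 2
        and first.width <= narrow_ratio * second.width
        and first.lines
        and all(is_marker(line) for line in first.lines)
    ):
        return "list"                    # small marker column beside larger text
    return "table_or_columns"            # needs further analysis to tell apart
```

The sketch deliberately leaves tables and multi-column text undistinguished, since the description says that decision depends on the arrangement of the divisions within the whole document.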
Once the conversion application has finished analyzing each of the divisions or rectangles defined by the horizontal and vertical lines, construct codes are generated and added to the information stream defining electronic document 10. For example, if the conversion application were converting electronic document 10 into an HTML document, the text of rectangles 317-320 would be converted into HTML by spanning the text with HTML paragraph tags. Furthermore, the text of rectangles 329 and 330 would be converted to HTML by incorporating the appropriate HTML table tags generated by the conversion application. It should be noted that the conversion application would have divided rectangles 329 and 330 further to define the cell contents of the represented table construct. Thus, when generating the HTML table tags, the conversion application is capable of placing the correct table tag corresponding to the appropriate table cell.
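To illustrate how cell subdivisions could map onto table tags, the following sketch assumes each cell is known by the position of its top-left corner and its recognized text, and that cells in the same row share the same `y0`. The tuple layout and the function name `cells_to_html_table` are hypothetical.

```python
from html import escape
from itertools import groupby
from typing import List, Tuple

Cell = Tuple[int, int, str]  # (x0, y0, text) for one table cell rectangle


def cells_to_html_table(cells: List[Cell]) -> str:
    """Emit an HTML table with one <tr> per distinct y0, cells ordered by x0."""
    ordered = sorted(cells, key=lambda c: (c[1], c[0]))  # top to bottom, left to right
    rows = []
    for _, row in groupby(ordered, key=lambda c: c[1]):
        tds = "".join(f"<td>{escape(text)}</td>" for _, _, text in row)
        rows.append(f"<tr>{tds}</tr>")
    return "<table>" + "".join(rows) + "</table>"
```

Sorting by position before grouping means the cells can arrive in any order from the analysis step and still land in the correct row and column of the generated table.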
The software logic of the conversion application operating on electronic document 40 analyzes subdivisions 404 and 405 and determines that, with the inclusion of vertical line 403 separating divisions 404 and 405, the combination of divisions 404 and 405, in which a series of bullet glyphs vertically align with the blocks of text in division 404, defines a pattern that may be interpreted as a bulleted list. As the conversion application continues to convert electronic document 40 into another type of document, such as a DOC format file, it will generate formatting code to apply to the document information stream which defines the graphically represented text and bullets in divisions 404 and 405 as a DOC file bullet list.
Division 401 is further divided by the application of vertical line 406, which, after the software logic of the conversion application analyzes the available whitespace and intersections with vertical line 406, creates subdivision 408. The conversion application then establishes vertical line 407, which further divides subdivision 408, separating bullet division 409 from the remaining text in subdivision 408. Once again, the software logic of the conversion application determines that the size and relationship of bullet division 409 with its bullet glyphs and subdivision 408, along with the location of vertical line 407, defines another bulleted list. Therefore, as the conversion application continues to convert electronic document 40 into another type of document, such as a DOC format file, it will generate formatting code to apply to the document information stream that defines the graphically represented text and bullets in bullet division 409 and subdivision 408 as a DOC file bullet list.
The conversion application, thus, not only uses the spacing of the rectangles or divisions to determine and interpret the various constructs, but also considers what is actually contained in the adjoining areas. It considers the spacing, alignment, adjoining constructs, glyphs, graphics, text, formatting, and the like to identify patterns that may then be compared against a database of known construct patterns.
It should be noted that the various embodiments of the present invention illustrated in
In step 705, at least one of the regions is then analyzed for indicia of a document construct, such as a table, a list, a column, or the like. In step 706, a construct indicator is inserted within data describing the graphic-represented document responsive to the analysis. In step 707, the graphic-represented document is converted into a text-readable document, such as a TXT file, an RTF file, a MSWORD™ DOC file, a WORDPERFECT™ WPD file, an HTML document, an XML document, or the like, using the data describing the graphic-represented document.
It should be noted that in the examples described above with regard to
The program or code segments making up the various embodiments of the present invention can be stored in a computer readable medium or transmitted by a computer data signal embodied in a carrier wave, or a signal modulated by a carrier, over a transmission medium. The “computer readable medium” may include any medium that can store or transfer information. Examples of the computer readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette, a compact disk (CD-ROM), an optical disk, a hard disk, a fiber optic medium, a radio frequency (RF) link, and the like. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, and the like. The code segments may be downloaded via computer networks such as the Internet, Intranet, and the like.
Bus 802 is also coupled to input/output (I/O) adapter card 805, communications adapter card 811, user interface card 808, and display card 809. The I/O adapter card 805 connects storage devices 806, such as one or more of a hard drive, a CD drive, a floppy disk drive, or a tape drive, to computer system 800. The I/O adapter card 805 is also connected to a printer (not shown), which would allow the system to print paper copies of information such as documents, photographs, articles, etcetera. Note that the printer may be a printer (e.g., dot matrix, laser, etcetera), a fax machine, a scanner, or a copier machine. Communications adapter card 811 is adapted to couple the computer system 800 to a network 812, which may be one or more of a telephone network, a local-area network (LAN) and/or a wide-area network (WAN), an Ethernet network, and/or the Internet. User interface card 808 couples user input devices, such as keyboard 813, pointing device 807, etcetera, to the computer system 800. The display card 809 is driven by CPU 801 to control the display on display device 810.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one will readily appreciate from the disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.