Printed publications are usually designed and edited professionally. The trend is to move from print content to a digital format, and provide the digital content online in a document. Some publishers offer publications digitally with use of a portable document format (PDF). PDF has been used as a standard for document exchange. An example is ADOBE® Acrobat, available from Adobe Systems Inc., San Jose, Calif. Existing text segmentation techniques may not perform well for documents in digital format, such as contemporary consumer magazines.
In the following description, like reference numbers are used to identify like elements. Furthermore, the drawings are intended to illustrate major features of exemplary embodiments in a diagrammatic manner. The drawings are not intended to depict every feature of actual embodiments nor relative dimensions of the depicted elements, and are not drawn to scale.
An “image” broadly refers to any type of visually perceptible content that may be rendered on a physical medium (e.g., a display monitor or a print medium). Images may be complete or partial versions of any type of digital or electronic image, including: an image that was captured by an image sensor (e.g., a video camera, a still image camera, or an optical scanner) or a processed (e.g., filtered, reformatted, enhanced or otherwise modified) version of such an image; a computer-generated bitmap or vector graphic image; a textual image (e.g., a bitmap image containing text); and an iconographic image.
A “computer” is any machine, device, or apparatus that processes data according to computer-readable instructions that are stored on a computer-readable medium either temporarily or permanently. A “software application” (also referred to as software, an application, computer software, a computer application, a program, and a computer program) is a set of machine readable instructions that an apparatus, e.g., a computer, can interpret and execute to perform one or more specific tasks. A “data file” is a block of information that durably stores data for use by a software application.
The term “computer-readable medium” refers to any medium capable of storing information that is readable by a machine (e.g., a computer). Storage devices suitable for tangibly embodying these instructions and data include, but are not limited to, all forms of non-volatile computer-readable memory, including, for example, semiconductor memory devices, such as EPROM, EEPROM, and Flash memory devices, magnetic disks such as internal hard disks and removable hard disks, magneto-optical disks, DVD-ROM/RAM, and CD-ROM/RAM.
As used herein, the term “includes” means includes but is not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
Text segmentation can be the first step toward reuse and repurposing of documents, including PDF documents. Existing text segmentation algorithms for PDF documents may not perform well for contemporary consumer magazines.
A system and method herein are applicable to PDF documents that are in true PDF format. As used herein, a PDF document in true PDF format is generated, for example, using a text processor, from a type of text markup, using a form of typesetting, or using a design or editing tool. The PDF documents may be generated using a converter. For example, the PDF documents may be generated using a typesetting system that creates PDF documents, or generates PDF documents using a PDF formatter, from an Extensible Markup Language (XML) file, a Hypertext Markup Language (HTML) file, an HTML file with Cascading Style Sheets (CSS), or a Scalable Vector Graphics (SVG) file. The PDF documents may be generated using an editor. The PDF documents may be generated using a development library. For example, the PDF documents may be generated using a PHP: Hypertext Preprocessor (PHP) library (including GOOGLE® fPDF), a C library, a C++ library derived from Xpdf, or a Python-based PDF creation library. The PDF document may be generated from JavaScript, an HTML file, an Extensible Hypertext Markup Language (XHTML) file, or HTML with CSS. The PDF document may be generated using a PDF creator, such as a desktop publishing application. In an example, the PDF documents include searchable text. In an example, the PDF document is not a scanned document.
Provided herein is a novel system and method for text segmentation of a document. A new local homogeneity measure based on line space is introduced, and a system and method described herein incorporate this measure into a region growing algorithm. Using a fixed set of parameters, a system and method described herein can achieve robust performance on documents, including PDF magazines, with wide-ranging layouts and styles.
Non-limiting examples of a document include portions of a web page, a brochure, a pamphlet, a magazine, and an illustrated book. In an example, the document is in static format. Some document publisher standards address only the issue of reflowing text. Recent document publishers developed to run on portable document viewing devices rely on a significant amount of work by graphics and interaction designers to manually reformat the content and wire the user interactions. Non-limiting examples of portable viewing devices include touch-based devices, including smart phones, slates, and tablets, and other portable document viewing devices.
A system and method are provided for segmenting content from static documents, including digital publications such as magazines in true PDF format.
A PDF document can accurately preserve the visual appearance of electronic documents across application software, hardware, and operating systems, making it a widely used format for document sharing and archiving. However, PDF does not maintain logical structures of document content, such as words, paragraphs, titles, and captions. The lack of structural information can make it difficult to reuse and repurpose the digital content represented by a PDF document. A system and method provided herein for extracting logical structures from PDF documents has many real applications.
In some examples, the document segmentation system 10 outputs the results from operation of document segmentation system 10 by storing them in a data storage device (including, in a database) or rendering them on a display (including, in a user interface generated by a software application). Example displays include the display screen of portable viewing devices, such as touch-based devices, including smart phones, slates, and tablets, and other portable document viewing devices.
Interactions may be made with the computer system 140 (e.g., by entering commands or data) using one or more input devices 150 (e.g., but not limited to, a keyboard, a computer mouse, a microphone, joystick, a touchscreen or a touch pad). Information may be presented through a user interface that is displayed to a user on the display 151 (implemented by, e.g., a display monitor), which is controlled by a display controller 154 (implemented by, e.g., a video graphics card). The display 151 can be a display screen of a portable viewing device. The computer system 140 also typically includes peripheral output devices, such as speakers and a printer. One or more remote computers may be connected to the computer system 140 through a network interface card (NIC) 156.
As shown in
In general, the document segmentation system 10 typically includes one or more discrete data processing components, each of which may be in the form of any one of various commercially available data processing chips. In some implementations, the document segmentation system 10 is embedded in the hardware of the media viewing device. In some implementations, the document segmentation system 10 is embedded in the hardware of any one of a wide variety of digital and analog computer devices, including desktop, workstation, and server computers. In some examples, the document segmentation system 10 executes process instructions (e.g., machine-readable code, such as computer software) in the process of implementing the methods that are described herein. These process instructions, as well as the data generated in the course of their execution, are stored in one or more computer-readable media. Storage devices suitable for tangibly embodying these instructions and data include all forms of non-volatile computer-readable memory, including, for example, semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices, magnetic disks such as internal hard disks and removable hard disks, magneto-optical disks, DVD-ROM/RAM, and CD-ROM/RAM.
The principles set forth herein extend equally to any alternative configuration in which the document segmentation system 10 has access to a set of documents 12. As such, alternative examples within the scope of the principles of the present specification include examples in which the document segmentation system 10 is implemented by the same computer system (including the computing system of a media viewing device), examples in which the functionality of the document segmentation system 10 is implemented by multiple interconnected computers (e.g., a server in a data center, including a data center in a cloud, and a user's client machine, including a portable viewing device), examples in which the document segmentation system 10 communicates with portions of computer system 140 directly through a bus without intermediary network devices, and examples in which the document segmentation system 10 has stored local copies of the set of documents 12 that are to be transformed.
Referring now to
Text segmentation can be a first step taken towards logical structure extraction. Low level text entities can be grouped into line segments and homogeneous blocks. A system and method provided herein target more complex PDF documents than those of simple style and layout. Text line segments need not be grouped based only on whether they have the same font name, point size, and line space. Text line segments need not be required to have homogeneity regarding color to be grouped. Strict conditions on font name, size, and color need not be applied, since they may be valid for some technical documents, but may not apply to contemporary consumer magazines.
A system and method herein provide a novel homogeneity measure based on line space and a bottom-up region growing approach utilizing both the line space and font size measures. A system and method herein can be used to segment text from documents such as those depicted in
The text segmentation described herein facilitates grouping of text into visually homogeneous blocks. A system and method herein facilitate extracting text from image and graphic components using existing PDF libraries. A system and method herein can be applied to text that follows horizontal reading order and is laid out as horizontal lines. In a system and method herein, local consistency need not be assumed between rendering order and reading order.
As depicted in
The operations in block 205 of
A PDF library and application programming interface (API) can be used for rendering and retrieving text attributes. A given document page can be opened and a WordFinder (PDWordFinder) created. Words (PDWord) and quads (ASFixedQuad) can be accessed via the WordFinder. Visual attributes that can be retrieved include font family, font size, color and bounding box.
In the segmentation, a system and method herein may group text characters of the document into units called quads. The quads are not necessarily the same as the words of the document. Words of the document may be identified as being comprised of one or more quads. For example, an upright word may have only one quad for all the text characters that make up the word. An upright hyphenated word may be identified as having two or more quads. If a word is on a curve in a document, it may be identified as having a quad for each character, or it may be identified as having two characters or more per quad.
The operations in block 210 of
In an example, the line-forming process proceeds by picking up a quad that has not been assigned a line identification to start a new line segment. The line segment is extended left and/or right by adding qualified quads to the growing line segment. When no qualified quad can be added to the line segment, a new line segment is started until all quads are assigned a line identification.
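As a non-limiting illustration, the line-forming loop described above can be sketched as follows. The `Quad` fields, the `form_lines` name, and the `can_merge` predicate passed in by the caller are hypothetical names introduced only for this sketch; the qualification test itself is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Quad:
    # Hypothetical minimal quad record: bounding box plus font size.
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # lower y bound of the bounding box
    y1: float  # upper y bound of the bounding box
    font_size: float


def form_lines(quads: List[Quad],
               can_merge: Callable[[Quad, Quad], bool]) -> List[List[Quad]]:
    """Greedily group quads into line segments, each ordered left to right."""
    unassigned = list(quads)
    lines = []
    while unassigned:
        line = [unassigned.pop(0)]  # any unassigned quad starts a new line
        grew = True
        while grew:
            grew = False
            for q in unassigned:
                if q.x0 >= line[-1].x0 and can_merge(line[-1], q):
                    line.append(q)       # extend the line to the right
                elif q.x0 <= line[0].x0 and can_merge(q, line[0]):
                    line.insert(0, q)    # extend the line to the left
                else:
                    continue
                unassigned.remove(q)
                grew = True
                break  # restart the scan against the new line ends
        lines.append(line)
    return lines


# Usage: a toy predicate that merges quads on the same row with a small gap.
same_row_small_gap = lambda a, b: (abs(a.y0 - b.y0) < 1
                                   and max(a.x0, b.x0) - min(a.x1, b.x1) < 5)
demo = form_lines(
    [Quad(12, 20, 0, 10, 10), Quad(0, 10, 0, 10, 10), Quad(100, 110, 0, 10, 10)],
    same_row_small_gap,
)
```

The scan is quadratic in the number of quads, which keeps the sketch short; a practical implementation would likely index quads spatially.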
Criteria that can be applied to judge if two quads can be merged are as follows. An example criterion is the vertical overlap. The vertical overlap between two bounding boxes can be determined to be large enough such that:
$$O(q_i, q_j) > k_o \cdot \min(h_i, h_j)$$
where $O$ is the vertical overlap, $h$ is the height of a quad, and $k_o$ is the overlap threshold for merging words (i.e., their corresponding quads) horizontally. In a non-limiting example, $k_o$ can be set to about 0.4. Another example criterion is the font size. The font size difference between the two quads can be determined to be small enough such that:
$$\Delta(f_i, f_j) < k_{fh}$$
where $f$ is the font size and $k_{fh}$ is a threshold (a maximum relative font size difference for horizontal merge). In a non-limiting example, $k_{fh}$ can be set to about 0.4. Another example criterion is the space. The space between the two quads can be determined to be small enough such that:
$$d_{i,j} < k_{dq} \cdot \min(f_i, f_j)$$
where $d_{i,j}$ is the horizontal distance between two quads, and $k_{dq}$ is the maximum space between horizontal words (i.e., their corresponding quads) to merge. In a non-limiting example, $k_{dq}$ can be set to about 0.6. For text with horizontal reading order, text merging in the horizontal direction can be performed first. Two quads (including two words) can be merged if their horizontal distance is smaller than a threshold value and the criteria described above are met.
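As a non-limiting illustration, the three example criteria (vertical overlap, font size, and space) can be combined into a single merge test. The threshold values are the example values given above. The `Quad` record and the function names are hypothetical, and the form of the relative difference Δ is an assumption made only for this sketch, since its definition is not reproduced in this section.

```python
from dataclasses import dataclass

K_O = 0.4   # vertical-overlap threshold (example value from the description)
K_FH = 0.4  # maximum relative font size difference for horizontal merge
K_DQ = 0.6  # maximum horizontal gap, as a fraction of the smaller font size


@dataclass
class Quad:
    # Hypothetical minimal quad record, with y0 <= y1 and x0 <= x1.
    x0: float
    x1: float
    y0: float
    y1: float
    font_size: float


def relative_difference(a: float, b: float) -> float:
    # ASSUMPTION: stand-in for the relative difference Delta; the
    # description's own definition is not available here.
    return abs(a - b) / max(a, b)


def vertical_overlap(a: Quad, b: Quad) -> float:
    """Length of the shared vertical extent of two bounding boxes."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def horizontal_gap(a: Quad, b: Quad) -> float:
    """Horizontal distance between two bounding boxes (0 if they touch)."""
    return max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))


def can_merge(a: Quad, b: Quad) -> bool:
    """True if all three example criteria hold for quads a and b."""
    overlap_ok = vertical_overlap(a, b) > K_O * min(a.y1 - a.y0, b.y1 - b.y0)
    font_ok = relative_difference(a.font_size, b.font_size) < K_FH
    gap_ok = horizontal_gap(a, b) < K_DQ * min(a.font_size, b.font_size)
    return overlap_ok and font_ok and gap_ok
```

For example, two 10-point quads on nearly the same baseline that are 3 units apart pass all three tests, while the same quads 10 units apart fail the space criterion because the gap exceeds 0.6 × 10 = 6.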
Weighted-averaged font size and vertical center line may be used as the attributes of a line segment. The vertical center line of a line segment provides an indication of the position and extent of the line segment. Taking possible text variations within a line segment into account, these two attributes can be computed using weighted averaging. As a non-limiting example, the attributes of weighted-averaged font size ($f_L$) and vertical center line ($y_L$) can be computed as follows:
$$f_L = \frac{\sum_i w_i f_i}{\sum_i w_i}, \qquad y_L = \frac{\sum_i w_i y_i}{\sum_i w_i}$$
where fi, yi and wi are the font size, the vertical center, and the width of each quad i, respectively. The vertical center (yi) of a quad i is determined based on the dimension and location of the bounding box of the respective quad i. The width of each quad (wi) is used as the weighting factor in the computation.
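As a non-limiting illustration, the width-weighted averaging can be sketched as below, reading it as an ordinary weighted mean in which each quad contributes in proportion to its width. The function name and the tuple layout are hypothetical.

```python
from typing import Iterable, Tuple


def line_attributes(quads: Iterable[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Return (font size, vertical center) of a line segment.

    Each quad is given as (font_size, y_center, width); the width is the
    weighting factor.
    """
    quads = list(quads)
    total_width = sum(w for _, _, w in quads)
    if total_width <= 0:
        raise ValueError("line segment needs quads with positive total width")
    f_line = sum(f * w for f, _, w in quads) / total_width
    y_line = sum(y * w for _, y, w in quads) / total_width
    return f_line, y_line


# Usage: a wide 10-point quad dominates a narrow 20-point quad.
f_line, y_line = line_attributes([(10, 100, 30), (20, 102, 10)])
```

Weighting by width keeps a short run of larger or raised text (for instance a drop cap or a superscript) from shifting the attributes of an otherwise uniform line.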
The operations in block 215 of
A homogeneity measure based on line space can be used to determine the extent (i.e., block boundaries) of a text block by detecting a change in the line space between pairs of line segments in a portion of the document. If a change in line space is encountered, this can indicate that a new text block should be formed. Thus, the extent of the text block can be determined based on identifying a change in line space.
A homogeneity measure based on font size can be used to determine the block boundaries of a text block by detecting a change in the font size between pairs of line segments in a portion of the document. If a change in font size is encountered, this can indicate that a new text block should be formed. Thus, the extent of the text block can be determined based on identifying a change in font size.
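As a non-limiting illustration, detecting a change in line space along a column of line segments can be sketched as below. The relative-difference form and the `threshold` value are assumptions made only for this sketch and are not taken from the description; the same pattern applies to font size by comparing the font sizes of neighboring line segments instead of their spacing.

```python
from typing import List


def relative_difference(a: float, b: float) -> float:
    # ASSUMPTION: illustrative relative difference, not the description's own.
    return abs(a - b) / max(a, b)


def line_space_breaks(y_centers: List[float], threshold: float = 0.2) -> List[int]:
    """Return indices of line segments at which the line space changes.

    y_centers holds the vertical center lines of consecutive line segments,
    strictly increasing along the reading direction. Index i is reported when
    the space before line i differs from the space after it by more than the
    (hypothetical) threshold.
    """
    spaces = [b - a for a, b in zip(y_centers, y_centers[1:])]
    breaks = []
    for i in range(1, len(spaces)):
        if relative_difference(spaces[i - 1], spaces[i]) > threshold:
            breaks.append(i)
    return breaks


# Usage: four evenly spaced lines, a wider gap, then two more lines.
breaks = line_space_breaks([0, 12, 24, 36, 60, 72])
```

In this toy input the wider gap lies between the lines at indices 3 and 4, so both are reported: one ends a run of evenly spaced lines and the other starts the next.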
From a given line segment i, a text block recursively can take in a new line segment j with the following conditions. A first condition is based on a horizontal overlap that provides an indication of how much the horizontal extent of one line segment overlaps with the horizontal extent of another line segment in the vertical direction. Line segments are grouped if the horizontal overlap between the two line segments is taken to be non-zero. As a non-limiting example, two adjacent line segments in different columns may be determined to have zero horizontal overlap. In the illustration of
A system and method herein can be used to detect block boundaries during region growing. In detecting a block boundary, two measures may be applied. A homogeneity measure that can be applied may be based on line space. Where a change of line space alone may indicate a block boundary, a measure of relative difference between the two line spaces can be defined as: Δ(di,j, di,h), which is independent of font size. The relative difference between two line spaces can be computed according to Eq. (1). Line space parameters di,j and di,h are illustrated in
Using the line space homogeneity measure and the font size homogeneity measure, the block boundary as well as the type of boundary can be detected as follows:
where $B_i$ is a flag indicating whether line segment i is a boundary line and, if so, its type, $w_f$ is a weight emphasizing either font size or line space, and $\hat{d}_{i,h}$ and $\hat{d}_{i,j}$ are the normalized forms of line spaces $d_{i,h}$ and $d_{i,j}$: $\hat{d}_{i,h} = d_{i,h}/\max(d_{i,h}, d_{i,j})$, $\hat{d}_{i,j} = d_{i,j}/\max(d_{i,h}, d_{i,j})$. In a non-limiting example, $w_f$ can be set to about 2.0. Boundary type “1” is used to indicate “top-down”, or that line segment i is closer to line segment j than to line segment h. On the other hand, boundary type “−1” is used to indicate “bottom-up”, or that line segment i is closer to line segment h than to line segment j.
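As a non-limiting illustration, the normalization of the two line spaces and the meaning of the two boundary types can be sketched as below. The sketch does not reproduce the boundary detection rule itself, which also weighs font size through the weight wf; it only shows how the type would follow once line segment i has been flagged as a boundary. The function names are hypothetical.

```python
from typing import Tuple


def normalized_line_spaces(d_ih: float, d_ij: float) -> Tuple[float, float]:
    """Scale both line spaces by the larger one, so the larger becomes 1.0."""
    largest = max(d_ih, d_ij)
    return d_ih / largest, d_ij / largest


def boundary_type(d_ih: float, d_ij: float) -> int:
    """Type of an already detected boundary at line segment i.

    Returns 1 ("top-down") when i is closer to j than to h, and -1
    ("bottom-up") when i is closer to h than to j. Equal spaces are mapped
    to -1 here only as an arbitrary choice for this sketch.
    """
    hat_ih, hat_ij = normalized_line_spaces(d_ih, d_ij)
    return 1 if hat_ij < hat_ih else -1


# Usage: line i sits 24 units from h and 12 units from j.
hats = normalized_line_spaces(24, 12)
kind = boundary_type(24, 12)
```

Because both spaces are divided by the same positive value, the normalization changes neither their ratio nor which neighbor is closer; it only removes the dependence on absolute scale.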
Non-limiting examples of boundary detection and the segmentation are shown in
After boundary detection, growing text blocks to facilitate text segmentation can be accomplished using region growing in the vertical direction (both up and down). Two neighboring line segments i and j with non-zero horizontal overlap and no other text between them are evaluated. For example, the line segments h and i in
In the example of
An example method and associated algorithm for performing the segmentation is described. A non-limiting example of an algorithm for performing the segmentation is included in Appendix A.
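Independently of the algorithm of Appendix A, the idea of bottom-up region growing over line segments can be illustrated with a much-simplified sketch. It assumes y increases down the page, grows each block downward only from its topmost line segment, and delegates the homogeneity and boundary tests to a caller-supplied `compatible` predicate. The `Line` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Line:
    # Hypothetical line-segment record.
    x0: float         # left edge
    x1: float         # right edge
    y: float          # vertical center line (assumed to increase downward)
    font_size: float


def horizontal_overlap(a: Line, b: Line) -> float:
    """Length of the shared horizontal extent of two line segments."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def grow_blocks(lines: List[Line],
                compatible: Callable[[Line, Line], bool]) -> List[List[Line]]:
    """Group line segments into text blocks by growing downward."""
    remaining = sorted(lines, key=lambda ln: ln.y)
    blocks = []
    while remaining:
        block = [remaining.pop(0)]  # topmost unassigned line seeds a block
        while True:
            last = block[-1]
            # Nearest line below with non-zero horizontal overlap, so that
            # no other overlapping text lies between the two lines.
            below = [ln for ln in remaining
                     if ln.y > last.y and horizontal_overlap(last, ln) > 0]
            if not below or not compatible(last, below[0]):
                break
            block.append(below[0])
            remaining.remove(below[0])
        blocks.append(block)
    return blocks


# Usage: two columns of body text followed by a larger heading.
page = [Line(0, 100, 0, 10), Line(0, 100, 12, 10), Line(0, 100, 24, 10),
        Line(120, 220, 0, 10), Line(120, 220, 12, 10),
        Line(0, 100, 60, 20)]
same_style = lambda a, b: a.font_size == b.font_size and b.y - a.y <= 15
blocks = grow_blocks(page, same_style)
```

The two columns never join because their line segments have zero horizontal overlap, and the heading stays separate because the toy predicate rejects the change in font size and spacing.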
Examples of the parameters used in the algorithm in Appendix A are listed in Table I.
The threshold kdq can be set low. In an example, to accommodate a document that has narrow column spaces in its pages and deploys lines as column separators, the threshold can be set to about 60% of the font size. A low threshold can cause more text line segments to be fragmented. Even so, the algorithm can achieve satisfactory results on documents with different layout formats and different column spaces.
In an example implementation, precise quantitative evaluation for the segmentation of the document uses ground truth, which can be time-consuming and may involve some user-applied judgments. In another example implementation, content text blocks and captions can be counted and the corresponding segmentation results inspected. In an example, advertisement pages may not be counted. In another example, titles, tables and maps may not be counted. For example, for the example documents of
Provided herein is a systematic method for text segmentation of documents, including PDF documents. A system and method herein provide a novel measure of line space and novel boundary detection based on combined relative differences of font size and line space. In an example, a method that is localized in nature can provide better results as compared to a technique that is associated with a global or top-down algorithm. A system and method herein can be applied to contemporary consumer magazines that contain complex layouts.
Referring now to
Referring now to
The preceding description has been presented only to illustrate and describe embodiments and examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.
Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific examples described herein are offered by way of example only, and the invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.
As an illustration of the wide scope of the systems and methods described herein, the systems and methods described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise.
Number | Date | Country | Kind |
---|---|---|---|
PCT/US2011/046063 | Jul 2011 | US | national |
This application claims benefit of U.S. Provisional Application No. 61/406,780, filed Oct. 26, 2010, U.S. Provisional Application No. 61/513,624, filed Jul. 31, 2011, and International Application No. PCT/US2011/046063, filed Jul. 31, 2011, the disclosures of which are incorporated by reference in their entireties for the disclosed subject matter as though fully set forth herein.
Number | Date | Country | |
---|---|---|---|
61406780 | Oct 2010 | US | |
61513624 | Jul 2011 | US |