A document generally has a structure associated with it which defines a layout for and visual characteristics of content items of the document. The structure can be explicitly defined and recorded using a machine readable language such as a markup language for example. Alternatively the structure may be implicit, or only partially explicit. That is to say, none, or only a portion of the structure of the content of the document is explicitly defined and recorded. The different visual cues present in a document—such as spatial intervals and positions, contrast in font families, sizes and weights—combine to form the document's visual hierarchy. This hierarchy is essential to the reader, allowing scanning and comprehension; in contrast, this information is often ignored by machine processing.
A document may be repurposed: that is, the layout and/or certain characteristics of the content items constituting the document may be altered (in order to provide a different look for example, or to tailor the document to a specific audience or for a particular use). Automatic repurposing is relatively straightforward when the content items of the document have a well-defined and explicitly recorded structure; in such cases automatic repurposing can occur using the recorded structure and the document content without intervention. However, for documents in which there exists no, or only a partial, definition of the structure of the content items from which the document is composed, automatic repurposing is a more difficult task, which has not been addressed. It is not generally possible for a computer-implemented system to correctly and consistently determine the elements which make up the document and the way in which they should be repurposed, nor to maintain a visual significance or hierarchy between original and repurposed elements which correctly reflects the relative importance associated with those elements in the document.
Various features and advantages of the present disclosure will be apparent from the detailed description which follows, taken in conjunction with the accompanying drawings, which together illustrate, by way of example only, features of the present disclosure, and wherein:
a is a schematic representation of the layout of a portion of a first document;
b is a schematic representation of the differences in aspect values between several different fonts;
a is a schematic representation of the alignment of two different characters from the same font family which depicts the difference in alignment between characters;
b is a schematic representation of the difference in the perceived size of two different fonts of the same font size;
c is a schematic representation of the difference in perceived size between two different fonts at different capitalizations;
a is a diagram representing a difference in the width of a stroke for the same font-size (A and B), but different weights: A normal, B bold; and the same width of a stroke B and C;
b is a schematic representation of a linear mapping of properties according to an embodiment between a first document defining a domain and a second document defining a range;
c is a schematic representation of a mapping according to a strict style;
In the first document 101, the content is only implicitly (or at best, partially explicitly) defined such that there is little or no defined machine-readable structure associated with the document which can be used to determine the individual content items from which the document is composed, and their associated characteristics—that is to say, there is no, or only a partial definition associated with the document for mapping particular content into a particular formatting defined by the style for the document, and in particular, there is no (or minimal) computer-readable structure associated with the document; structural information is implicitly conveyed by the deployment of different fonts, colors, graphical objects and spacing for example.
In this regard, the problem of automatically repurposing the content 103 of the first document 101 using a formatting style which is different to the style originally associated with the first document in order to provide a new document with a second style is non-trivial.
According to some embodiments, an implementation for repurposing a first document can be divided into three stages: the extraction and deconstruction of the input document (or documents), the mapping of styles, and the reconstruction of the content with its mapped style. A style for repurposing the first document can be known, or can be determined from a second document.
According to some embodiments, a document repurposing system that can operate on non-structured or partially structured documents proceeds by satisfying, amongst others, the following axioms:
The fact that two blocks of texts have the same visual appearance (properties) generally means that these two blocks are likely to be related: they might be siblings or might have the same level of importance for example. Such a relationship is generally to be preserved unless otherwise stated by a user or required by any partially available structural information for the document.
According to an embodiment, words can be formed from individual characters of the input document(s), then words can be joined into lines and lines into blocks of text using the strategy proposed in H. Chao and J. Fan, Layout and content extraction for pdf documents, In Lecture Notes in Computer Science, Document Analysis Systems VI, pages 213-224, Berlin/Heidelberg, 2004, Springer-Verlag, the contents of which are incorporated herein by reference. The general scheme is shown in
The aspect value of a font as shown in
Referring to
Each Bi is a tuple containing a homogeneous piece of text Ti and its associated properties Pi={fi, si, vi}, where fi, si and vi are the font family, font size and variant (bold, italic, underscore, colour, etc.) respectively. By homogeneous we mean that all characters inside Bi share the same property Pi. We therefore have:
$$B = \bigcup_i B_i \quad \text{(the overall text is the union of partitions)}$$
$$B_i = \{T_i, P_i\}$$
$$T_i \cap T_k = \emptyset \ \text{ if } i \neq k \qquad \text{(Eqn. 1)}$$
Any two adjacent blocks Bi and Bi+1 differ by at least one of the properties, such that for any i, Pi≠Pi+1. However, non-adjacent blocks may have the same properties such that for any i, k, where |i−k|>1, the equality Pi=Pk is possible. An example of such a situation is where all section headings are bold and have the same font family and size.
At step 203, a set of all distinct non-repetitive combinations of properties in the document is extracted. That is to say, if we let P={Pq} be the non-repetitive set of all distinct combinations of properties in a particular document, then for any Pq∈P and Pn∈P with q≠n, Pq≠Pn.
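The partitioning and the extraction of distinct properties described above can be illustrated with a minimal Python sketch. The class names `Props` and `Block`, the function names, and the representation of the input as a list of (character, properties) pairs are assumptions made for illustration only; they are not taken from the embodiment.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class Props:
    """Property tuple P = {f, s, v}: font family, font size, variant (hypothetical type)."""
    family: str
    size: float
    variant: frozenset  # e.g. frozenset({"bold"})


@dataclass
class Block:
    """Homogeneous block B_i = {T_i, P_i} (hypothetical type)."""
    text: str
    props: Props


def partition(chars: List[Tuple[str, Props]]) -> List[Block]:
    """Group consecutive characters that share the same properties.

    Adjacent blocks therefore always differ in at least one property,
    while non-adjacent blocks may share the same properties.
    """
    return [
        Block("".join(c for c, _ in run), props)
        for props, run in groupby(chars, key=lambda cp: cp[1])
    ]


def distinct_properties(blocks: List[Block]) -> List[Props]:
    """Return each property combination once, in order of first appearance."""
    return list(dict.fromkeys(b.props for b in blocks))
```

Because consecutive characters with equal properties are merged, the resulting text pieces do not overlap, in the sense of Eqn. 1.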
At step 205 the perceptual average significance (hierarchical score) of characters is calculated/assigned for every Pq∈P. This score is designed to capture the relative location of the property in the document's visual hierarchy. The hierarchical score of a property is determined by contrasting its characteristics with those of other properties: comparative size and weight of their characters.
At step 207 a set of all unique (non-repetitive) properties {Pq} is extracted. A perceptual significance/hierarchical score is assigned to each Pq: hq=H(Pq), where hq is a visual significance/perceptual hierarchical score assigned to the block(s) of text with property Pq. This value can be initially computed using the actual heights of characters in a particular font family, for a particular font size. Extra correction can be added to accommodate properties such as “bold”, “all caps”, etc. for example. Referring to
In the example of
H(P1)<H(P2)<H(P3)<H(P4)
Accordingly, there are some relationships for the same font family which can be easily determined: for example, the larger the font size, the greater the associated significance/hierarchical score; adding “bold” or “all caps” adds more weight and should result in a higher hierarchical score; “italic” adds less weight than “bold”, etc. However, relationships between different font families might not be so straightforward. As a first approximation, one can take the average font heights and augment this with the corresponding variants (e.g., bold, italic and bright colour). For example, Verdana, 12 pt is visually “larger” than Times New Roman, 13 pt. According to an embodiment, it is sufficient to establish these relations between different font families only once and, for example, a lookup table can be used to store these values in order to speed up the repurposing tool.
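A lookup-table based score of this kind might be sketched as follows. The ratios and variant corrections below are illustrative placeholders chosen only so that the orderings stated above hold; they are not measured values, and the multiplicative form of the correction is an assumption.

```python
from typing import Dict, FrozenSet

# Hypothetical height-to-font-size ratios per font family. In practice these
# would be established once per family and stored in the lookup table.
HEIGHT_RATIO: Dict[str, float] = {
    "Verdana": 0.545,
    "Times New Roman": 0.448,
}

# Hypothetical multiplicative corrections for variants ("italic" adds less
# weight than "bold").
VARIANT_WEIGHT: Dict[str, float] = {
    "bold": 1.20,
    "italic": 1.05,
    "all caps": 1.15,
}


def hierarchical_score(family: str, size_pt: float,
                       variants: FrozenSet[str] = frozenset()) -> float:
    """Return h = H(P) for a property P = {family, size, variants}."""
    score = size_pt * HEIGHT_RATIO[family]
    for v in variants:
        # Unknown variants leave the score unchanged.
        score *= VARIANT_WEIGHT.get(v, 1.0)
    return score
```

With these placeholder values, Verdana at 12 pt scores higher than Times New Roman at 13 pt, and bold scores higher than italic for the same family and size.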
The steps above generate an ordered list of average visual significances/perceptual hierarchical scores for a first document. The domain values are the ordered values of average visual significances/hierarchical scores of properties for the first document. A second document has an associated ordered list of properties defining a range. This list for the second document can be a list which is predetermined and known, or one which has been determined using the method described above, but will generally be a list which comprises at least one property that differs from those in the list for the first document. The ordered list of hierarchical scores for the second document defines a range of values. Different strategies can be deployed to establish a mapping M, which maps the properties of the first document to the content of the second document in a way which preserves the visual hierarchy of the properties associated with the content of the second document.
According to an embodiment, if the numbers of different properties in the range and domain are equal, then direct mapping can be used such that the i-th property in the domain is mapped to the i-th property in the range. If either of the documents comprises an incomplete representation of its style (i.e. it does not contain every permitted property) or if the numbers of properties in the respective domain and range are different, a different approach can be used (for example a linear or piecewise-linear transformation can be applied as the mapping between the domain and the range). For the more generic situation in which the number of properties in the domain does not match the number in the range, a two step approach is implemented according to an embodiment (step 209):
Using a linear transformation provides an intuitive initial mapping that preserves visual order and proportionality, i.e. adheres well to the above stated axioms. The transformation brings both the domain and range values into the same scale, whilst preserving perceptual order and proportionality between the different properties from the domain. An alternative transformation may be used to take into account the number of characters associated with some properties Pq.
According to an embodiment, an additional working hypothesis is that the most frequently occurring property Pq is likely to be attributed to the main body in a document. Under certain circumstances a piecewise-linear transformation can be used to ensure that the most frequently occurring property in the domain is mapped to the most frequently occurring property in the range. Other injective and preferably order-preserving functions can be used. Some values obtained using such mappings (or similar) may not be sufficiently close to the actual values in the range, thus requiring further local (fine-tuning) adjustment.
At step 211 local (fine-tuning) adjustments to the final formatting style are applied. Once the first level mapping M between the domain and range is completed, the final adjustment of the mapped values to the formatting style of the second document is carried out. A secondary adjustment F can take the form fq=F(M(Pq)). This is done to ensure that the final values fq are sufficiently close to the values permitted (tolerated) by the original style of the second document.
According to an embodiment, a transformation proceeds by shifting mq=M(Pq) values to the nearest (either above or below) available values in the range, if such values are available. If the end style is not strict and there are not enough values in range, some extra values can be automatically added or at least suggested. Such suggestions can be approved by human interaction if desirable.
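One possible form of such an adjustment F is sketched below. The `tolerance` parameter, the `strict` flag and the function name are assumptions introduced for illustration: a mapped value is shifted to the nearest permitted value when it is close enough, or whenever the style is strict; otherwise it is kept and reported as a suggested extra value for approval.

```python
from typing import List, Tuple


def snap_to_range(mapped: List[float], allowed: List[float],
                  tolerance: float, strict: bool = True
                  ) -> Tuple[List[float], List[float]]:
    """Return (final values f_q, suggested extra values for the style)."""
    final: List[float] = []
    suggestions: List[float] = []
    for m in mapped:
        # Nearest permitted value, whether above or below m.
        nearest = min(allowed, key=lambda a: abs(a - m))
        if strict or abs(nearest - m) <= tolerance:
            final.append(nearest)
        else:
            # Non-strict style: keep the mapped value and suggest adding it.
            final.append(m)
            suggestions.append(m)
    return final, suggestions
```

In strict mode two mapped values may be shifted to the same permitted value; such collisions need to be resolved separately.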
In some cases user interaction might be required, for example if not enough values are provided and adding values automatically is not permitted (because the original style of the second document is strict and all available properties are clearly stated). In this case, user-guided merges can occur. Machine learning techniques can also be deployed to “remember” past user interactions to be used in the future. This case arises when the document being repurposed has a style which is richer than the final style, and the final style is strict. In such circumstances some different properties inevitably lose their distinction and become identical. For example, the two nearest properties could be selected for merging, but this could be a ‘creative decision’.
Other adjustments can also be made based on:
At step 213 the first document is reflowed using the extracted and assigned properties from the second document. The final mapping should match as accurately as possible the properties provided by the original style of the second document and at the same time adhere to axioms described above. At the same time two blocks having different appearances are indicative of either an unrelated nature between the blocks or different levels of importance, and this relationship is also preserved.
A prime factor in asserting a dominant position for a block of text within a design is the use of contrast in properties: larger, bolder characters will assume the dominant role. The subjective apparent size of text is affected by the perceived height of letters in a block. In turn, perceived height depends on multiple factors: font-family and size of the individual characters within a block and character case for the block as a whole. Historically, font-size did not differ between font-families; it was the descender-to-ascender height of metal-cast letters. For digital fonts, however, font-size blends with other characteristics to create the way letters appear on screen or in print. In digital fonts, characters of different font-families rendered at the same point size often have distinct heights. For example, Arial and Courier New produce characters of different height when using the same font-size. This prevents the use of raw font-size as a proxy for perceived height. Therefore, given that font-size cannot be used directly to calculate perceived height of a block, an alternative approach is used to compute the average height of all characters in a block to produce the perceived height according to an embodiment.
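As a rough illustration of averaging character heights over a block, the following sketch assumes a hypothetical table `glyph_height` giving each character's height as a fraction of the font size for the font in question; the table and the function name are not part of the embodiment.

```python
from typing import Mapping


def perceived_height(text: str, glyph_height: Mapping[str, float],
                     size_pt: float) -> float:
    """Average rendered height (in points) of the measurable characters in a block.

    Characters missing from the table (e.g. spaces) are ignored.
    """
    heights = [glyph_height[c] * size_pt for c in text if c in glyph_height]
    return sum(heights) / len(heights) if heights else 0.0
```

Two blocks set at the same font size but in different font families would be given different tables, and so can yield different perceived heights.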
To evaluate perceptual height we refer back to the foundations of typography. A base-line is an imaginary line which all non-descending characters ‘sit’ upon. To achieve the required optical alignment some characters are drawn touching the baseline, whilst others descend below it. Two characters of the same font type, style and variant can therefore have different heights; for example, the letter O is specifically designed to protrude below the baseline. The perceived body height of lowercase characters aligns optically at the x-height, where x-height is the height of a lowercase x character, as shown in
This means ascenders and occasional capitals in sentence and title cased text do not contribute to the perceived height of the text block. Instead these features add to the block's outer shape to aid legibility, see for example
For any given font-size, fonts having a larger x-height appear larger than those with smaller x-heights, as illustrated in
When cap-height is identical to x-height, a block set in all-capitals appears smaller than a corresponding mixed-case block, as illustrated in
A second factor in asserting a dominant position for a block of text is the use of a different character weight for the block. Font weight can be defined as the ratio between the width of a character's stroke and its height. Most type families consist of four weights: light, regular, medium and bold. For determining hierarchical ordering, the trend is relatively simple: a wider stroke produces more dominant text, for a given font-family and size. Bold text is used in two primary cases, as follows: Firstly, unlike larger font sizes, bold can be used to emphasize elements ‘in-line’: it can be mixed with normal text of the same size without breaking the optical lines described above. The effect of in-line emphasis cannot be achieved using larger fonts.
According to an embodiment, when a bold (or other different font-weight) block is detected adjacent to at least one text block whose style property differs from it by weight, it is assumed that the bold font is being used to create in-line emphasis.
Secondly, font-weight variation is often used in headings. Bold is used to compensate for a relatively small increase in x-height between adjacent headings of different levels or between a heading and its following text. In this case the block is likely to be surrounded by line breaks and neighboring text blocks with distinct properties; this can be used as a heuristic to detect when bold has been used to signify a heading rather than in-line emphasis. These two cases need to be differentiated to ensure a property describing in-line emphasised text is not mapped to an inappropriate property during repurposing, such as a heading style.
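The two cues can be combined in a simple classifier, sketched below. The types `Style` and `Run`, the labels returned, and the precedence given to the line-break test are all assumptions made for illustration; an actual embodiment may weigh the cues differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Style:
    family: str
    size: float
    bold: bool


@dataclass
class Run:
    style: Style
    starts_line: bool  # the block is preceded by a line break
    ends_line: bool    # the block is followed by a line break


def differs_only_by_weight(a: Style, b: Style) -> bool:
    return a.family == b.family and a.size == b.size and a.bold != b.bold


def bold_role(prev: Optional[Run], cur: Run, nxt: Optional[Run]) -> str:
    """Classify a bold block as 'heading' or 'inline emphasis'."""
    if not cur.style.bold:
        return "not bold"
    # A block surrounded by line breaks is treated as a heading.
    if cur.starts_line and cur.ends_line:
        return "heading"
    # Otherwise, a neighbour that differs only by weight suggests in-line use.
    neighbours = [r for r in (prev, nxt) if r is not None]
    if any(differs_only_by_weight(cur.style, r.style) for r in neighbours):
        return "inline emphasis"
    return "undetermined"
```

Blocks reported as "undetermined" could be left to a further heuristic or to the user.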
Accordingly, hierarchical score depends on a text block's x-height and the weight of its characters. In order to determine a correlation between these factors, there are two edge cases that are first considered:
Referring to
Given the cost of calculating the equilibrium point for all required combinations in a given document, it would make sense to look for alternative methods of determining hierarchical scores. In both these contexts, a heuristic approach can be applied which is described below. To see this, the two contexts where extra weight is used are restated:
In body text, use of bold is in-line and therefore below any heading in terms of the visual hierarchy of the document. When used in-line, bold text blocks will share all property characteristics aside from font-weight with surrounding blocks. This means that the two properties, that of ‘body text’ and ‘body text+bold’, can be carried through the transformation together, preserving the relationship between ‘body text’ blocks and ‘body text+bold’ blocks. This is in line with the requirement that properties which are similar in the source document are mapped to properties which are similar to each other in the target document.
In the second case, when bold is used in headings, order of appearance can be used to infer hierarchy; more important headings will appear before lower-level headings. This can be expanded more generally: for the first appearance of two parametrically competing headings of text, in the absence of dominant properties, hierarchical order can be assigned in order of the blocks' appearance. So instead of requiring a combinational method to decide the hierarchical score when considering competing x-heights and weights, we can use the above intuitive methods.
The problem of document repurposing in perceptual space becomes the problem of monotonic mapping between two linearly ordered sets: the property hierarchical scores of two documents. The mapping should be established for every unique property from the first document to be repurposed. No two properties from the first document should map into the same property of the second document (unless in the special case that there is a strict style—i.e. a style which must be adhered to—for the second document which has an insufficient number of properties).
The steps above generated an ordered list of hierarchical scores for the properties of the first document to give a domain. Providing a second ordered list for a second document provides a range. Different strategies can be deployed to establish the mapping M. In the simplest case, if the number of different properties in Range and Domain are equal, then direct mapping can be used: ith property in the Domain to ith property in the Range. However, this mapping cannot be used if either document does not contain a complete representation of its style or the quantity of properties in the Domain and Range differ. For example, if two documents have styles perceptually appearing as headings and body text as follows:
Using a linear transform provides an intuitive initial mapping which adheres well to the repurposing requirements stated earlier by preserving visual order and proportionality. Linear transformation is also likely to overcome the problem described above. The transformation brings the Domain and Range values into the same scale, as shown in
Let the minimal and maximal perceptual values, hm and hM, for each document i (i = 1, 2) be given by:
$$h_m^{(i)} = \min_k H_k^{(i)}$$
$$h_M^{(i)} = \max_k H_k^{(i)}$$
then the simplest transformation f(H) preserving order and proportionality is the linear mapping:
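A minimal sketch of such a mapping is given below, assuming the usual min-max form in which the minimum of the domain is sent to the minimum of the range and the maximum to the maximum. The function name and the list-based interface are illustrative assumptions.

```python
from typing import List


def linear_map(domain: List[float], range_: List[float]) -> List[float]:
    """Map each domain score H onto the scale of the range.

    f(H) = r_min + (H - d_min) * (r_max - r_min) / (d_max - d_min)
    """
    d_min, d_max = min(domain), max(domain)
    r_min, r_max = min(range_), max(range_)
    if d_max == d_min:
        raise ValueError("domain has no spread; a linear scale is undefined")
    scale = (r_max - r_min) / (d_max - d_min)
    return [r_min + (h - d_min) * scale for h in domain]
```

For example, domain scores 4, 6 and 10 mapped onto a range spanning 5 to 20 give 5, 10 and 20: the order is preserved, and the ratio of the gaps (2 to 4) is preserved as 5 to 10.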
The basic model is reliant on the minimum and maximum points being mapped correctly in the transformation between the first and second documents. This means special care needs to be taken to ensure a correct mapping by using structure-based heuristics. For each hierarchical score in the first document, the corresponding hierarchical score from the range provided by the second document is established. However, the newly computed hierarchical metrics might not exactly match the metrics from the style to be applied to the first document. Next, the newly computed hierarchical scores need to be reflected into the actual properties P(2) of the second document.
As a result of the linear mapping, hierarchical scores of both documents are now on the same scale. According to an embodiment, a one-dimensional Nearest Neighbor Search, or similar, can be used to establish the correspondence between each fk and the nearest H(2)j, preserving order whilst aiming to maintain proportionality to a certain degree.
Given transformed scores of the first and second documents, a correspondence can be established using different strategies depending on types of documents. For example:
select the nearest
if there is double booking, shift minimizing total error
Once a mapping has been established for every fk, a check is performed to determine whether two different fk values have both been mapped to the same H(2)j, as for example shown in
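The strategy of selecting the nearest value and then shifting to resolve double booking can be sketched as follows. The use of dynamic programming to find the order-preserving one-to-one assignment with the smallest total error is an assumption made for illustration, as are the function names; both input lists are assumed to be sorted in ascending order.

```python
from typing import List


def nearest_indices(f: List[float], targets: List[float]) -> List[int]:
    """Index of the nearest target value for each transformed score f_k."""
    return [min(range(len(targets)), key=lambda j: abs(targets[j] - x)) for x in f]


def has_double_booking(assignment: List[int]) -> bool:
    """True if two different scores were mapped to the same target."""
    return len(set(assignment)) < len(assignment)


def assign_in_order(f: List[float], targets: List[float]) -> List[int]:
    """Order-preserving one-to-one assignment minimising total |f_k - H_j|."""
    n, m = len(f), len(targets)
    if n > m:
        raise ValueError("more scores than available target values")
    inf = float("inf")
    # cost[i][j]: best total error for the first i scores using the first j targets.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            use = cost[i - 1][j - 1] + abs(f[i - 1] - targets[j - 1])
            skip = cost[i][j - 1]
            cost[i][j] = min(use, skip)
    # Walk back through the table to recover the chosen targets.
    result = [0] * n
    i, j = n, m
    while i > 0:
        if cost[i][j] == cost[i][j - 1]:
            j -= 1  # target j-1 was skipped
        else:
            result[i - 1] = j - 1
            i -= 1
            j -= 1
    return result
```

A caller might first apply `nearest_indices`, and fall back to `assign_in_order` only when `has_double_booking` reports a collision.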
This is the final stage in establishing the mapping between the sets of properties P(1) of a first document and P(2) of a second document. The next step is to update the corresponding properties of the text blocks B(1). Each homogeneous text block B(1)j from the partition is described by the newly assigned property P(2)j from the style to be applied. Due to these modifications the original positions of the text blocks B(1)j can no longer be used and the first document must be reflowed: characters, line breaks and graphic content need repositioning. The Document Description Framework (DDF) platform, as described in J. Lumley, R. Gimson, and O. Rees, A framework for structure, layout & function in documents, In DocEng '05: Proceedings of the 2005 ACM symposium on Document engineering, pages 32-41, New York, N.Y., USA, 2005, the contents of which are incorporated herein by reference, can be used to reflow the document. A flow-based document can be presented as a spatially ordered sequence of text blocks B(1)j periodically interrupted by graphic content, line breaks or both. By spatial ordering we mean that the position of each B(i)j on a page is fully defined by the position of its predecessor B(i)j-1. If there are no graphics or line breaks after B(i)j-1, then B(i)j is placed immediately after B(i)j-1; otherwise the corresponding graphics and line breaks are placed, followed by B(i)j.
The linear model can be further improved by using spatial relationships between different elements and frequency analysis:
After applying these heuristics, it is likely the basic linear model will have been interrupted because the heuristics will have required some points in perceptual space to have been mapped onto heuristically defined corresponding points, rather than those defined by the linear model. For example, according to Heuristic 2, fonts of the main body of the second document are applied to the corresponding part of the first document.
If an element (such as a footer or table) is present in the second document but has no counterpart in the first, then nothing in the first document should be mapped into the corresponding properties of the second document. For example, if the first document contains no footer, the footer formatting of the style to be applied should not be used, so that nothing is created which merely looks like a footer. In this particular case, the lowest point in the range is not used by a linear or piecewise linear model.
Once all fixed points have been established, they serve as the vertices V(n)={P(1)j, P(2)k} of the piecewise linear transformation for the remaining properties. The equation for each linear segment of this mapping is given by Eqn. 2 above where coordinates of fixed points are used instead of end points h(i)m and h(i)M accordingly.
Requiring that the mapping provided by the piecewise linear transformation remain injective implies that the sequence of all heuristically established correspondences must satisfy the following monotonicity condition:
$$\forall n:\quad V^{(n-1)}_x < V^{(n)}_x, \qquad V^{(n-1)}_y < V^{(n)}_y$$
If this condition cannot be met, then any ‘double booked’ properties and ‘reverse order’ fragments are to be verified by the user.
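A piecewise linear mapping through fixed points, together with the monotonicity check, might be sketched as follows. Each vertex is represented here as a pair (score in the first document, score in the second document); this representation, the function names, and the choice to extend the first and last segments to scores lying outside the fixed points are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

Vertex = Tuple[float, float]  # (x: first-document score, y: second-document score)


def is_monotonic(vertices: List[Vertex]) -> bool:
    """Check V(n-1)x < V(n)x and V(n-1)y < V(n)y for all consecutive vertices."""
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(vertices, vertices[1:]))


def piecewise_linear(h: float, vertices: List[Vertex]) -> float:
    """Map score h using the linear segment between its surrounding fixed points."""
    if len(vertices) < 2:
        raise ValueError("at least two fixed points are required")
    if not is_monotonic(vertices):
        raise ValueError("fixed points are not monotonic; refer to the user")
    xs = [v[0] for v in vertices]
    # Segment containing h, clamped to the first or last segment.
    k = min(max(bisect_right(xs, h) - 1, 0), len(vertices) - 2)
    (x0, y0), (x1, y1) = vertices[k], vertices[k + 1]
    return y0 + (h - x0) * (y1 - y0) / (x1 - x0)
```

A middle vertex could, for instance, pin the main-body property of the first document to the main-body property of the second, with the end vertices taken from the minimum and maximum scores.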
An example of a piecewise linear mapping is shown in
Accordingly, it will be appreciated that the system repurposes documents using a set of rules in order to transform content and visual characteristics of an original document onto a repurposed document such that certain criteria are fulfilled and certain aesthetic rules are satisfied. The rules require that certain general characteristics of the original document to be repurposed are preserved so that these characteristics are recognisable in the repurposed document.
It will be appreciated that a document can be repurposed using a plurality of styles, as opposed to a single style. For example, different content items of a document can be respectively repurposed using different styles in order to provide a repurposed document which is a ‘mash-up’—that is to say, a document comprising a plurality of differently (and perhaps visually incompatible) applied styles.
It will also be appreciated that the style used to reflow a document need not be a known entity, but can be determined using another document. For example, and referring to
At 903, the documents are partitioned into homogeneous blocks of text B(i)j. An example of a partitioned document is shown schematically in
At step 909, a mapping is established between the sets of properties for the documents, which mapping is adapted to transform the content of Document1 using the style of Document2. More specifically, an injective mapping is established, if possible, using a linear or piecewise-linear transformation. An injective mapping is appropriate in the case that the number of unique properties from Document1 equals or is less than the number of unique properties extracted from Document2. In the case that the number of properties from Document1 exceeds the number from Document2, then according to an embodiment, some additional properties from the style used for the repurposing can be generated in order to maintain an injective mapping. In the case that the style used for the repurposing is a strict style, in the sense that it is not possible to validly or easily create new properties, properties can be merged. The merging of properties is the least desirable mechanism to ensure a valid mapping.
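The choice between these three cases can be summarised in a short sketch. The strategy labels and function names are hypothetical, and proposing the two closest adjacent scores as merge candidates is only one possible suggestion that a user might accept or override.

```python
from typing import List, Tuple


def mapping_strategy(n_domain: int, n_range: int, strict: bool) -> str:
    """Pick how to obtain a valid mapping from Document1 to Document2 properties."""
    if n_domain <= n_range:
        return "injective"  # enough target properties are available
    # Too few target properties: extend the style if allowed, merge otherwise.
    return "merge" if strict else "extend"


def nearest_pair(scores: List[float]) -> Tuple[int, int]:
    """Indices of the two adjacent scores (ascending list) that are closest together."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    i = min(range(len(scores) - 1), key=lambda k: scores[k + 1] - scores[k])
    return i, i + 1
```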
At step 911 Document1 is repurposed/reflowed according to the extracted style from Document2 subject to the extraction and inclusion of any line breaks, margins, order and graphical objects at step 919. At the point at which the documents are deconstructed, the axioms described above can be evaluated at step 915. Accordingly, fixed points within the document to be repurposed are determined at step 917, and the fixed points are used in the determination of the mapping at step 909.
It is to be understood that the above-referenced arrangements are illustrative of the application of the principles disclosed herein. It will be apparent to those of ordinary skill in the art that numerous modifications can be made without departing from the principles and concepts of this disclosure, as set forth in the claims below.
Number | Date | Country | |
---|---|---|---|
20100199168 A1 | Aug 2010 | US |