The present invention relates generally to systems and methods for republishing digitized documents.
Some embodiments of the invention are described, by way of example, with respect to the following figures:
Digital content creation and conversion is a significant activity in modern times. Not only are existing digital files and documents being created and saved, but new digital information is being created from other non-digital information mediums, including contemporary and historic book and magazine collections fixed in paper and previously stored in libraries, vaults, and the like. As digital copies of books become available they can be used for online viewing, searching, reprinting etc. A common technique for digitizing such documents is to scan them using scanners or digital cameras.
There is, however, a need to ensure that such digitization efforts can be commercially viable. For example, the costs of scanning, cleanup, storage and bandwidth are the key inhibitors to making all books available online. Providing a method for monetizing the viewing of these books would encourage content owners to support the initial investment needed to bring the books online, and would help subsidize the cost of reprinting such books, magazines, and other documents.
One of the most effective and proven methods of monetizing online content has been to embed the advertisement in the content, thereby making it an integral part of the content. Typically this is done manually by the content owner through: carefully constructing the pages such that the ads appear properly embedded and flow with the content; inserting full-page ads between pages of content; and/or placing ads outside of the page content but within the web page. Such manual efforts not only delay creation of derivative works, but are also costly and tend to be more rigid and inflexible with regard to the advertisement displayed with the content.
The present invention addresses and remedies many, if not all, of the problems discussed above. The present invention describes techniques for automatically embedding (i.e. adding) advertisements and other new content with the original content.
One key benefit of the present invention is that it enables new content to be added to the original content without requiring prior knowledge or control of the original content's layout. Thus, given a collection of scanned pages containing variable amounts of content, the present invention automatically determines, from a finished document output size and available new content (e.g. advertisements), which of the finished document's pages can host new content, and where such new content can be placed.
Such automatic embedding of new content also enables greater flexibility during online viewing, or searching, as well as when the original content is reprinted since different sets of new content (e.g. advertisements) can be added to the original content each time. This is an important advantage over traditional ad placement methods.
Details of the present invention are now discussed.
A “document”, which is subsequently digitized, is herein defined to include any medium of expression, including books, magazines, photos, images, video, media, or any other medium capable of being digitized. Note that while the invention will be discussed primarily with reference to a document which is a book, the teachings of the present invention also apply to these other document types.
A page detection module 104, within the system 100, receives the digitized document 102 from a source such as a storage device, a scanner, a digital camera, or other hardware. The digitized document is wholly or partially formatted as an image file. Image files include either pixel or vector (geometric) data that are rasterized to pixels when displayed. Raster formats include: JPEG, TIFF, RAW, PNG, GIF, BMP, PPM, PGM, PBM, XBM, ILBM, WBMP, and PNM. Vector formats include CGM and SVG.
The image format as defined herein, does not in itself (i.e. by its format coding) separately give meaning to different portions of the digitized document 102. For example, the image format would represent any text within the digitized document 102 using a same set of format rules (e.g. perhaps by assigning a gray-scale, brightness, and/or color code to each pixel in the digitized document 102) as any other portion of the digitized document 102, such as a margin region.
The page detection module 104 uses known techniques to distinguish a digitized document page 202 (see
An original content identification module 106 then receives the digitized document page 202 from the page detection module 104 and identifies original content 204 (see
The original content identification module 106 preferably uses known techniques to automatically distinguish the original content 204 from the digitized document page 202 (see
The “Original Content” 204 is herein defined as that portion of the digitized document page 202 which the system 100 has been tuned to select for inclusion in subsequent derivative works. In one embodiment, such content includes typed text, illustrations, and/or photos on the page of a book. In another embodiment, such content includes typed text plus margin notes, perhaps scribbled by a prior reader of the book. Thus what constitutes the original content 204 can vary from digitized document 102 to digitized document 102.
The original content 204 is typically automatically identified as a rectangular region surrounding text, photos, etc. in the digitized document page 202. Those skilled in the art, however, will recognize that the original content 204 could also be of a different shape, depending upon the original content 204 to be used for a later derivative work.
In one embodiment of the present invention, the original content 204 is identified using as much as possible of the information that was originally captured in the digitized document page 202, while that information is still available. Identifying the original content 204 before the background color of the digitized document page 202 is removed and/or the overall image is enhanced enables the content, in some cases, to be detected more effectively. This is in part because some automated methods for detecting the original content 204 use background color information, as well as other information otherwise lost due to image enhancement, to distinguish content from the rest of the digitized document page 202.
Some embodiments of the present invention use an image enhancement module (not shown), which analyses the digitized document page 202 to compute an original background color of the digitized document page 202. Then, the image enhancement module uses this information to remove the original background color from the digitized document page 202.
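The background computation described above can be illustrated with a minimal sketch. The description does not specify how the image enhancement module computes the original background color; the sketch below assumes, purely for illustration, a gray-scale page in which the most frequent pixel value is the background. The function names, the tolerance parameter, and the sample page are all hypothetical.

```python
from collections import Counter


def estimate_background_color(pixels):
    """Return the most frequent gray-scale value, assumed here to be the background."""
    counts = Counter(value for row in pixels for value in row)
    return counts.most_common(1)[0][0]


def remove_background(pixels, background, tolerance=10, white=255):
    """Replace every pixel within `tolerance` of the background with white."""
    return [
        [white if abs(value - background) <= tolerance else value for value in row]
        for row in pixels
    ]


# Hypothetical 3x4 gray-scale page: a yellowed background (about 200) with dark ink.
page = [
    [200, 200, 200, 200],
    [200, 30, 35, 200],
    [200, 200, 205, 200],
]
background = estimate_background_color(page)
cleaned = remove_background(page, background)
```

A practical system would more likely work on color images and use a more robust estimator; the mode of the pixel values is simply the easiest assumption to show.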
A new content space identification module 108 calculates a blank space 304 (see
In the embodiment shown in
Next, the new content space identification module 108 defines a set of new edge margins 402 and a set of new inter-content margins 404 (see
For example, if the new document page 302 target page size is a 6 inch by 9 inch format, the edge margins 402 for the top and bottom could be between 7/10 of an inch and ¾ of an inch. The edge margins 402 for the left and right side could be between ½ of an inch and 6/10 of an inch.
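As a worked illustration of the margin example above, the following sketch computes the region left inside the edge margins of a 6 inch by 9 inch page. The Rect type and the usable_region function are hypothetical names introduced here, and the chosen margin values (½ inch left and right, ¾ inch top and bottom) are just one point within the ranges given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in inches, with the origin at the top-left of the page."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


def usable_region(page_w, page_h, margin_left_right, margin_top_bottom):
    """Return the page area that remains inside the edge margins."""
    return Rect(
        left=margin_left_right,
        top=margin_top_bottom,
        right=page_w - margin_left_right,
        bottom=page_h - margin_top_bottom,
    )


region = usable_region(6.0, 9.0, margin_left_right=0.5, margin_top_bottom=0.75)
```

With these values the region inside the margins is 5 inches wide and 7.5 inches tall.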
In many embodiments of the present invention, the original content 204 will remain in the same place on the new document page 302 as the original content 204 occupied on the digitized document page 202. In other embodiments the predetermined overall layout may require that the original content 204 be moved to a different location on the new document page 302.
Typically, the set of edge margins 402 between the original content 204 and the edges of the new document page 302 will be the same as an original set of edge margins between the original content 204 and the edges of the digitized document page 202. However, the predetermined overall layout will likely specify a different set of inter-content margins 404 instantiated between the original content 204 and what will thereby by default be defined as a set of new content spaces 406. The set of new content spaces 406 (e.g. bounding boxes) are identified by the new content space identification module 108.
To further clarify, the set of new content spaces 406 (see
The new content space identification module 108 preferably identifies most, if not all, of the new content spaces on new document pages 302 throughout a finished document 118. The new content space identification module 108 then characterizes each of the new content spaces 406 by a variety of attributes, including: a location in the digitized document 102; a location in the finished document 118; a location on the digitized document page 202; a location on a finished document page 504 (see
The new content space identification module 108 then stores a list of these new content spaces 406, and their attributes, for each of the digitized documents 102 in a new content space database 110.
Next, a new content addition module 112 searches a new content database 114 for new content 502 which is compatible with one or more of the new content spaces 406 (see
The new content 502 can be of any type, including those identified with respect to the original content 204. These types of new content 502 include: text, images, photos, media, videos, decorations, ornamentation, or any other type of content. In the present embodiment discussed, the new content 502 is a set of advertisements.
All of this new content 502 is stored in the new content database 114 by the new content providers 116. The new content 502 is typically dynamic and will vary over time, as the stock of new content 502 is continually augmented, culled, and modified in a variety of ways by the new content providers 116.
The new content providers 116 preferably have substantial, if not total, control over how the new content 502 is managed by the new content addition module 112. Clearly, by providing new content 502 or not, the new content providers 116 have a basic control over the new content 502; however, more frequently, the new content providers 116 will modify the attributes associated with the new content 502 in some way so as to continually “best position” the new content 502 in the finished document 118.
The attributes associated with the new content 502 include: a payment to be made by the new content providers 116 for placement of the new content 502; a preferred set of locations for the new content 502 within the finished document 118; a preferred set of locations for the new content 502 within each finished document page 504; a minimum and/or maximum total area of the new content space 406 permissible for the new content 502; a scaling range of the new content 502 so that it can best fit in a new content space 406; a permissible and/or required set of geometric shapes for the new content 502; a date, time and/or duration over which the new content 502 item is to be displayed; and a derivative work in which the new content 502 will appear.
The new content addition module 112 preferably closely adheres to these specified attributes for the new content 502 when determining if any one item of new content 502 is compatible with any one or more of the new content spaces 406. Such adherence is strongly preferred since the new content providers 116 will in most, if not all, embodiments of the present invention be paying a fee for their new content 502 to be added to the finished document 118. This fee in turn supports businesses that facilitate the process of digitizing otherwise inaccessible paper documents.
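The compatibility determination described above can be sketched as a simple attribute check. The ContentSpace and NewContent types, their field names, and the is_compatible function below are hypothetical; they model only a subset of the listed attributes (payment, area limits, permissible shapes, and a display period), and the description does not prescribe this particular representation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class ContentSpace:
    """A candidate space on a finished document page (dimensions in inches)."""
    page: int
    width: float
    height: float
    shape: str = "rectangle"

    @property
    def area(self) -> float:
        return self.width * self.height


@dataclass
class NewContent:
    """One item of new content with a subset of its provider-specified attributes."""
    name: str
    payment: float
    min_area: float
    max_area: float
    allowed_shapes: Tuple[str, ...] = ("rectangle",)
    start: Optional[date] = None
    end: Optional[date] = None


def is_compatible(space: ContentSpace, content: NewContent, on_date: date) -> bool:
    """Return True only if every modeled attribute of the content is satisfied."""
    if not (content.min_area <= space.area <= content.max_area):
        return False
    if space.shape not in content.allowed_shapes:
        return False
    if content.start is not None and on_date < content.start:
        return False
    if content.end is not None and on_date > content.end:
        return False
    return True


space = ContentSpace(page=3, width=5.0, height=3.25)
ad = NewContent("ad-A", payment=50.0, min_area=10.0, max_area=20.0,
                start=date(2024, 1, 1), end=date(2024, 3, 31))
```

Treating every attribute as a hard requirement mirrors the stated preference for close adherence; an embodiment could equally treat some attributes, such as preferred locations, as soft preferences.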
The search by the new content addition module 112 of the new content database 114, for new content 502 which is compatible with the new content spaces 406, can be conducted in a variety of ways. In other words, the new content addition module 112 can sort, group, and/or otherwise characterize both the new content spaces 406 in the new content space database 110 and the new content 502 in the new content database 114 in many different ways so as to best select new content 502 for each new content space 406. Such sorting, grouping, and characterizations are preferably based on the respective attributes of the new content spaces 406 and the new content 502. For example, the new content spaces 406 could be sorted from largest to smallest, and the new content 502 could be sorted from a greatest to a least payment to be made by the new content providers 116.
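The sorting example above suggests one possible selection strategy, sketched below as a greedy assignment: spaces are ordered from largest to smallest, content from greatest to least payment, and each item takes the largest remaining space that meets its minimum area. This is only one of the many ways the search could be conducted; the assign_content function, the tuple layouts, and the sample identifiers are hypothetical.

```python
def assign_content(spaces, contents):
    """Greedily pair the highest-paying content with the largest fitting space.

    spaces:   list of (space_id, area) tuples
    contents: list of (name, payment, min_area) tuples
    Returns a dict mapping space_id to the name of the content placed there.
    """
    remaining = sorted(spaces, key=lambda s: s[1], reverse=True)
    ranked = sorted(contents, key=lambda c: c[1], reverse=True)
    placements = {}
    for name, _payment, min_area in ranked:
        for index, (space_id, area) in enumerate(remaining):
            if area >= min_area:
                placements[space_id] = name
                del remaining[index]  # each space hosts at most one item
                break
    return placements


spaces = [("p3-bottom", 6.0), ("p1-bottom", 16.25), ("p2-right", 4.0)]
contents = [("ad-A", 50.0, 10.0), ("ad-B", 80.0, 5.0), ("ad-C", 20.0, 5.0)]
placements = assign_content(spaces, contents)
```

In this sample the highest-paying item takes the only space large enough for the second item, which is then left unplaced. A greedy pass is simple but does not maximize total payment, so an embodiment might prefer a different matching strategy.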
Then, the new content addition module 112 formats and inserts the selected new content 502 (e.g. New Content—A, B, C, and D, see
The following are several examples of how new content 502 can be selected to fill a new content space 406. In these examples, the new content spaces 406 are in a book for printing on demand, the new content providers 116 are advertisers, and the new content 502 is a set of advertisements. However, in other embodiments of the present invention, the advertisers may limit instantiation of their advertisements to only certain derivative works (i.e. finished documents 118) each having their own unique set of attributes (i.e. “content placement rules”). These other derivative works include: web-pages, books, magazines, presentations, circulars, flyers, labels, and other types of finished documents 118.
To begin, the preferred set of locations for advertisements are typically toward either the front or at the very end of a book. Depending upon the advertisement, the preferred set of locations on each page may be between two sets of paragraphs in the book, or to the right or left of a “thin” paragraph that does not span the full page width (e.g.
The advertisers may specify a minimum acceptable total area so that the advertisements will be quite visible to a reader of the book. Other advertisers may set a minimum and maximum area limit, which may or may not be a function of the target size of the finished document 118. In some embodiments, the new content space identification module 108 may purposefully delete from consideration all new content spaces 406 which are smaller than a minimum limit (including those new document pages 302 that have no new content spaces 406) so as to avoid cluttering up the finished document 118.
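The pruning of undersized spaces can be sketched as follows. The prune_small_spaces function, the dictionary layout keyed by page number, and the sample values are hypothetical; the sketch assumes each space is described only by its width and height in inches.

```python
def prune_small_spaces(spaces_by_page, min_area):
    """Drop spaces below `min_area` and omit pages left with no spaces.

    spaces_by_page: dict mapping a page number to a list of (width, height) tuples
    """
    pruned = {}
    for page, spaces in spaces_by_page.items():
        kept = [(w, h) for (w, h) in spaces if w * h >= min_area]
        if kept:
            pruned[page] = kept
    return pruned


candidates = {
    1: [(5.0, 3.25), (1.0, 0.5)],  # one usable space, one sliver
    2: [(0.5, 0.5)],               # only a sliver
    3: [],                         # page with no spaces at all
}
usable = prune_small_spaces(candidates, min_area=2.0)
```

Only page 1 survives here, with its single large space, so the remaining pages would carry no new content.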
In some embodiments, the advertisers may only permit the advertisements to be scaled (i.e. resized) larger, but only by a certain percentage so that the advertisements can best fit in certain new content spaces 406. Some specialty advertisers may prefer a triangular or star shape for their advertisement.
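The enlarge-only scaling constraint can be illustrated with a short sketch. It assumes the advertisement keeps its aspect ratio and may be enlarged by at most a fixed fraction; the fit_scale function and its max_enlarge parameter are hypothetical names, and the 25% cap is an arbitrary sample value.

```python
def fit_scale(ad_w, ad_h, space_w, space_h, max_enlarge=0.25):
    """Return the scale factor to apply to the ad, or None if it cannot fit.

    The ad may only be enlarged (factor >= 1.0), and by no more than
    `max_enlarge` (0.25 means at most 25% larger).
    """
    # Largest uniform scale at which the ad still fits inside the space.
    best = min(space_w / ad_w, space_h / ad_h)
    if best < 1.0:
        return None  # shrinking is not permitted in this sketch
    return min(best, 1.0 + max_enlarge)


scale_exact = fit_scale(4.0, 2.0, 5.0, 3.25)   # limited by the space width
scale_capped = fit_scale(4.0, 2.0, 8.0, 8.0)   # limited by the 25% cap
scale_none = fit_scale(4.0, 2.0, 3.0, 3.0)     # space too narrow
```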
In many embodiments, the advertisers are likely to specify a range of dates, times and durations for which their advertisements will be displayed. A set of the advertiser's ads may even be rotated over a predefined time period, such that the ads are cycled over time for greater variety. Such timing variability has particular applicability when the derivative work is fixed in a web page or cloud document, whereas such variability may be less applicable to a print-on-demand book.
The method 600 begins in step 602, by having the page detection module 104 receive the digitized document 102 from a source. In step 604, the original content identification module 106 identifies original content 204 from within the digitized document 102. Next, in step 606, the new content space identification module 108 calculates a blank space 304 available in a new document page 302, wherein the blank space 304 is herein defined as equal to the new document page 302 area minus the original content 204 area.
In step 608, the new content space identification module 108 defines a set of new edge margins 402 and a set of new inter-content margins 404. Next, in step 610, the set of new content spaces 406 are identified within the new document page 302 by the new content space identification module 108, wherein the set of new content spaces 406 within the new document page 302 are herein defined as a set of areas remaining after the original content 204 area, the set of new edge margins 402 area, and the set of new inter-content margins 404 area have been subtracted from the new document page 302 area.
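The rotation of an advertiser's ads over a predefined time period can be sketched as a date-based lookup. The ad_for_date function, the days_per_ad parameter, and the sample dates are hypothetical; the description does not specify how rotation is scheduled.

```python
from datetime import date


def ad_for_date(ads, start, on_date, days_per_ad=7):
    """Return the ad to display on `on_date`, cycling through `ads` in order.

    Each ad is shown for `days_per_ad` consecutive days, starting at `start`.
    Returns None before the rotation has begun.
    """
    if on_date < start:
        return None
    slot = (on_date - start).days // days_per_ad
    return ads[slot % len(ads)]


rotation = ["ad-A", "ad-B", "ad-C"]
start = date(2024, 1, 1)
first_week = ad_for_date(rotation, start, date(2024, 1, 1))
second_week = ad_for_date(rotation, start, date(2024, 1, 8))
fourth_week = ad_for_date(rotation, start, date(2024, 1, 22))
```

After the third ad has had its week, the cycle wraps around to the first ad again.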
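The two area definitions in steps 606 and 610 can be illustrated with rectangle arithmetic. The sketch below assumes a single rectangular block of original content, uniform edge and inter-content margins, and a decomposition of the remaining area into at most four non-overlapping rectangles (above, below, left, and right of the content). The Rect type, the function names, and the sample dimensions are hypothetical; actual embodiments may handle multiple content blocks and non-rectangular shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in inches, origin at the top-left of the page."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def area(self) -> float:
        return max(self.width, 0.0) * max(self.height, 0.0)


def blank_space_area(page: Rect, content: Rect) -> float:
    """Blank space = page area minus original content area (step 606)."""
    return page.area - content.area


def new_content_spaces(page: Rect, content: Rect, edge_margin: float, inter_margin: float):
    """Return the rectangles left after subtracting content and margins (step 610)."""
    usable = Rect(page.left + edge_margin, page.top + edge_margin,
                  page.right - edge_margin, page.bottom - edge_margin)
    # Content grown by the inter-content margin, clipped to the usable region.
    blocked = Rect(max(usable.left, content.left - inter_margin),
                   max(usable.top, content.top - inter_margin),
                   min(usable.right, content.right + inter_margin),
                   min(usable.bottom, content.bottom + inter_margin))
    candidates = [
        Rect(usable.left, usable.top, usable.right, blocked.top),        # above
        Rect(usable.left, blocked.bottom, usable.right, usable.bottom),  # below
        Rect(usable.left, blocked.top, blocked.left, blocked.bottom),    # left
        Rect(blocked.right, blocked.top, usable.right, blocked.bottom),  # right
    ]
    return [r for r in candidates if r.width > 0 and r.height > 0]


page = Rect(0.0, 0.0, 6.0, 9.0)
content = Rect(0.5, 0.5, 5.5, 5.0)
blank = blank_space_area(page, content)
spaces = new_content_spaces(page, content, edge_margin=0.5, inter_margin=0.25)
```

For this sample page the content fills the full width between the edge margins and stops partway down, so the only new content space is the band below it, separated from the content by the inter-content margin.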
In step 612, the new content space identification module 108 identifies most, if not all, of the new content spaces on new document pages 302 throughout the finished document 118. In step 613, the new content space identification module 108 then characterizes each of the new content spaces 406 by a variety of attributes.
Next in step 614, the new content space identification module 108 then stores a list of these new content spaces 406, and their attributes, for each of the digitized documents 102 in a new content space database 110.
In step 616, the new content addition module 112 searches the new content database 114 for new content 502 which is compatible with one or more of the new content spaces 406. Next, in step 618, the new content providers 116 pay a fee for adding their compatible new content 502 to a finished document 118. In step 620, the new content addition module 112 inserts the selected new content 502 into the new content spaces 406 in the finished document 118.
A set of files refers to any collection of files, such as a directory of files. A “file” can refer to any data object (e.g., a document, a bitmap, an image, an audio clip, a video clip, software source code, software executable code, etc.). A “file” can also refer to a directory (a structure that contains other files).
Instructions of software described above are loaded for execution on a processor. The processor includes microprocessors, microcontrollers, processor modules or subsystems (including one or more microprocessors or microcontrollers), or other control or computing devices. A “processor” can refer to a single component or to plural components.
Data and instructions (of the software) are stored in respective storage devices, which are implemented as one or more computer-readable or computer-usable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs). Note that the instructions of the software discussed above can be provided on one computer-readable or computer-usable storage medium, or alternatively, can be provided on multiple computer-readable or computer-usable storage media distributed in a large system having possibly plural nodes. Such computer-readable or computer-usable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components.
In the foregoing description, numerous details are set forth to provide an understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these details. While the invention has been disclosed with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations thereof. It is intended that the following claims cover such modifications and variations as fall within the true spirit and scope of the invention.
This application relates to co-pending U.S. patent application Ser. No. 12/360,807, entitled “System And Method For Removing Artifacts From A Digitized Document,” filed on Jan. 27, 2009, by Reddy et al. This related application is commonly assigned to Hewlett-Packard Development Co. of Houston, Tex.