The subject application relates to searchable electronic documents, and more particularly to reducing the file size of searchable electronic documents while improving the ability to identify a searched term.
When scanning or otherwise generating searchable electronic documents, information can be stored in a variety of file formats, such as portable document format (PDF), extensible markup language paper specification (XPS) format, or the like. Some versions of electronic documents are searchable, such that a user is permitted to enter a term, and a software application searches the document and identifies any instances of the term to the user. However, conventional searchable electronic document systems and methods require embedding one or more relatively large font definition files into the electronic document to enable searching. When the purpose of storing the document in image form, as a PDF or XPS document, is to reduce file size, embedding a large font definition file runs contrary to that purpose.
Accordingly, there is an unmet need for systems and/or methods that facilitate overcoming the aforementioned deficiencies.
In accordance with various aspects described herein, systems and methods are described that facilitate minimizing additional font information embedded into a searchable electronic document image using a glyphless font technique. For example, a method of highlighting a searched term in an electronic document image comprises receiving a search query for a term in the document image, reading glyphless font size information embedded in the document image, and determining at least one dimension for a highlight block for the search term from the glyphless font size information. The method further comprises identifying the search term in the document image, and overlaying the highlight block on the identified search term.
According to another feature described herein, a system that facilitates highlighting search terms in an electronic document image comprises a scanner that scans a document and embeds glyphless font size information into an electronic image of the document, a searcher with an optical character recognition (OCR) component that identifies a search term in the document image in response to a user query, and a processor that generates a highlight block of a variable width and overlays the highlight block on identified search terms in the document image.
In accordance with various features described herein, systems and methods are described that facilitate mitigating searchable electronic document size increases associated with embedded font definition files by embedding only font size information. For example, scanned document size, when stored in PDF or XPS format, increases when an optical character recognition (OCR) technique is employed to search and/or identify terms in the document. Typically, all fonts referenced or used in the document are stored with the document to facilitate such searches, which contributes substantially to document size. For instance, each embedded font definition file can add hundreds or thousands of kilobytes to the document size. Such size increases are undesirable when considering that the font definition file is so large compared to the compressed document image file size. Accordingly, systems and methods are described herein that facilitate embedding only font size information, using a “glyphless” font.
With reference to
The dummy font is an empty font that contains only font size information but does not render text. Rather, the dummy font only renders the space that the text would occupy, in a different color than the background image, without rendering the text characters themselves. Accordingly, the systems and methods described herein have application for document images in any language, and can be especially useful, for example, in PDF or XPS document image searching when such documents are in languages not supported by PDF or XPS.
The system 10 comprises a scanner 12 that receives a document and generates a scanned document image (e.g., a PDF image, an XPS image, or some other scanned image type). In one example, the scanner 12 is a physical scanner that receives a physical document and generates an image thereof. In other examples, the scanner 12 is a software-based scanner, such as a conversion program that converts an electronic document from a document generation application format (e.g., a word-processing application, a graphical drawing application, or the like) into an electronic image document. In either case, the scanner 12 is coupled to a document memory 14, which stores a document image 16 containing glyphless, or dummy, font information 18 describing the font(s) used in the scanned document.
The system 10 further comprises a user interface 20 such as a computing device (e.g., a personal computer, a laptop or tablet PC, a PDA, a cell phone or the like), which displays information to a user and into which the user may enter information. According to an example, the user views the document image(s), representing pages of the physical document scanned into the scanner 12, on the user interface. The user enters a search query for a given word into the user interface 20, which is received by a searcher 22 that executes one or more algorithms to search the document image 16 for the queried term. For instance, the searcher 22 can include an OCR component 24 that recognizes text words in the document that match the queried term and returns results to the user interface 20 for presentation to the user. The identified query results from the OCR component 24 can then be efficiently highlighted by the searcher 22 for the user using the dummy font information to determine an appropriately sized highlight block to overlay on the compressed text image representing the identified query result. In this example, the OCR component 24 analyzes beginnings and ends of characters in a bitmap. The embedded glyphless font includes size and scale for each font used in the document image, and the OCR component reads the font size information for scaling for the searcher 22. Glyph information is not needed because the searched term is not being rendered, but rather is being overlaid with a highlight block of appropriate size.
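The scaling step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the glyphless font stores character widths in thousandths of an em (a convention common in PDF font metrics) and that the font size is given in points; the function name `scaled_width` and the parameter `units_per_em` are hypothetical and do not come from the described system.

```python
def scaled_width(width_units, font_size, units_per_em=1000):
    """Convert a stored character width into page units.

    width_units  -- width as stored in the (hypothetical) glyphless font
    font_size    -- size at which the text appears in the document image
    units_per_em -- assumed resolution of the stored widths
    """
    return width_units * font_size / units_per_em


# A character stored as 500 units wide, set at 12 pt, occupies 6 pt.
print(scaled_width(500, 12))
```

Only the size and scale values are needed here; no glyph outlines are consulted, which is consistent with the searched term being overlaid rather than rendered.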
In another example, the glyphless dummy font information comprises information describing character width and height for bolded characters in addition to non-bolded characters. According to still other features, the dummy font information includes rotation information that describes an angle or degree of slant for italicized text characters. In this manner, the system 10 facilitates permitting a user to search the scanned document image for a word or other text, and to receive highlighted search results without requiring large font definition files to be embedded in the document image.
The searcher additionally comprises one or more computer-executable algorithms 32 for highlighting query results for presentation to the user via the user interface 20. For instance, a processor 34 can execute an algorithm for identifying a width of a highlighting block that is overlaid on the searched term to identify an occurrence of the searched term as a result to the user. In this example, the algorithm can involve identifying a width and/or height for each character in the queried term, and can sum the widths to determine a length for the highlight block. According to one aspect, the height of the tallest character in the queried term is used as the height of the highlight block. According to another aspect, the height of the highlight block is determined as the difference between the highest point of a tallest character in the queried term and the lowest point of any character in the queried term, wherein the lowest point of a character may be below a baseline upon which the search term rests. For example, if the term “glyphless” were queried, then the height of the block could be defined as the distance between the top of a tallest character (e.g., an “l” or “h”) and the bottom of a sub-baseline character (e.g., a “g,” “y,” or “p”). According to yet another aspect, the algorithm employs a height for each individual character to form-fit the highlight block to the query result term.
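As a minimal sketch of the width and height determination just described, the following sums per-character widths and measures height from the top of the tallest character to the bottom of the lowest sub-baseline character. The `CharMetrics` structure, the `METRICS` table, and all numeric values are hypothetical, chosen only to illustrate the "glyphless" example; they are not taken from any actual font.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharMetrics:
    width: int    # horizontal space the character occupies
    ascent: int   # extent above the baseline
    descent: int  # extent below the baseline (0 if it rests on the baseline)


# Hypothetical metrics in arbitrary units, for illustration only.
METRICS = {
    "g": CharMetrics(6, 5, 2),
    "l": CharMetrics(3, 8, 0),
    "y": CharMetrics(6, 5, 2),
    "p": CharMetrics(6, 5, 2),
    "h": CharMetrics(6, 8, 0),
    "e": CharMetrics(6, 5, 0),
    "s": CharMetrics(5, 5, 0),
}


def highlight_block_size(term, metrics):
    """Return (width, height) of a rectangular highlight block for term."""
    if not term:
        raise ValueError("term must contain at least one character")
    chars = [metrics[c] for c in term]
    width = sum(m.width for m in chars)
    # Height spans the tallest ascent down to the deepest descent.
    height = max(m.ascent for m in chars) + max(m.descent for m in chars)
    return width, height


print(highlight_block_size("glyphless", METRICS))
```

With these assumed values the block for "glyphless" is 46 units wide and 10 units tall: the "l" and "h" set the top edge, and the "g", "y", and "p" set the bottom edge.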
In other examples, algorithms are provided that adjust the highlight block according to detected conditions, such as the format (e.g., bold, italics, etc.) of a search result. For instance, a bolded word will have a greater width than unformatted text, and the highlight block can be adjusted accordingly to fit over the bolded search result. Similarly, italicized search results will be slanted slightly, and a rotation angle or slant can be applied to the highlight block to alter it from a substantially rectangular shape to a parallelogram or the like, in order to form-fit the highlight block over the queried search result. In this manner, text can be highlighted in the image document by overlaying the highlight block accurately over the compressed image of the searched term without rendering the text and without overlapping characters that are not part of the searched term.
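One way the bold and italic adjustments could be realized is sketched below. The sketch assumes a coordinate system in which y increases upward, models bold text as a simple multiplicative width factor, and models italic slant as a horizontal shear of the top edge; the function `highlight_polygon` and its parameters `bold_scale` and `slant_degrees` are hypothetical names used only for illustration.

```python
import math


def highlight_polygon(x, y, width, height, bold_scale=1.0, slant_degrees=0.0):
    """Return the four corners of a highlight block.

    Corners are ordered bottom-left, bottom-right, top-right, top-left.
    A non-zero slant shifts the top edge sideways, turning the
    rectangle into a parallelogram.
    """
    w = width * bold_scale
    shift = height * math.tan(math.radians(slant_degrees))
    return [
        (x, y),
        (x + w, y),
        (x + w + shift, y + height),
        (x + shift, y + height),
    ]


# Unformatted text: a plain rectangle.
print(highlight_polygon(0, 0, 10, 4))
# Bold italic text: a wider block with its top edge shifted to the right.
print(highlight_polygon(0, 0, 10, 4, bold_scale=1.1, slant_degrees=12))
```

Shearing only the top edge keeps the bottom edge of the block aligned with the line of text, so the block does not drift onto neighboring characters at the start of the term.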
The searcher additionally comprises a memory 36 that stores user query information, query results, and any other information suitable or related to performing the various functions described herein. For instance, the memory 36 can temporarily store glyphless font information 18 read from the document image 16 being searched at a given time. Such font information can then be erased or overwritten when the document image file is no longer open.
At 76, dimensions for a highlight block that will fit the search term are determined. According to one example, a width dimension is determined for the queried term, such as by adding the widths of individual characters in the term. Space between characters in the queried term may be accounted for as well, and may be added to the aggregate width of the characters in the term. Additionally, a small buffer width may be added to the determined width dimension, so that the highlight block extends slightly beyond the search term when overlaid thereon. A height dimension for the highlight block can also be determined at 76, and may be predefined for a given font size, or may be determined as a function of a tallest character in the queried term. In the latter case, the entire highlight block can be generated at or near the height of the tallest character in the queried term, forming a substantially rectangular highlight block. In another example, the highlight block can be form-fitted to the queried term.
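The width determination at 76 can be sketched as follows, under the assumption that inter-character spacing is uniform and that the buffer is applied equally on both sides of the term. The function name `block_width` and its parameters are hypothetical.

```python
def block_width(char_widths, char_spacing=0.0, buffer=0.0):
    """Aggregate width of a highlight block.

    char_widths  -- widths of the individual characters in the term
    char_spacing -- assumed uniform gap between adjacent characters
    buffer       -- extra width added on each side of the term
    """
    if not char_widths:
        return 0.0
    gaps = len(char_widths) - 1
    return float(sum(char_widths) + gaps * char_spacing + 2 * buffer)


# Three characters of widths 6, 3, and 6 with 1 unit between them
# and a 0.5 unit buffer on each side.
print(block_width([6, 3, 6], char_spacing=1, buffer=0.5))
```

A term of n characters has n - 1 gaps, so spacing is counted between characters only, and the buffer accounts for the block extending slightly beyond the term at both ends.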
At 78, the instances of the queried term or phrase can be identified in the document image. Queried terms can be identified using, for example, an OCR technique or the like. According to some aspects, acts 76 and 78 may be performed in reverse order or in parallel, depending on processing capabilities, system requirements, user or designer preferences, etc. At 80, search results (e.g., instances of the queried term) are highlighted by overlaying the generated highlight block on the document image without rendering the underlying text characters. In this manner, a searchable electronic document image can be maintained at a small compressed file size while providing enough font information to permit highlighting of query results for presentation to a user.
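Acts 76 through 80 can be tied together in a short sketch. It assumes the OCR step has already produced a list of recognized words with their positions in the document image; the `OcrWord` structure and the `find_highlights` function are hypothetical stand-ins, and the actual drawing of the overlay is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    text: str  # recognized word
    x: float   # left edge of the word in the document image
    y: float   # bottom edge of the word in the document image


def find_highlights(query, ocr_words, block_size):
    """Return one (x, y, width, height) rectangle per matching word.

    block_size is the (width, height) determined at 76; matching the
    OCR output corresponds to 78; the returned rectangles are what
    would be overlaid on the image at 80.
    """
    width, height = block_size
    wanted = query.lower()
    return [
        (w.x, w.y, width, height)
        for w in ocr_words
        if w.text.lower() == wanted
    ]


words = [
    OcrWord("font", 10, 20),
    OcrWord("Glyphless", 50, 20),
    OcrWord("glyphless", 5, 40),
]
print(find_highlights("glyphless", words, (46, 10)))
```

Because the block size depends only on the queried term and the font size information, it can be computed before, after, or alongside the matching step, consistent with acts 76 and 78 being performed in either order or in parallel.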
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. It will also be appreciated that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.