1. Field of the Invention
This invention generally relates to digital image processing and, more particularly, to a system and method that determine a phrase associated with a color-highlighted area of a document, and automatically locate and mark other instances of the phrase in the document.
2. Description of the Related Art
Color highlighting recognition for scanned documents is becoming more prevalent. Likewise, it is now possible to print color documents at lower cost than in the past. However, only a limited number of digital document processes take advantage of color scanning features, or recognize that documents are now often printed in color.
Conventionally, if a person wants to highlight similar terms on an original printed document, they must manually read each page, find the similar terms, and highlight them. This can be a tedious process, especially with long documents, and terms can easily be missed.
It would be advantageous if the color processing capabilities of digital document devices could be maximized.
It would be advantageous if a digital document process, such as a word search or administrative operation, could be initiated by using color to highlight an area of a hardcopy document.
It would be advantageous if the above-mentioned color highlighting process could be used to reduce the man-hours associated with printing, archiving, or communicating a document.
A system and method are provided that permit a user to highlight one or more terms on an original paper, and scan the document. An imaging device, such as a multifunctional peripheral (MFP), or a networked server, scans the document in color and recognizes whether the page contains color highlights over text, using image segmentation. Then, the entire set of scanned pages is run through a text recognition process (OCR), which can be on a networked server, or contacted through a web service directly from the MFP. Secondary processing recognizes words that are highlighted in appropriate colors (keywords). These keywords are located in response to searching the text of an OCR processed document. The terms or keywords are located in the remainder of the document, and associated with the same color highlighting that was initially applied to the original paper. Finally, a document, with the additional highlights, is printed by the MFP, emailed, or saved in image or text format facilitating reuse via common document formats like PDF.
This color highlighting technique can also be used for redaction of documents. A color highlight can be used to search for similar terms and then apply blackout redaction to the original through a slight modification to the process. The specific process and desired output may be selected prior to the scanning.
Accordingly, a method is provided for processing a document image using color highlighting. The method comprises: scanning a document, creating a document image; searching the document image for a color-highlighted area; processing the document image with optical character recognition (OCR), creating a text document; identifying a text phrase associated with the color-highlighted area; searching the text document for the identified text phrase; and, tracking each area in the document image associated with the identified text phrase.
Searching the document image for a color-highlighted area includes supplying a coordinate associated with the color-highlighted area. A text phrase in the text document is identified as being associated with the color-highlighted area in response to locating the text phrase at the color-highlighted area coordinates. Tracking each area in the document image associated with the identified text phrase includes: tracking the coordinates of each identified text phrase in the text document; and, transposing the coordinates to the document image.
In one aspect, a highlighted document is printed with markings in the tracked areas, following the transposing of the coordinates to the document image. For example, a print engine may generate a document image, temporarily store the document image, and overlay markings on the stored image corresponding to the transposed coordinates in the document image. Alternately, image markings are created in regions of the document image corresponding to the transposed coordinates, creating a marked document image. Then, the marked document image can be printed.
Tracking each area in the document image associated with the identified text phrase includes using a marking such as color highlighting, redacting, or text highlighting using font, bold, italics, or underlining. For example, if the original document includes a phrase marked in yellow, each tracked occurrence of the phrase in the printed document could also be marked in yellow.
Additional details of the above-described method and a system for processing a document image using color highlighting are presented below.
An image segmentation module (ISM) 110 has an interface on line 108 to accept the document image. The ISM 110 has an interface on line 112 to supply coordinates in response to searching the document image for the color-highlighted areas. An optical character recognition (OCR) module 114 has an interface on line 108 to accept the document image and an interface on line 112 to accept the color-highlighted area coordinates. The OCR module 114 creates a text document from the document image and supplies the text document and a text phrase, identified in the text document as being associated with the color-highlighted area coordinates, at an interface on line 116.
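As a rough illustration of how the ISM 110 might supply coordinates for a color-highlighted area, the following minimal sketch scans a small RGB bitmap and returns the bounding box of highlight-colored pixels. The color thresholds, the function names, and the single-box simplification are assumptions made for illustration; the source does not specify a segmentation algorithm, and a practical segmenter would separate distinct highlighted regions.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]        # (R, G, B), each 0-255
Box = Tuple[int, int, int, int]     # (left, top, right, bottom), inclusive

WHITE: Pixel = (255, 255, 255)
YELLOW: Pixel = (255, 255, 0)


def is_yellow_highlight(p: Pixel) -> bool:
    """Hypothetical color test: strong red and green, weak blue."""
    r, g, b = p
    return r >= 200 and g >= 200 and b <= 120


def find_highlight_box(image: List[List[Pixel]]) -> Optional[Box]:
    """Return the bounding box of all highlight-colored pixels, or None."""
    xs, ys = [], []
    for y, row in enumerate(image):
        for x, p in enumerate(row):
            if is_yellow_highlight(p):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# A 6x4 white page with a 3x2 yellow patch.
page = [[WHITE] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (2, 3, 4):
        page[y][x] = YELLOW

box = find_highlight_box(page)
print(box)
```

Because the sketch merges every highlighted pixel into one box, it is only suitable for a page with a single highlighted area.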
A search module 118 has an interface to accept the text document and the identified text phrase on line 116. The search module 118 searches the text document for the identified text phrase and supplies coordinates for the location of each identified text phrase at an interface on line 120. A bitmap processing module (BPM) 122 has an interface on line 108 to accept the document image, and an interface on line 120 to accept the identified text phrase coordinates. The BPM 122 supplies a document image tracking each area associated with the identified text phrase coordinates on line 124. That is, the bitmap processing module 122 transposes identified text phrase coordinates in the text document into coordinates in the document image.
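The search and transposition described above can be sketched as follows. The sketch assumes, hypothetically, that the OCR output is a list of words with bounding boxes in text-document units (inches here) and that transposing to the document image is a uniform scale by the scan resolution; the names `find_phrase` and `transpose` are illustrative and do not come from the source.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (left, top, right, bottom)
OcrWord = Tuple[str, Box]                 # hypothetical OCR output record

PUNCT = '.,;:!?"'


def find_phrase(words: List[OcrWord], phrase: str) -> List[Box]:
    """Return one box per occurrence of the phrase (case-insensitive)."""
    target = phrase.lower().split()
    n = len(target)
    hits: List[Box] = []
    if n == 0:
        return hits
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if [w.lower().strip(PUNCT) for w, _ in window] == target:
            boxes = [b for _, b in window]
            # Union of the word boxes; assumes the phrase sits on one line.
            hits.append((min(b[0] for b in boxes), min(b[1] for b in boxes),
                         max(b[2] for b in boxes), max(b[3] for b in boxes)))
    return hits


def transpose(box: Box, scale: float) -> Tuple[int, ...]:
    """Scale text-document coordinates to document-image pixels."""
    return tuple(round(v * scale) for v in box)


words = [
    ("Third", (1.0, 1.0, 1.5, 1.2)),
    ("quarter", (1.6, 1.0, 2.3, 1.2)),
    ("profit", (2.4, 1.0, 2.9, 1.2)),
    ("rose.", (3.0, 1.0, 3.4, 1.2)),
    ("Profit", (1.0, 1.5, 1.5, 1.7)),
]

# Assumed scan resolution of 300 pixels per inch.
pixel_boxes = [transpose(b, 300) for b in find_phrase(words, "profit")]
print(pixel_boxes)
```

A phrase that wraps across two lines would need one box per line rather than a single union box.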
The bitmap processing module 122 tracks each area associated with the identified text phrase coordinates by using a marking such as color highlighting, redacting, or text highlighting using font, bold, italics, or underlining, to name a few examples. Other conventional forms of marking that draw a reader's attention to certain areas of a document can also be used to enable the system. Note that, at this stage in the process, the “markings” are in electronic form.
For example, the image segmentation module 110 may search the document image for an area highlighted in a first color (e.g., yellow). A text phrase, e.g., “profit”, is identified in the first color-highlighted area. The bitmap processing module 122 tracks each area associated with the identified text phrase coordinates by marking the tracked areas with the yellow (first) color. Alternately, the BPM 122 can mark the tracked areas using a means other than color; for example, the tracked areas can be marked by underlining. That is, the BPM 122 underlines or color-marks each instance of the word “profit”.
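The marking alternatives can be illustrated with a minimal sketch that operates on a small in-memory RGB bitmap. The `mark_region` function, its style names, and the per-channel-minimum tinting rule are assumptions chosen for illustration; they are not taken from the source.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive

WHITE: Pixel = (255, 255, 255)
BLACK: Pixel = (0, 0, 0)
YELLOW: Pixel = (255, 255, 0)


def mark_region(image: List[List[Pixel]], box: Box,
                style: str = "highlight", color: Pixel = YELLOW) -> None:
    """Mark one tracked area in place using the chosen style."""
    if style not in ("highlight", "redact", "underline"):
        raise ValueError(f"unknown style: {style}")
    left, top, right, bottom = box
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            if style == "highlight":
                # Per-channel minimum tints the background but keeps dark text.
                image[y][x] = tuple(min(c, k) for c, k in zip(image[y][x], color))
            elif style == "redact":
                image[y][x] = BLACK
            elif y == bottom:
                # Underline: darken only the bottom row of the box.
                image[y][x] = BLACK


# A 4x3 white page with one black "text" pixel at column 1, row 1.
page = [[WHITE] * 4 for _ in range(3)]
page[1][1] = BLACK

highlighted = [row[:] for row in page]
mark_region(highlighted, (0, 0, 2, 1), "highlight")

redacted = [row[:] for row in page]
mark_region(redacted, (0, 0, 2, 1), "redact")

underlined = [row[:] for row in page]
mark_region(underlined, (0, 0, 2, 1), "underline")
```

In the highlighted copy the text pixel stays black while its surroundings turn yellow; in the redacted copy the whole box is black, so the text is no longer recoverable from the image.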
The system 100 may further comprise a print engine 126 having an interface on line 124 to accept the document image from the bitmap processing module. The print engine 126 has an interface on line 128 to supply a printed highlighted document with markings 127 in the tracked areas. In one aspect, the print engine 126 prints the highlighted document as a two-step or three-step operation. The print engine generates the document image to be printed and stores the document image in memory 129. Note that in some aspects the print engine receives the document image in a ready-to-print format. Then, the print engine 126 overlays markings in regions corresponding to the transposed coordinates in the document image, onto the document image in memory 129, prior to printing. That is, the print engine 126 generates a marked document image.
In a different aspect, the bitmap processing module 122 creates the marked document image with image markings in regions of the document image corresponding to the transposed coordinates. Then, the marked document image can be printed at print engine 126. That is, the marking process is transparent to the print engine 126.
In one aspect, the bitmap processing module 122 converts the marked document image into an image format such as tagged image file format (TIFF or TIF) or portable document format (PDF). However, the system is not limited to any particular format. Then, the converted marked document can be emailed on line 130, or filed in memory 132.
In another aspect the system further comprises an auxiliary processing module (APM) 134 having an interface on line 116 to accept the text document and the identified text phrase. The auxiliary processing module 134 performs a process such as identifying an address in the text document, calculating the number of identified text phrase occurrences, automatically creating an index for identified text phrases, initiating a search for stored documents associated with the identified text phrase, sending a highlighted document image to an identified address in the document image, or filing a highlighted document image in a folder associated with the identified text phrase.
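Two of the auxiliary processes listed above, counting occurrences and creating an index, can be sketched as follows. The sketch assumes, hypothetically, that the text document is available as one string per page; `build_index` is an illustrative name and not part of the source.

```python
import re
from typing import Dict, List


def build_index(pages: List[str], phrase: str) -> Dict[int, int]:
    """Map 1-based page numbers to occurrence counts for the phrase.

    Pages without the phrase are omitted. Matching is case-insensitive
    and limited to whole words, so "profit" does not match "profitable".
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    index: Dict[int, int] = {}
    for number, text in enumerate(pages, start=1):
        count = len(pattern.findall(text))
        if count:
            index[number] = count
    return index


pages = [
    "Profit rose. Gross profit doubled.",
    "No mention here.",
    "Profitable quarters lift profit.",
]

index = build_index(pages, "profit")
total = sum(index.values())
print(index, total)
```

The same page-to-count mapping could drive an index page or a decision about where to file the document, but those steps are outside this sketch.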
In a different aspect the system further comprises an electronically formatted thesaurus 136 accessible on line 138. The search module 118 accesses the thesaurus 136 for terms similar to the identified text phrase, searches the text document for the identified similar terms, and additionally supplies coordinates associated with identified similar terms. For example, the search module 118 may initiate a search for terms similar to “revenue”, and may choose to additionally highlight terms such as “income” and “cash”.
In one aspect the system further comprises an electronically formatted language translation dictionary 140 accessible on line 142. The search module 118 accesses the dictionary 140 for a translation of the identified text phrase, searches the text document for the identified translation term, and additionally supplies coordinates for identified translation terms. For example, the search module 118 may additionally highlight the German translation for the term “revenue”.
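The thesaurus and translation lookups can both be viewed as expanding the identified phrase into a list of search terms. The following minimal sketch uses two small in-memory dictionaries as stand-ins for the thesaurus 136 and the dictionary 140; the entries, the function names, and the restriction to single-word terms are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical stand-ins for the thesaurus 136 and translation dictionary 140.
THESAURUS: Dict[str, List[str]] = {"revenue": ["income", "cash"]}
TRANSLATIONS: Dict[str, List[str]] = {"revenue": ["Umsatz"]}

PUNCT = '.,;:!?"'


def expand_terms(phrase: str, *sources: Dict[str, List[str]]) -> List[str]:
    """Return the phrase plus related terms, lowercased, without duplicates."""
    key = phrase.lower()
    terms = [key]
    for source in sources:
        for term in source.get(key, []):
            if term.lower() not in terms:
                terms.append(term.lower())
    return terms


def find_terms(words: List[str], terms: List[str]) -> List[int]:
    """Return the positions of words matching any single-word term."""
    wanted = set(terms)
    return [i for i, w in enumerate(words)
            if w.lower().strip(PUNCT) in wanted]


terms = expand_terms("Revenue", THESAURUS, TRANSLATIONS)
words = "Revenue grew while income fell; der Umsatz stieg.".split()
positions = find_terms(words, terms)
print(terms, positions)
```

Each returned position would then be converted to coordinates and marked in the same way as a direct match on the identified phrase.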
Several of the above-mentioned system elements may be enabled as a set of software instructions that can be stored in memory and manipulated by a microprocessor. However, other elements, such as the print engine and scanner, include at least some machinery. In some aspects, all the above-mentioned elements can reside in a common device, an MFP for example. However, the elements may also reside in network or locally-connected devices.
The above-described system builds upon, and uniquely combines, some conventional technologies. Image segmentation is a process of locating regions in an image based on analysis of the image content. This technology is commonly used in compression techniques such as mixed raster content (MRC), to compress color regions differently from monochrome regions. For example, an MRC formatted document may result from segmenting a document and recompressing it into a file type that applies monochrome compression to some regions and color compression to others. The system also builds upon OCR text recognition, which is used after segmentation.
Step 302 scans a document, creating a document image. Step 304 searches the document image for a color-highlighted area. For example, Step 304 may use an image segmentation process to search for the color-highlighted area. Step 306 processes the document image with optical character recognition (OCR), creating a text document. Step 308 identifies a text phrase associated with the color-highlighted area. For example, Step 308 may identify the text phrase in the text document associated with the color-highlighted area. Step 310 searches the text document for the identified text phrase. Step 312 tracks each area in the document image associated with the identified text phrase.
Step 312 may track each area in the document image associated with the identified text phrase using a marking such as color highlighting, redacting, or text highlighting using font, bold, italics, or underlining. In one example of the method, Step 304 searches the document image for an area highlighted in a first color. Then, Step 312 marks the tracked areas with the first color. Alternately, Step 312 may mark the tracked areas with a color other than the first color.
In another example, Step 304 searches for a plurality of areas highlighted with a corresponding plurality of different colors. For example, Step 304 may find a yellow area associated with the word “revenue” and a blue area associated with the phrase “third quarter”. Identifying a text phrase associated with the color-highlighted area in Step 308 includes identifying a particular text phrase with each color. Then, tracking each area in the document image associated with the identified text phrase in Step 312 includes independently tracking areas associated with each text phrase.
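Independent tracking per color can be sketched as a mapping from each highlight color to its own list of occurrences. The sketch assumes, hypothetically, that Steps 304 and 308 have already produced a color-to-phrase mapping; `track_by_color` is an illustrative name.

```python
from typing import Dict, List

PUNCT = '.,;:!?"'


def track_by_color(words: List[str],
                   phrases_by_color: Dict[str, str]) -> Dict[str, List[int]]:
    """For each color, list the word positions where its phrase starts."""
    normalized = [w.lower().strip(PUNCT) for w in words]
    tracked: Dict[str, List[int]] = {}
    for color, phrase in phrases_by_color.items():
        target = phrase.lower().split()
        n = len(target)
        if n == 0:
            tracked[color] = []
            continue
        tracked[color] = [i for i in range(len(normalized) - n + 1)
                          if normalized[i:i + n] == target]
    return tracked


# Assumed result of Steps 304 and 308 for a page with two highlight colors.
phrases_by_color = {"yellow": "revenue", "blue": "third quarter"}

text = "Third quarter revenue rose. Revenue in the third quarter beat forecasts."
tracked = track_by_color(text.split(), phrases_by_color)
print(tracked)
```

Keeping one list per color lets the marking step reuse the color of the original highlight for each phrase.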
In one aspect, searching the document image for a color-highlighted area in Step 304 includes supplying a coordinate associated with the color-highlighted area. Then, identifying a text phrase in the text document associated with the color-highlighted area in Step 308 includes identifying a text phrase in the text document corresponding to the color-highlighted area coordinates.
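One way to relate the color-highlighted area coordinates to a text phrase is to select the OCR words whose boxes are mostly covered by the highlight box. The following is a minimal sketch under that assumption; the 50% coverage threshold, the function names, and the word-box record format are illustrative choices and are not specified by the source.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
OcrWord = Tuple[str, Box]         # hypothetical OCR output record


def overlap_fraction(word_box: Box, highlight_box: Box) -> float:
    """Fraction of the word box's area that lies inside the highlight box."""
    left = max(word_box[0], highlight_box[0])
    top = max(word_box[1], highlight_box[1])
    right = min(word_box[2], highlight_box[2])
    bottom = min(word_box[3], highlight_box[3])
    if right <= left or bottom <= top:
        return 0.0
    area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    return (right - left) * (bottom - top) / area if area > 0 else 0.0


def phrase_at(words: List[OcrWord], highlight_box: Box,
              min_overlap: float = 0.5) -> str:
    """Join, in reading order, the words sufficiently covered by the highlight."""
    return " ".join(w for w, box in words
                    if overlap_fraction(box, highlight_box) >= min_overlap)


words = [
    ("Third", (100, 100, 160, 120)),
    ("quarter", (170, 100, 250, 120)),
    ("profit", (260, 100, 320, 120)),
    ("rose", (330, 100, 380, 120)),
]

wide = phrase_at(words, (165, 95, 325, 125))
narrow = phrase_at(words, (255, 98, 300, 122))
print(wide, "|", narrow)
```

A coverage threshold tolerates a highlighter stroke that stops short of a word's edge, as in the second call, where only two thirds of "profit" is covered.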
In another aspect, tracking each area in the document image associated with the identified text phrase in Step 312 includes substeps. Step 312a tracks the coordinates of each identified text phrase in the text document. Step 312b transposes the coordinates to the document image.
In a different aspect, following the transposing of the coordinates to the document image (Step 312b), Step 314 prints a highlighted document with markings in the tracked areas. For example, Step 314 may include substeps. Step 314a generates the document image at the printer. Alternately, the document image is received in a printer-ready format. Step 314b stores the document image in printer memory. Step 314c overlays markings, in regions corresponding to the transposed coordinates in the document image, onto the document image in memory prior to printing.
Alternately, Step 313 creates image markings in regions of the document image corresponding to the transposed coordinates, creating a marked document image. Then, Step 314 prints the marked document image as a highlighted document.
In another aspect, Step 316 converts the marked document image into an image format such as TIF or PDF. Then, Step 318 either emails the converted document or files the converted document in memory. Other operations can also be performed using the converted document.
In a different aspect Step 309, following the searching of the OCR processed document for the identified text phrase (Step 308), performs a process such as identifying an address in the text document, sending the marked document image to an identified address in the document image, calculating the number of identified text phrase occurrences, automatically creating an index for identified text phrases, filing the marked document image in a folder associated with the identified text phrase, or initiating a search for stored documents associated with the identified text phrase.
In another aspect, Step 307a accesses a thesaurus for terms similar to the identified text phrase. Then, Step 310 additionally searches the text document for the identified similar terms, and Step 312 additionally tracks areas in the document image associated with identified similar terms.
Alternately, Step 307b accesses a language translation dictionary for a term associated with the identified text phrase. Then, Step 310 additionally searches the text document for the identified translated term, and Step 312 additionally tracks areas in the document image associated with the translated term.
A system and method have been provided for marking terms in a document in response to initially identifying a term associated with a color-highlighted region, and tracking each instance of the identified term in the document. A few examples of initial color highlighting means have been presented, but the invention is not limited to just these examples. For example, the invention might be used to initially identify other kinds of markings, such as circles or underlines. Further, the invention can be extended to identify images, logos, signatures, and the like, as well as just words. Examples have also been given of the manner in which the final document might be marked, after all the terms have been located. Again, the invention is not limited to merely these examples. Other variations and embodiments of the invention will occur to those skilled in the art.