While information is increasingly communicated in electronic form with the advent of modern computing and networking technologies, physical documents, such as printed and handwritten sheets of paper and other physical media, are still often exchanged. Such documents can be converted to electronic form by a process known as optical scanning. Once a document has been scanned as a digital image, the resulting image may be archived, or may undergo further processing to extract information contained within the document image so that the information is more usable. For example, the document image may undergo optical character recognition (OCR), which converts the image into text that can be edited, searched, and stored more compactly than the image itself.
As noted in the background, a physical document can be scanned as a digital image to convert the document to electronic form. Traditionally, dedicated scanning devices have been used to scan documents to generate images of the documents. Such dedicated scanning devices include sheetfed scanning devices, flatbed scanning devices, and document camera scanning devices, as well as multifunction devices (MFDs) or all-in-one (AIO) devices that have scanning functionality in addition to other functionality such as printing functionality. However, with the near ubiquity of smartphones and other mobile computing devices that include cameras and other types of image-capturing sensors, documents are often scanned with such non-dedicated scanning devices.
When scanning documents using a dedicated scanning device, a user may not have to individually feed each document into the device. For example, the scanning device may have an automatic document feeder (ADF) in which a user can load multiple documents. Upon initiation of scanning, the scanning device individually feeds and scans the documents, which may result in generation of an electronic file for each document or a single electronic file including all the documents. For example, the electronic file may be in the portable document format (PDF) or another format, and in the case in which the file includes all the documents, each document may be in a separate page of the file.
However, some dedicated scanning devices, such as lower-cost flatbed scanning devices as well as many document camera scanning devices, do not have ADFs. Non-dedicated scanning devices such as smartphones also lack ADFs. To scan multiple documents, a user has to manually position each document and cause the device to scan or capture an image of it, on a per-document basis. Scanning multiple documents is therefore more tedious, and much more time consuming, than when using a dedicated scanning device that has an ADF.
Techniques described herein ameliorate these and other difficulties. The described techniques permit multiple documents to be concurrently scanned, instead of having to individually scan or capture images of the documents on a per-document basis. A dedicated scanning device or a non-dedicated scanning device can be used to capture an image of multiple documents. For example, multiple documents can be positioned on the platen of a flatbed scanning device and scanned together as a single captured image, or the camera of a smartphone can be used to capture an image of the documents as positioned on a desk or other surface in a non-overlapping manner.
The described techniques extract segmentation masks that correspond to identified documents within the captured image, permitting the documents to be segmented into different electronic files or as different pages of the same file. A segmentation mask for a document is a mask that has edges corresponding to the edges of the document. Therefore, applying the segmentation mask for a document against the captured image generates an image of the document. The segmentation masks for the identified documents within the captured image are thus individually applied to the captured image of all the documents to generate images that each correspond to one of the documents.
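As one illustration of this mask application step, the following is a minimal sketch, assuming each segmentation mask is a binary array of the same height and width as the captured image; the function and variable names are illustrative rather than taken from any particular implementation.

```python
# Minimal sketch of applying one segmentation mask to the captured image to
# obtain the corresponding document image. Assumes a binary (0/1) mask of the
# same height and width as the image; all names here are illustrative.
import numpy as np


def extract_document_image(captured_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Return the region of captured_image covered by mask, cropped to the
    mask's bounding box."""
    # Zero out every pixel that does not belong to the document.
    masked = captured_image * mask[..., np.newaxis]

    # Crop to the tight bounding box of the mask so the result contains only
    # the document rather than the full captured image.
    ys, xs = np.nonzero(mask)
    return masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```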
A point extraction machine learning model 108 is applied (110) to the captured image 102 of the documents 104 to identify (112) the documents 104 via their respective center points 116 within the captured image 102 as well as boundary points 118 for each identified document 104. For example, the captured image 102 may be input into the point extraction model 108. The model 108 then responsively outputs the center points 116 of the documents 104 and the boundary points 118 for each document 104 for which a center point 116 has been identified. Each center point 116 thus corresponds to a document 104 and is associated (117) with a set of boundary points 118 of the document 104 in question.
The point extraction machine learning model 108 is said to identify the documents 104 within the captured image 102 insofar as the model 108 identifies a center point 116 of each document 104 within the image 102. The center point 116 of a document 104 within the captured image 102 is the precise or approximate center of the document 104 within the image 102. For each document 104 that the point extraction model 108 has identified via a center point 116, the model 108 provides a set of boundary points 118. Each boundary point 118 of a document 104 is a point on an edge of the document 104 within the captured image 102.
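The outputs described above can be summarized with the following sketch, which assumes the point extraction model is wrapped as a Python callable returning the center points and their associated boundary point sets; the model interface and all names are assumptions made for illustration.

```python
# Sketch of the data the point extraction model is described as producing:
# one center point per identified document, each associated (117) with a set
# of boundary points on that document's edges. All names are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in captured-image pixel coordinates


@dataclass
class IdentifiedDocument:
    center_point: Point           # precise or approximate center of the document
    boundary_points: List[Point]  # points lying on the document's edges


def apply_point_extraction(point_model, captured_image) -> List[IdentifiedDocument]:
    """Run the point extraction model and pair each center point with its
    associated set of boundary points."""
    centers, boundaries = point_model(captured_image)  # hypothetical model interface
    return [IdentifiedDocument(c, b) for c, b in zip(centers, boundaries)]
```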
The center points 116 of the documents 104 and their associated sets of boundary points 118 may be displayed (120) in an overlaid manner on the captured image 102. A user may then be permitted to modify the boundary points 118 for each document 104 identified by a corresponding center point 116 (122). For example, the user may be permitted to remove erroneous boundary points 118 that are not on the edges of a document 104, or move such boundary points 118 so that they are more accurately located on the edges of the document 104 in question. The user may further be permitted to add boundary points 118, so that the boundary points 118 of each document 104 accurately reflect every edge of that document 104.
A specific example of the point extraction machine learning model 108 is described later in the detailed description. The model 108 is a machine learning model in that it leverages machine learning to extract the document center points 116 and the document boundary points 118 within the captured image 102. For example, the model 108 may be a convolutional neural network machine learning model. The model 108 is a point extraction model in that it extracts points, specifically the document center points 116 and the document boundary points 118.
For the documents 104 identified by the center points 116, an instance segmentation machine learning model 124 is applied (126) to the boundary points 118 of the documents 104 (as may have been modified) and the captured image 102 of all the documents 104 to extract (128) segmentation masks 130 for the identified documents 104. For instance, the boundary points 118 of the documents 104 may be input on a per-document basis, along with the captured image 102, into the instance segmentation model 124. The model 124 then responsively outputs on a per-document basis the segmentation masks 130 for the documents 104, where each mask 130 corresponds to one of the documents 104.
For example, if there are n documents 104 identified by the center points 116, then the instance segmentation machine learning model 124 is applied n times, once for each such identified document 104. To extract the segmentation mask 130 for the i-th document 104, where i=1 . . . n, the boundary points 118 for just this document 104 are input into the instance segmentation model 124, along with the captured image 102 of all the documents 104. That is, the boundary points 118 for the other documents 104 are not input into the model 124.
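A minimal sketch of this per-document application is given below; the segmentation model is assumed to be a callable taking the full captured image and one document's boundary points, and the names are illustrative only.

```python
# Sketch of applying the instance segmentation model once per identified
# document: each of the n applications receives the full captured image but
# only the i-th document's boundary points. Names are illustrative.
def extract_segmentation_masks(segmentation_model, captured_image, documents):
    masks = []
    for doc in documents:  # n identified documents -> n model applications
        # Only this document's boundary points are supplied, so the returned
        # mask corresponds to this document alone.
        mask = segmentation_model(captured_image, doc.boundary_points)
        masks.append(mask)
    return masks
```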
A specific example of the instance segmentation machine learning model 124 is described later in the detailed description. The model 124 is a machine learning model in that it leverages machine learning to extract a document segmentation mask 130 for each document 104 identified within the captured image 102 by the point extraction model 108. For example, the model 124 may be a convolutional neural network machine learning model. The model 124 is an instance segmentation machine learning model in that the segmentation mask 130 extracted for a document 104 can be used to segment the captured image 102 in correspondence with this document 104, which is considered as an instance in this respect.
The segmentation masks 130 of the documents 104 may be displayed (132) in an overlaid manner on the captured image 102 for user approval. For instance, the user may not approve (134) of a segmentation mask 130 for a given document 104 if the mask 130 does not have edges that accurately correspond to the edges of the document 104 within the image 102. The process 100 may therefore revert back to displaying (120) the center point 116 and the boundary points 118 for any such document 104 for which a segmentation mask 130 has been disapproved.
In such an instance, the user is therefore again afforded the opportunity to modify (122) the boundary points 118 for the disapproved documents 104. The instance segmentation model 124 is then reapplied (126) for each such document 104 on the basis of its newly modified boundary points 118 (and the captured image 102 itself) to reextract (128) the segmentation masks 130 for these documents 104. This iterative workflow permits the segmentation masks 130 to be reextracted more accurately without having to recapture the image 102, and thus permits reextraction of the masks 130 even if the documents 104 are no longer available to be recaptured within a new image 102.
Existing segmentation mask extraction techniques, by comparison, may not permit a user to extract a more accurate segmentation mask 130 for a document 104 without the user capturing a new image 102 of the document 104. If the document 104 is no longer available, such techniques are therefore unable to extract a more accurate segmentation mask 130 if the user disapproves of the initially extracted mask 130 for the document 104. By comparison, the process 100 provides for extraction of a potentially more accurate segmentation mask 130 by permitting the user to modify the boundary points 118 on which basis the instance segmentation model 124 extracts the mask 130, without having to capture a new image 102.
Upon user approval of the segmentation masks 130 for the documents 104 identified within the captured image 102 (134), the segmentation masks 130 are individually applied (136) to the captured image 102 to segment the image 102 into separate images 138 corresponding to the documents. That is, the segmentation mask 130 for a given document 104 is applied to the captured image 102 to extract a corresponding document image 138 from the image 102. The image 138 for each document 104 may be an electronic file in the same or different image file format as the electronic file of the captured image 102.
The process 100 can conclude by performing an action (140) on the individually extracted document images 138. For instance, the separate document images 138 may be saved in corresponding electronic image files, may be displayed to the user, or may be printed on paper or other printable media. Other actions that may be performed include image enhancement and/or processing, optical character recognition (OCR), and so on. For instance, the document images 138 may be individually rectified and/or deskewed, as two examples of image processing.
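As one example of such per-document image processing, the following sketch deskews an extracted document image with OpenCV by fitting a minimum-area rotated rectangle to the document's mask pixels; the particular recipe (and the handling of the rotation angle, which differs across OpenCV versions) is an illustrative assumption rather than a required step.

```python
# Illustrative deskewing of an extracted document image using its segmentation
# mask; this is only one possible post-processing action. The mask is assumed
# to be aligned with document_image (same height and width).
import cv2
import numpy as np


def deskew(document_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Fit a minimum-area rotated rectangle to the document's mask pixels.
    ys, xs = np.nonzero(mask)
    points = np.column_stack((xs, ys)).astype(np.float32)
    (cx, cy), _, angle = cv2.minAreaRect(points)

    # Rotate the image about the rectangle's center so the document is upright.
    rotation = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    height, width = document_image.shape[:2]
    return cv2.warpAffine(document_image, rotation, (width, height),
                          flags=cv2.INTER_LINEAR, borderValue=(255, 255, 255))
```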
In this respect, the process 100 can provide for accurate segmentation of an identified document 104 within the captured image 102 even if the document 104 is skewed within the image 102. For example, a user may capture an image 102 of a page of a book as a document 104. The thicker the book is, the more difficult it will be to flatten the book when capturing an image 102 of the page of interest as the document 104 (particularly without damaging the binding of the book), and therefore the more skewed the document 104 is likely to be within the image 102.
The process 100 can provide for accurate segmentation of such a document 104 within the image 102. This is at least because the instance segmentation model 124 is operative on a set of boundary points 118 for the document 104 that can be user adjusted if the boundary points 118 as initially provided by the point extraction model 108 do not result in extraction of an accurate segmentation mask 130 for the document 104. By comparison, existing segmentation mask techniques may assume that a document 104 is rectangular, or at least polygonal, in shape within the captured image 102, and therefore may not be able to provide for accurate segmentation of the document 104 if the document 104 is skewed within the image 102.
The heatmap 210 may be a monochromatic or grayscale image of the same size as the captured image 200, in which pixels have increasing (or decreasing) pixel values in correspondence with their likelihood of being the actual center points 212 of the documents 202. Therefore, there may be a collection or cluster of pixels at the center of each document 202, with the center of the cluster, or the pixel having the highest (or lowest) pixel value, corresponding to the center point 212 in question.
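One way such a heatmap could be converted into discrete center points is sketched below, by keeping local maxima whose values exceed a confidence threshold; the threshold and neighborhood size are illustrative choices, not values from the description.

```python
# Sketch of turning a center-point heatmap into discrete center points by
# keeping sufficiently confident local maxima (the peaks of the per-document
# pixel clusters). Threshold and window size are illustrative.
import numpy as np
from scipy.ndimage import maximum_filter


def heatmap_to_center_points(heatmap: np.ndarray, threshold: float = 0.5,
                             window: int = 11):
    # A pixel is a candidate center point if it is the peak of its local
    # neighborhood and its value exceeds the confidence threshold.
    peaks = (heatmap == maximum_filter(heatmap, size=window)) & (heatmap >= threshold)
    ys, xs = np.nonzero(peaks)
    return list(zip(xs.tolist(), ys.tolist()))  # (x, y) center points
```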
The boundary points 222 identified by the point extraction model 108 may, but do not necessarily, include corner points of the documents 202. In general, each edge of a document 202 may have a sufficient number of boundary points 222 identified by the model 108 to define or accurately reflect the contour of the edge in question. As has been noted, the user may be afforded the opportunity to adjust the boundary points 222 identified by the point extraction model 108 so that the boundary points 222 of the documents 202 are sufficiently indicated to result in accurate segmentation mask extraction.
The point extraction machine learning model 108 may leverage existing machine learning models. An example of such a machine learning model is described in Xie et al., “Polarmask: Single Shot Instance Segmentation with Polar Representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020) (hereinafter, the “Polarmask reference”). However, the point extraction model 108 differs from the model used in the Polarmask reference in at least two ways.
First, the Polarmask reference identifies the center point of a single object within an image and this object's boundary points at regular polar angles around the center point, and then stitches or joins together the boundary points to form a segmentation mask of the object. By comparison, the point extraction model 108 does not stitch or join together the boundary points 118 of each document 104 for which a center point 116 has been identified to generate a segmentation mask 130 for the document 104 in question. Rather, another machine learning model—the instance segmentation model 124—is applied to the captured image 102 and the boundary points 118 of each document 104 (on a per-document basis) to generate segmentation masks 130 for the documents 104.
Therefore, the segmentation masks 130 are generated in a different manner than that described in the Polarmask reference. Stated another way, the point extraction machine learning model 108 extracts the boundary points 118 for the documents 104 identified by their center points 116, and does not generate the segmentation masks 130, in contradistinction to the Polarmask reference. The utilization of another machine learning model—the instance segmentation model 124—has been demonstrated to provide for superior segmentation mask generation as compared to the approach used in the Polarmask reference.
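For reference, the polar-style boundary representation mentioned above can be sketched as follows: given a center point and one ray length per regularly spaced polar angle, the boundary points are recovered as (x, y) coordinates. In the described approach these points are then passed to the instance segmentation model 124 rather than stitched directly into a mask; the function name and ray count are illustrative.

```python
# Sketch of boundary points at regular polar angles around a center point.
import math
from typing import List, Tuple


def polar_boundary_points(center: Tuple[float, float],
                          ray_lengths: List[float]) -> List[Tuple[float, float]]:
    """Convert per-angle ray lengths (one per regularly spaced polar angle)
    into (x, y) boundary points around the given center point."""
    cx, cy = center
    n = len(ray_lengths)
    return [(cx + r * math.cos(2.0 * math.pi * i / n),
             cy + r * math.sin(2.0 * math.pi * i / n))
            for i, r in enumerate(ray_lengths)]
```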
Second, the Polarmask reference employs a residual neural network (ResNet) architecture as the backbone network 302, which is described in Targ et al., “Resnet in Resnet: Generalizing Residual Architectures,” arXiv: 1603.08029 (2016). By comparison, the point extraction machine learning model 108 may use a version of the MobileNetV2 architecture as the backbone network 302. This architecture is described in Mark Sandler et al., “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018).
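For instance, a MobileNetV2 feature-extraction trunk can be obtained from torchvision as sketched below; using this particular library, input size, and untrained weights is an assumption for illustration, not a statement of how the backbone network 302 is built.

```python
# Illustrative MobileNetV2 backbone via torchvision; weights=None avoids a
# download here, and pretrained weights could be loaded instead.
import torch
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2(weights=None).features  # convolutional trunk only

with torch.no_grad():
    image_batch = torch.randn(1, 3, 512, 512)  # dummy captured-image batch
    feature_maps = backbone(image_batch)       # downsampled feature maps, 1280 channels
print(feature_maps.shape)  # torch.Size([1, 1280, 16, 16])
```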
The instance segmentation machine learning model 124 may leverage existing machine learning models. An example of such a machine learning model is described in Maninis et al., “Deep Extreme Cut: From Extreme Points to Object Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) (hereinafter, the “DEXTR reference”). However, the instance segmentation model 124 differs from the model used in the DEXTR reference in at least two ways.
First, the DEXTR reference extracts a segmentation mask of a single object within an image from the object's extreme boundary points, as manually input or specified by a user. Specifically, the DEXTR reference requires that a user specify the corner points of an object. By comparison, the instance segmentation model 124 does not require manual user boundary point specification for each document 104, but rather leverages the boundary points 118 that are initially identified or extracted by the point extraction model 108. That is, another machine learning model—the point extraction model 108—is first applied to the captured image 102 to extract the boundary points 118 for each of one or multiple documents 104.
Moreover, the DEXTR reference is not as well equipped to accommodate skewed documents 104 that have curved edges. Corner, or extreme, boundary points may not sufficiently define such edges of such documents 104, and having a user specify a sufficient number of such points can require considerably more skill on the part of the user. A novice user, for instance, may be unable to identify which such boundary points 118 should be specified. The instance segmentation model 124 ameliorates this issue by having a different model—the point extraction model 108—provide initial extraction of the boundary points 118 of the documents 104.
Second, the DEXTR reference, like the Polarmask reference, employs a ResNet architecture as the backbone network 302. By comparison, the instance segmentation machine learning model 124 may use a version of the MobileNetV2 architecture as the backbone network 302. Such a backbone network 302 can better balance performance and size as compared to the ResNet architecture.
The usage of two machine learning models—a point extraction model 108 to initially extract the boundary points 118 of potentially multiple documents 104 and an instance segmentation model 124 to then individually extract their segmentation masks 130—provides for demonstrably more accurate segmentation masks 130 as compared to the Polarmask or DEXTR reference alone. Furthermore, the workflow afforded by the process 100 permits the boundary points 118 to be user modified and the segmentation masks 130 to be reextracted without the documents 104 having to be recaptured in a new image 102.
The processing includes applying a point extraction machine learning model to the captured image of one or multiple documents to identify the documents within the captured image and to identify boundary points for each document (404). The processing includes, for each document identified within the captured image, applying an instance segmentation machine learning model to the boundary points for the document and to the captured image to extract a segmentation mask for the document (406). As noted, the extracted segmentation masks can then be individually applied to the captured image to extract images corresponding to the documents from the captured image.
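Putting the two applications together, the overall processing can be sketched as below, reusing the illustrative helpers from the earlier sketches; all function names are hypothetical.

```python
# Consolidated sketch of the processing: identify documents and boundary
# points (404), extract one segmentation mask per document (406), then apply
# each mask to obtain per-document images. Helper names are hypothetical and
# correspond to the earlier sketches.
def process_captured_image(captured_image, point_model, segmentation_model):
    documents = apply_point_extraction(point_model, captured_image)
    masks = extract_segmentation_masks(segmentation_model, captured_image, documents)
    return [extract_document_image(captured_image, mask) for mask in masks]
```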
The instructions 508 are executable by the processor 504 to apply a point extraction machine learning model to the captured image to identify the documents within the captured image and to identify boundary points for each document (510). The instructions 508 are executable by the processor 504 to, for each document identified within the captured image, then apply an instance segmentation machine learning model to the boundary points for the document and to the captured image to extract a segmentation mask for the document (512). The instructions 508 are executable by the processor 504 to, for each document identified within the captured image, subsequently apply the segmentation mask for the document to the captured image to extract an image of the document from the captured image (514).
Techniques have been described for extracting segmentation masks for one or multiple documents within a captured image. Multiple documents can therefore be more efficiently scanned. Rather than a user having to individually capture an image of each document, the user just has to capture one image of multiple documents (or multiple images that each include more than one document). Furthermore, the extracted segmentation masks accurately correspond to the documents, even if the documents are skewed within the captured image.