Over the past two decades, cameras have become commonplace around the world. For example, the advent of portable integrated computing devices has caused a wide proliferation of cameras. These integrated computing devices commonly take the form of smartphones or tablets and typically include general-purpose computers, cameras, sophisticated user interfaces including touch-sensitive screens, and wireless communications capabilities through Wi-Fi, LTE, HSDPA, and other cell-based or wireless technologies. The wide proliferation of cameras, including those in these integrated devices, provides opportunities to use the devices' capabilities to take pictures and perform imaging-related tasks on a regular basis. Contemporary integrated devices, such as smartphones and tablets, typically have one or two embedded cameras. These cameras generally amount to lens/camera hardware modules that may be controlled through the general-purpose computer using downloadable software (e.g., “Apps”) and a user interface including the touch screen, fixed buttons, and touchless controls, such as voice control.
One opportunity for using the features of an integrated device (or a contemporary camera device) is to capture and evaluate images. A resident camera allows the capture of one or more images, and an integrated general-purpose computer provides the processing power to analyze the captured images. In addition, any analysis better suited to a network service computer can be accomplished by simply transmitting the image data or other data to a service computer (e.g., a server, a website, or another network-accessible computer) using the communications capabilities of the device.
These abilities of integrated devices allow for recreational, commercial, and transactional uses of images and image analysis. For example, images may be captured and analyzed to decipher information from the images, such as readable text, characters, and symbols. The text, characters, and symbols may be transmitted over a network for any useful purpose, such as for use in a game, a database, or as part of a financial transaction. For these reasons and others, it is useful to enhance the abilities of these integrated devices and other devices to efficiently identify image regions containing text, characters, or other meaningful symbol information.
Some images contain text, characters, or other decipherable symbols that could be useful if those characters or symbols were directly accessed by a computer in the manner that, for example, an ASCII character may be accessed. In discussing embodiments herein, the term “text” may be used to represent all types of communicative characters, such as all types of alphabets and symbols, including Japanese, Chinese, Korean, and alphabets of other languages. Some embodiments of this disclosure seek to enhance a computer's ability to quickly and efficiently detect text that is visibly embodied in images. Further, by using an integrated device, such as a smartphone or tablet, a user may capture an image, have the image processed to detect text, and then use the deciphered information for general computing purposes (e.g., OCR the detected text and employ it in a local or web-based application, such as gaming; utility, such as augmented reality; office productivity, such as word processing or spreadsheets; or transactions, such as financial transactions).
One example of using an integrated device having a camera and a reasonably capable computer is to capture an image or rapid sequence of images and detect the presence of text on the fly, as images are captured. In other embodiments, images may be captured separately from the application of text detection as discussed herein. For example, in addition to being relatively simultaneous, image capture and text detection may be temporally separated, logically separated, and/or geographically separated (e.g., capture on a local device and text detection on a network/internet connected computer resource, such as a server).
Many embodiments of the disclosure are initiated with the capture or receipt of image information, such as a color or black and white image. Depending upon the embodiment, each received or captured image may be binarized or trinarized, meaning that the pixels (or other components) of the image are categorized in two or three different ways. For example, in a binarized embodiment, each pixel may be represented as either black or white, depending upon the original status of the pixel. In a trinarized embodiment, each pixel may be categorized as black, white, or gray, where gray represents a pixel (or other component) that has been determined not to contain or form part of text. In some embodiments, the trinarized approach may be more efficient because the gray pixels may be eliminated from further text-related analysis.
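As a concrete illustration of trinarization, the three pixel categories may be represented directly in code. The following Swift sketch is illustrative only; the names (PixelClass, trinarize) and the low-contrast test are hypothetical stand-ins for whatever categorization a particular embodiment uses.

enum PixelClass { case black, white, gray }

// Hypothetical classification of a single pixel. A pixel in a region
// judged to have too little contrast is assumed to contain no text and
// is marked gray so later stages can skip it.
func trinarize(luma: UInt8, blackPoint: UInt8, lowContrast: Bool) -> PixelClass {
    if lowContrast { return .gray }
    return luma < blackPoint ? .black : .white
}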
After an image (or part thereof) has been binarized or trinarized, some embodiments of the invention attempt to identify blobs, which may, for example, be connected components of common color (e.g., the connected black pixels and the connected white pixels). In some embodiments, after blobs are detected, the resulting image may be described as gray area and a collection of blobs or connected components. Furthermore, once a plurality of blobs has been identified, some embodiments of the disclosure seek to group blobs or connected components into horizontal sequences. Moreover, the horizontal sequences may be found in either a top-down or bottom-up process (i.e., by starting with large sequences of blobs and working toward smaller sequences, or by starting with smaller sequences, such as one or two sequential blobs, and working up to larger sequences). Finally, after one or more horizontal sequences have been identified, a series of statistical tests may be applied to each sequence and/or blob to determine if the sequence or blob does not include text. After this elimination process, in some embodiments, any remaining sequences are determined or presumed to be detected text (although not necessarily recognized text). By performing the analysis without engaging in OCR or any glyph recognition, the invention is able to detect text in any alphabet, making it useful across languages and regions.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The inventive embodiments described herein may have application and use in all types of cameras and in single- and multi-processor computing systems or devices that include any type of camera. The discussion herein references a common computing configuration having a CPU resource including one or more microprocessors. The discussion is only for illustration and is not intended to confine the application of the invention to the disclosed hardware. Other systems having other known or common hardware configurations are fully contemplated and expected. With that caveat, a typical hardware and software operating environment is discussed below. The hardware configuration may be found, for example, in a camera device; a phone; or any computing device, such as a portable computing device comprising a phone and/or a computer and/or a camera.
Referring to
Processor 105 may execute instructions necessary to carry out or control the operation of many functions performed by device 100 (e.g., such as the generation and/or processing and/or evaluation and analysis of media, such as images). In general, many of the functions described herein are based upon a microprocessor acting upon software (instructions) embodying the function. Processor 105 may, for instance, drive display 110 and receive user input from user interface 115. User interface 115 can take a variety of forms, such as a button, a keypad, a dial, a click wheel, a keyboard, a display screen and/or a touch screen, or even a microphone or camera (video and/or still) to capture and interpret input sound/voice or images, including video. The user interface 115 may capture user input for any purpose, including for use as images or instructions to capture images or instructions to the system for any other function.
Processor 105 may be a system-on-chip, such as those found in mobile devices, and may include a dedicated graphics processing unit (GPU). Processor 105 may be based on reduced instruction-set computer (RISC) or complex instruction-set computer (CISC) architectures or any other suitable architecture and may include one or more processing cores. Graphics hardware 120 may be special purpose computational hardware for processing graphics and/or assisting processor 105 to process graphics information. In one embodiment, graphics hardware 120 may include one or more programmable GPUs.
Image capture circuitry 150 may capture still and video images that may be processed to generate images for any purpose, including to be analyzed to detect text, characters, and/or symbols in accordance with the teachings described herein. Output from image capture circuitry 150 may be processed, at least in part, by video codec(s) 155 and/or processor 105 and/or graphics hardware 120 and/or a dedicated image processing unit incorporated within image capture circuitry 150. Images so captured may be stored in memory 160 and/or storage 165 and/or in any storage accessible on an attached network. Memory 160 may include one or more different types of media used by processor 105, graphics hardware 120, and/or image capture circuitry 150 to perform device functions. For example, memory 160 may include memory cache, electrically erasable memory (e.g., flash), read-only memory (ROM), and/or random access memory (RAM). Storage 165 may store media (e.g., audio, image, and video files); computer program instructions; or other software, including database applications, preference information, device profile information, and any other suitable data. Storage 165 may include one or more non-transitory storage media, including, for example, magnetic disks (fixed, floppy, and removable) and tape; optical media, such as CD-ROMs and digital video disks (DVDs); and semiconductor memory devices, such as Electrically Programmable Read-Only Memory (EPROM) and Electrically Erasable Programmable Read-Only Memory (EEPROM). Memory 160 and storage 165 may be used to retain computer program instructions or code organized into one or more modules and written in any desired computer programming language. When executed by, for example, processor 105, such computer program code may implement one or more of the methods or functions described herein.
Referring now to
Also coupled to networks 205, and/or data server computers 210, are client computers 215 (i.e., 215A, 215B, and 215C), which may take the form of any computer, set-top box, entertainment device, communications device, or intelligent machine, including embedded systems. In some embodiments, users may employ client computers in the form of smartphones, tablets, laptops, or other computers. In some embodiments, network architecture 200 may also include network printers, such as printer 220, and storage systems, such as 225, which may be used to store multi-media items (e.g., images) that are referenced herein. To facilitate communication between different network devices (e.g., data servers 210, end-user computers 215, network printer 220, and storage system 225), at least one gateway or router 230 may optionally be coupled therebetween. Furthermore, in order to facilitate such communication, each device employing the network may comprise a network adapter. For example, if an Ethernet network is desired for communication, each participating device must have an Ethernet adapter or embedded Ethernet-capable ICs. Further, the devices must carry network adapters for any network in which they will participate.
As noted above, embodiments of this disclosure include software. As such, a general description of common computing software architecture is provided as expressed in the layer diagrams of
With those caveats regarding software, referring to
No limitation is intended by these hardware and software descriptions and the varying embodiments of the inventions herein may include any manner of computing device, such as Macs, PCs, PDAs, phones, servers, or even embedded systems.
A Text Detector
Many embodiments of the disclosure contemplate a text detector for detecting the presence of text in an image or image information. Examples of images containing text are shown in
A Process for Text Detection
Referring to
Referring to
Referring again to
In one embodiment pertaining to 402, if the input image is not grayscale, it is transformed to an image where pixels are represented as YCbCr. For example, pixels may be converted from RGB values to YCbCr values. In one or more embodiments, the chroma values (Cb and Cr) may be ignored, and filtering and/or thresholds may be applied only with respect to the luma channel (Y), which represents a grayscale image. In some embodiments, the chroma values may be used to refine the filtering/threshold-applied results. For example, chroma values may be used to categorize pixels in more ways than luma alone allows, or to make a categorization decision about a pixel that is uncertain or less certain with reference only to the luma value.
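By way of a sketch, the luma channel may be derived from RGB with standard weights. The BT.601 coefficients below are an assumed choice; any standard RGB-to-YCbCr conversion would serve, and the chroma channels are simply not computed here because they may be ignored during thresholding.

// Luma (the Y of YCbCr) from RGB using BT.601 weights (an assumption;
// embodiments may use any standard conversion).
func luma(r: UInt8, g: UInt8, b: UInt8) -> UInt8 {
    let y = 0.299 * Double(r) + 0.587 * Double(g) + 0.114 * Double(b)
    return UInt8(min(255.0, y.rounded()))
}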
In some embodiments, item 402 (threshold application) includes binarization or trinarization of the image pixels or portions. Binarization transforms the image by representing each pixel or portion in one of only two states, such as black and white. If the image of
Adaptive Block Based Process
In some embodiments, binarization/trinarization removes effects such as uneven illumination, shadows, and shifted white levels. One particular embodiment employs a locally-adaptive, block-based algorithm with partially overlapping blocks. An example of this embodiment is shown in
Referring again to
Referring again to
In some embodiments, if the difference between minimum and maximum is larger than a second threshold (which may be equal to the first threshold), then each pixel or portion in the block may be evaluated/categorized as black or white. Following the example from above, if the minimum/maximum difference is 48 or above (or above a threshold), some embodiments may set a black point and evaluate each pixel in the block with respect to the black point. For example, any pixel with a value less than the black point may be classified as black, and any pixel with a value larger than the black point may be classified as white. In one or more embodiments, the black point is determined individually for each block. In addition, in some embodiments, the black point for a block may be (minimum+maximum)/2 or some other function of the minimum and maximum values.
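A minimal Swift sketch of the per-block classification just described follows. It mirrors the examples above (a contrast threshold of 48 and a black point of (minimum+maximum)/2), but collapses the first and second thresholds into a single value, omits the partial overlap between blocks, and uses a hypothetical GrayImage container.

// PixelClass as in the earlier sketch.
enum PixelClass { case black, white, gray }

struct GrayImage {
    let width: Int
    let height: Int
    var pixels: [UInt8]                 // row-major luma values
    subscript(x: Int, y: Int) -> UInt8 { pixels[y * width + x] }
}

// Classify one block: low-contrast blocks become all gray; otherwise a
// per-block black point splits pixels into black and white.
func classifyBlock(_ img: GrayImage, x0: Int, y0: Int, size: Int,
                   contrastThreshold: Int = 48) -> [PixelClass] {
    let xs = x0..<min(x0 + size, img.width)
    let ys = y0..<min(y0 + size, img.height)
    var lo = 255, hi = 0
    for y in ys {
        for x in xs {
            lo = min(lo, Int(img[x, y]))
            hi = max(hi, Int(img[x, y]))
        }
    }
    if hi - lo < contrastThreshold {
        // Minimum/maximum difference below the threshold: assume the
        // block contains no text.
        return Array(repeating: .gray, count: xs.count * ys.count)
    }
    let blackPoint = (lo + hi) / 2      // one suggested black point function
    var out: [PixelClass] = []
    for y in ys {
        for x in xs {
            out.append(Int(img[x, y]) < blackPoint ? .black : .white)
        }
    }
    return out
}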
After each block and/or pixel and/or portion in an image has been evaluated according to
Referring back to
Connected Component Builder
In some embodiments, blob detection 403 involves a connected component builder. In some embodiments, the connected component builder uses a trinarized or binarized image as discussed above for its input. The connected component builder may build groups of like pixels for black and/or white and/or gray pixels. In some embodiments, connected components are identified in one program pass through the pixels or portions, meaning that all pixels are swept through once before a result is achieved (e.g., the components are identified, related, and labeled if necessary in one pass).
A group of contemplated embodiments for a connected component builder is shown in
Referring now to
Referring again to
Referring to
Referring to
Referring now to
Referring to
Yet a third embodiment for finding related line segments involves keeping track of line segments in each row. For example, in these embodiments, all line segments are created in order from left to right and top to bottom. In other words, pixels are swept within rows from left to right and rows are investigated from top to bottom. Thus, referring to
Referring again to
Referring now to
In some embodiments, when adding a line segment to a connected component, the program may update and/or retain statistics for the connected component. In some embodiments, these statistics are retained or updated on the fly, meaning, for example, the work is performed during the program pass of the pixels 605. Examples of these statistical items are as follows:
In addition, regarding data structures 601 and 602, varying embodiments of the disclosure contemplate different data structures, although any suitable data structure may be used. In one embodiment, the line segment structure 601 is a form of run-length encoding. In the same or potentially other embodiments, the connected component structure 602 references or holds the segments in a linked list.
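Putting the pieces together, a one-pass, row-order builder over these structures might be sketched as follows. The sketch is an assumption-laden illustration, not the patented algorithm verbatim: line segments are created left to right within rows and rows are swept top to bottom (as in the third embodiment above), each segment is a run-length record (cf. 601), components accumulate segments and keep statistics current on the fly (cf. 602 and the statistics discussion above, though an array stands in for the linked list), and a segment simply joins the first overlapping same-color component from the previous row. A production builder would also merge two components bridged by one segment, e.g., via union-find.

enum PixelClass { case black, white, gray }   // as in the earlier sketches

struct LineSegment {                 // run-length record for one row span (cf. 601)
    let row: Int
    let start: Int                   // first column of the run
    let end: Int                     // last column of the run (inclusive)
    let color: PixelClass
}

final class Component {              // connected component record (cf. 602)
    var segments: [LineSegment] = [] // a linked list in some embodiments; an array here
    var pixelCount = 0               // statistics kept current on the fly
    var minX = Int.max, maxX = Int.min, minY = Int.max, maxY = Int.min
    func add(_ s: LineSegment) {
        segments.append(s)
        pixelCount += s.end - s.start + 1
        minX = min(minX, s.start); maxX = max(maxX, s.end)
        minY = min(minY, s.row);   maxY = max(maxY, s.row)
    }
}

// One pass over a trinarized image, rows top to bottom, pixels left to
// right. A segment joins the first same-color component that overlaps
// it in the previous row, or starts a new component.
func buildComponents(rows: [[PixelClass]]) -> [Component] {
    var components: [Component] = []
    var previousRow: [(segment: LineSegment, owner: Component)] = []
    for (y, row) in rows.enumerated() {
        var currentRow: [(segment: LineSegment, owner: Component)] = []
        var x = 0
        while x < row.count {
            let color = row[x]
            var end = x
            while end + 1 < row.count && row[end + 1] == color { end += 1 }
            defer { x = end + 1 }
            guard color != .gray else { continue }   // gray pixels are ignored
            let seg = LineSegment(row: y, start: x, end: end, color: color)
            let owner = previousRow.first(where: {
                $0.segment.color == color &&
                $0.segment.start <= seg.end && seg.start <= $0.segment.end
            })?.owner ?? {
                let c = Component()
                components.append(c)
                return c
            }()
            owner.add(seg)
            currentRow.append((seg, owner))
        }
        previousRow = currentRow
    }
    return components
}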
Referring again to
The program finds the closest connected component of the same color that is directly right of the current connected component (by looking for the closest neighbor to the right, the program need only check a small subset of connected components for each merge, which makes the algorithm faster). The two components may be merged into a sequence if one or more (and in some embodiments all) of the following criteria are met. The two connected components: are roughly the same size; are roughly horizontal (either within a larger frame of reference, such as the image, or only with respect to one another); have roughly the same area; are not too far from each other relative to their size (e.g., beyond a threshold that may be derived based upon connected component size); and do not have more than two holes each. The formed sequences, for example, may represent words, where each connected component may represent a letter.
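An illustrative Swift predicate for this merge test appears below, where a is a connected component and b is its closest same-color neighbor to the right. The Blob record and all numeric ratios/thresholds are assumptions; the text requires only the "roughly" comparisons, the proximity test, and the two-hole limit.

struct Blob {
    var minX = 0, maxX = 0, minY = 0, maxY = 0
    var area = 0                        // pixel count
    var holes = 0                       // enclosed background regions
    var width: Int { maxX - minX + 1 }
    var height: Int { maxY - minY + 1 }
    var centerY: Double { Double(minY + maxY) / 2 }
}

// True if blob b (the closest right neighbor of a) may be merged with
// a into a horizontal sequence. All thresholds are placeholder values.
func mayMerge(_ a: Blob, _ b: Blob) -> Bool {
    func roughlyEqual(_ x: Int, _ y: Int, ratio: Double = 2.0) -> Bool {
        let lo = Double(min(x, y)), hi = Double(max(x, y))
        return lo > 0 && hi / lo <= ratio
    }
    let size = max(a.height, b.height)
    return roughlyEqual(a.height, b.height)                // roughly the same size
        && roughlyEqual(a.area, b.area)                    // roughly the same area
        && abs(a.centerY - b.centerY) <= Double(size) / 2  // roughly horizontal
        && b.minX - a.maxX <= 2 * size                     // not too far apart
        && a.holes <= 2 && b.holes <= 2                    // at most two holes each
}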
In some embodiments, the sequences may be merged with connected components or other sequences in much the same way by finding the closest neighboring connected component (or sequence) to the right until no further merges are available. Of course, when merging sequences, some embodiments deemphasize requirements regarding the same size and area due to the expected length differential in the horizontal dimension. Furthermore, as shown in
After a sequence is created, statistics may be calculated for the sequence such as:
The statistics may be used for further analysis. For example, in some embodiments, any connected components at the beginning or end of a sequence that do not have roughly the same statistics as those calculated for the sequence as a whole may be removed from the sequence. In some embodiments, the “sameness” may be calculated based upon thresholds that are either absolute or derived from the statistics of the sequence or characteristics of the pixels 605 as analyzed.
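For instance, reusing the hypothetical Blob record from the previous sketch, end-of-sequence trimming might look like the following; the 1.5x height tolerance is an assumed stand-in for a threshold derived from the sequence statistics.

// Drop connected components at either end of a sequence whose height
// deviates too far from the sequence average (tolerance is assumed).
func trimmed(_ sequence: [Blob]) -> [Blob] {
    guard sequence.count > 2 else { return sequence }
    let avgHeight = Double(sequence.reduce(0) { $0 + $1.height }) / Double(sequence.count)
    func fits(_ b: Blob) -> Bool {
        Double(b.height) <= 1.5 * avgHeight && Double(b.height) >= avgHeight / 1.5
    }
    var result = sequence
    while let first = result.first, !fits(first) { result.removeFirst() }
    while let last = result.last, !fits(last) { result.removeLast() }
    return result
}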
In some embodiments, the slant of a sequence is calculated. Furthermore, any connected components that do not belong to any sequence but are spatially close to a sequence are evaluated as potential punctuation or diacritic marks, depending on their size, shape, position, and number of holes.
In other embodiments of blob grouping 404, a top-down approach may be used where the program starts with larger connections (instead of a single connected component) and works down to smaller components.
Referring again to
Consider eliminating sequences or connected components based upon the number of connected components that are within a sequence's bounding box, but do not belong to the sequence (the number perhaps subject to a threshold). In some embodiments, this test relates best to filtering out textures.
Consider eliminating sequences or connected components based upon whether the average width of the sequence is comparable to the width of the component (the difference perhaps subject to a threshold). In some embodiments, this test relates best to filtering out textures.
Consider eliminating sequences or connected components based upon a comparison of a connected component's height with the average height within the sequence (the difference perhaps subject to a threshold). In some embodiments, this test relates best to filtering out textures.
Consider eliminating sequences or connected components based upon horizontal overlap of sequences/components (the size of the overlap perhaps subject to a threshold). In some embodiments, this test relates best to filtering out textures.
Consider eliminating sequences or connected components based upon a comparison of vertical variance with the sequence slant (the difference perhaps subject to a threshold). In some embodiments, this test relates best to filtering out textures.
Consider eliminating sequences or connected components based on the amount of noise in the background of the sequence bounding box (the amount of noise perhaps subject to a threshold). In some embodiments, this test relates best to filtering out textures.
Consider eliminating sequences or connected components based upon the kurtosis of the histogram of the pixels belonging to all the connected components in a sequence (the sharpness of the peak perhaps subject to a threshold). In some embodiments, this test relates best to filtering out textures.
Consider eliminating sequences or connected components based upon the connected component fill degree (the fill degree perhaps subject to a threshold). In some embodiments, this test relates best to filtering out barcodes or rows of windows.
Consider eliminating sequences or connected components based upon how the connected component circumference compares to bounding box size (the difference perhaps subject to a threshold). In some embodiments, this test relates best to filtering out barcodes or rows of windows.
Consider eliminating sequences or connected components based upon how the connected component's or sequence's longest line segment compares to the bounding box (the size difference perhaps subject to a threshold). In some embodiments, this test relates best to filtering out barcodes or rows of windows.
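One natural way to organize the elimination tests above in code is as a list of rejection predicates over per-sequence statistics; a sequence survives only if no predicate fires, as sketched below. Everything in the sketch is an assumption: the SequenceStats fields model only a few of the tests, and every numeric threshold is a placeholder.

struct SequenceStats {
    var strayComponentsInBox = 0   // components inside the box but not in the sequence
    var widthDeviation = 0.0       // component width vs. sequence average
    var fillDegree = 0.0           // foreground pixels / bounding box area
    var backgroundNoise = 0.0      // noise within the sequence bounding box
}

typealias RejectionTest = (SequenceStats) -> Bool

// Each predicate returns true when the sequence should be eliminated.
let rejectionTests: [RejectionTest] = [
    { $0.strayComponentsInBox > 3 },   // texture filter (threshold assumed)
    { $0.widthDeviation > 2.0 },       // texture filter (threshold assumed)
    { $0.fillDegree > 0.9 },           // barcode/row-of-windows filter (assumed)
    { $0.backgroundNoise > 0.5 },      // noisy-background filter (assumed)
]

func looksLikeText(_ stats: SequenceStats) -> Bool {
    !rejectionTests.contains { $0(stats) }
}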
Referring again to
Multiple Scales
Some embodiments of the disclosure apply the process of
The process of
The process begins with the sequences from the original input image and moves downward in size so that the largest downscaled image (e.g., ½) is analyzed first and the smallest downscaled image is analyzed last.
An array is created to hold all merged sequences and, in some embodiments, all the sequences from the original image.
Starting with the largest downscaled image (e.g., the ½ scale image), the program loops through the identified sequences one by one and compares them to the sequences in the merged array (which at this point would only contain the sequences from the original image). If there is overlap in the bounding box for any of the sequences, the individual connected components of the overlapping sequences are also compared. If the program finds significant overlap between any of the connected components of the full-scale sequence and the downscaled sequence (e.g., the ½ scale sequence), the sequences are classified as having detected the same text in the image. The program then finds the sequence in the merged array that has the largest number of overlapping connected components and determines whether to replace it with the ½ scale sequence. The ½ scale sequence may be chosen if it is wider and has at least two more connected components than the full-scale sequence. If the ½ scale sequence is chosen, the program removes all sequences in the merged array that have overlapping connected components with the ½ scale sequence.
Once the program has run through all sequences in the largest downscaled image (e.g., the ½ scale image), the same process is performed for the next downscaled image (e.g., the ¼ scale image), at which point the program is comparing the merged array (which contains sequences from both the full scale image and the ½ scale image) with the sequences from ¼ scale image. This process is continued until the scaled versions have been merged.
The process of this embodiment protects against returning multiple sequences for the same text in the image. Some embodiments choose the widest sequence because a sequence detected as two words in one scale may often be detected as a sentence in another.
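A condensed Swift sketch of this merging pass follows, under stated assumptions: sequence coordinates have already been mapped back to full-image scale, "significant overlap" is simplified to bounding-box intersection of components, and a downscaled sequence that overlaps nothing is appended as newly detected text (a detail the description above leaves open). The TextSequence record is hypothetical.

struct Box { var minX = 0, minY = 0, maxX = 0, maxY = 0 }

struct TextSequence {
    var bounds: Box
    var componentBoxes: [Box]
    var width: Int { bounds.maxX - bounds.minX + 1 }
}

func overlaps(_ a: Box, _ b: Box) -> Bool {
    a.minX <= b.maxX && b.minX <= a.maxX && a.minY <= b.maxY && b.minY <= a.maxY
}

// Number of components of a that intersect some component of b.
func overlappingComponents(_ a: TextSequence, _ b: TextSequence) -> Int {
    a.componentBoxes.reduce(0) { n, box in
        n + (b.componentBoxes.contains { overlaps($0, box) } ? 1 : 0)
    }
}

// Merge sequences found at several scales, largest downscale first
// (e.g., 1/2, then 1/4, ...), seeded with the full-scale sequences.
func mergeScales(original: [TextSequence], downscaled: [[TextSequence]]) -> [TextSequence] {
    var merged = original
    for scale in downscaled {
        for candidate in scale {
            let rivals = merged.map { overlappingComponents(candidate, $0) }
            guard let best = rivals.indices.max(by: { rivals[$0] < rivals[$1] }),
                  rivals[best] > 0 else {
                merged.append(candidate)        // overlaps nothing: new text
                continue
            }
            let rival = merged[best]
            // Prefer the downscaled sequence if it is wider and has at
            // least two more connected components than its best rival.
            if candidate.width > rival.width &&
               candidate.componentBoxes.count >= rival.componentBoxes.count + 2 {
                merged.removeAll { overlappingComponents(candidate, $0) > 0 }
                merged.append(candidate)
            }
        }
    }
    return merged
}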
API
The discussion above has described embodiments wherein a program interface may be used to access the text detection described herein. The API form envisioned by many embodiments herein is similar to an API published by Apple called CIDetector, which is further described in the provisional application incorporated herein by reference (provisional application No. 62/172,117, filed Jun. 7, 2015). In summary, a CIDetector object uses image processing to search for and identify notable features (faces, rectangles, and barcodes) in a still image or video. Detected features are represented by CIFeature objects that provide more information about each feature. In some envisioned embodiments of the current disclosure, with respect to a text detector, it may be created by calling
(CIDetector*)detectorOfType:(NSString*)type
context:(CIContext*)context
options:(NSDictionary*)options
Where type is one of CIDetectorTypeFace, CIDetectorTypeRectangle, or CIDetectorTypeQRCode.
The newly added type will be CIDetectorTypeText.
Once an appropriate CIDetector is created, it may detect text in an image by calling either of
(NSArray*)featuresInImage:(CIImage*)image
(NSArray*)featuresInImage:(CIImage*)image options:(NSDictionary*)options
In some embodiments, if any text is detected, the function will return an array of CITextFeature objects. By way of example, the CIFeature Class Reference is also part of provisional application No. 62/172,117.
Some embodiments of this disclosure contemplate that CITextFeature has similar properties to the CIQRCodeFeature. In certain embodiments, CITextFeature has no messageString property. The CIQRCodeFeature Class Reference is also disclosed in provisional application No. 62/172,117.
As discussed above, embodiments of the disclosure employ a bounding box. A framework embodiment implementation of that concept is apparent in that CITextFeature has a bounding box as follows:
@property(readonly, assign) CGRect bounds
And four corners:
@property(readonly, assign) CGPoint bottomLeft
@property(readonly, assign) CGPoint bottomRight
@property(readonly, assign) CGPoint topLeft
@property(readonly, assign) CGPoint topRight
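For completeness, a short Swift sketch of driving this API: it creates a text detector, runs featuresInImage:, and collects the bounding boxes of any CITextFeature results. The accuracy option is shown only as an example.

import CoreImage

func detectTextBounds(in image: CIImage) -> [CGRect] {
    // Create a detector of the text type; a nil context lets Core Image
    // manage its own.
    let detector = CIDetector(ofType: CIDetectorTypeText,
                              context: nil,
                              options: [CIDetectorAccuracy: CIDetectorAccuracyHigh])
    // Text regions are returned as CITextFeature objects, each with a
    // bounding box and four corner points.
    let features = detector?.features(in: image) ?? []
    return features.compactMap { ($0 as? CITextFeature)?.bounds }
}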
It is to be understood that the above description is intended to be illustrative, and not restrictive. The material has been presented to enable any person skilled in the art to make and use the invention as claimed and is provided in the context of particular embodiments, variations of which will be readily apparent to those skilled in the art (e.g., many of the disclosed embodiments may be used in combination with each other). In addition, it will be understood that some of the operations identified herein may be performed in different orders. The scope of the invention therefore should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.”
This patent application claims priority to provisional patent application No. 62/172,117, filed Jun. 7, 2015, which is hereby incorporated by reference in its entirety.