This patent application relates to devices and methods for applying rules (called “clustering rules”) to check whether or not blocks of one or more regions in an image should be merged, prior to classification of the blocks as text or non-text.
Identification of text regions in documents that are scanned (e.g. by an optical scanner of a printer or copier) is significantly easier than detecting text regions in images generated by a handheld camera, of scenes in the real world (also called “natural images”). Specifically, optical character recognition (OCR) methods of the prior art originate in the field of document processing, wherein the document image contains a series of lines of text (e.g. 20 lines of text) of a scanned page in a document. Document processing techniques, although successfully used on scanned documents created by optical scanners, generate too many false positives and/or negatives so as to be impractical when used on natural images. Hence, detection of text regions in a real world image generated by a handheld camera is performed using different techniques. For additional information on techniques used in the prior art, to identify text regions in natural images, see the following articles that are incorporated by reference herein in their entirety as background:
Image processing techniques of the prior art described above appear to be developed primarily to identify regions in images that contain text which is written in the English language. Use of such techniques to identify, in natural images, regions of text in other languages that use different scripts for letters of their alphabets can result in false positives and/or negatives so as to render the techniques impractical.
Additionally, presence in natural images, of text written in non-English languages, such as Hindi can result in false positives/negatives, when technique(s) developed primarily for identifying text in the English language are used in classification of regions as text/non-text. For example, although blocks in regions that contain text in the English language may be correctly classified to be text (e.g. by a neural network), one or more blocks 103A, 103B, 103C and 103D (
One or more prior art criteria that are used by a classifier to identify text in natural images can be relaxed, so that blocks 103A-103D are then classified as text, but on doing so one or more portions of another region 105 (
Moreover, when a natural image 107 (
Accordingly, there is a need to improve the identification of regions of text in a natural image or video frame, as described below.
In several aspects of described embodiments, an electronic device and method use a camera to capture an image of an environment outside the electronic device followed by identification of blocks that enclose regions of pixels in the image, with each region being initially included in a single block. Depending on the embodiment, each region may be identified to have pixels contiguous with one another and including a local extremum (maximum or minimum) of intensity in the image, e.g. a maximally stable extremal region (MSER). In some embodiments, each block that contains such a region (which may constitute a “connected component”) is tested for presence of a line of pixels binarizable to a common value (“pixel-line-present” block), followed by identification of one or more blocks adjacent thereto which are then tested for merger as follows.
One or more processors of several embodiments execute computer instructions (also called “first instructions”) to test for overlap of projections, between a projection of a pixel-line-present block on to a line (e.g. x-axis) and another projection of an adjacent block on to the same line (e.g. x-axis also). When one or more such test(s) for overlap on a line of projections (also called “supports” or “spans”) of blocks is/are satisfied, these two blocks are automatically merged with one another by one or more processors executing computer instructions (also called “second instructions”), although at the time of merger it is not known whether the blocks being merged contain text or non-text. An additional test that may be performed prior to merger of two blocks may be based on, for example, relative heights of the two blocks, and/or aspect ratio of either or both blocks, etc. Information on a merged block that is obtained as a result of merging of two or more blocks is stored in memory by one or more processors executing computer instructions (also called “third instructions”). The merged block is then processed further in certain embodiments, e.g. subject to verification of presence of a pixel line, and followed by classification of the merged block as text or non-text.
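The overlap test on projections can be illustrated with a minimal sketch, assuming each block is an axis-aligned rectangle. The `Block` type, the function names, and the `min_overlap` threshold are hypothetical and are not taken from any described embodiment.

```python
from typing import NamedTuple


class Block(NamedTuple):
    # Axis-aligned rectangle; field names are illustrative assumptions.
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def x_projection_overlap(a: Block, b: Block) -> int:
    """Length of the overlap between the x-axis projections (spans) of a and b."""
    return max(0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))


def may_merge(a: Block, b: Block, min_overlap: int = 1) -> bool:
    # Hypothetical rule: the spans must overlap by at least min_overlap pixels.
    # Text/non-text status of either block is not consulted.
    return x_projection_overlap(a, b) >= min_overlap
```

An actual embodiment may combine such a test with additional tests, e.g. on relative heights or aspect ratios of the two blocks.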
Depending on the embodiment, classification of a merged block (with multiple connected components therein) as text or non-text may use one or more predetermined attributes of the merged block, such as location and thickness of a line of pixels binarizable to a common binary value and oriented longitudinally in the merged block (e.g. parallel to or within a small angle of, whichever side of the block is longer). The just-described classification of the merged block may additionally or alternatively use: a number of lines of pixels oriented laterally in the merged block (e.g. vertically), and a number of black-white transitions (and number of white-black transitions) in a subset of rows located below the line of pixels, when the line of pixels is located in an upper portion of the merged block (e.g. within 30% or 40% of the merged block's height, measured from a top side of the merged block).
In some embodiments, one or more of: identification of blocks, testing for overlap of projections on to a common line, merger of blocks that satisfy tests, followed by text/non-text classification as described above are performed by one or more processor(s) operatively coupled to memory and configured to execute computer instructions stored in the memory (or in another non-transitory computer readable storage media). Moreover, in some embodiments, one or more non-transitory storage media include a plurality of computer instructions, which when executed, cause one or more processors in a handheld device to perform operations, and these computer instructions include computer instructions to perform one or more of: identification of blocks, testing for overlap of projections on to a common line, merger of blocks that satisfy tests, followed by text/non-text classification described above.
In certain embodiments, one or more acts of the type described above are performed by a mobile device (such as a smart phone) that includes a camera, a memory operatively connected to the camera to receive images therefrom, and at least one processor operatively coupled to the memory to execute computer instructions stored in the memory (or in another non-transitory computer readable storage media). On execution of the computer instructions, the processor processes an image to check two blocks that are adjacent to one another in the image for satisfying one or more predetermined rules (e.g. based on geometric attributes of the blocks), and on finding the rule(s) to be satisfied merging the two blocks to generate a merged block, subsequently classifying the merged block as text or non-text, followed by OCR of blocks that are classified as text (in the normal manner). In some embodiments, an apparatus includes several means implemented by logic in hardware or logic in software or a combination thereof, to perform one or more acts described above.
It is to be understood that several other aspects will become readily apparent to those skilled in the art from the description herein, wherein it is shown and described various embodiments by way of illustration. The drawings and detailed description below are to be regarded as illustrative in nature and not as restrictive.
A number of regions of an image of a real world scene (e.g. an image 107 of a newspaper 100 in
Merger software 141 of some embodiments, when executed by one or more processors, identifies blocks of regions in an image (in memory) that can be merged with one another, as described in U.S. application Ser. No. 13/748,539, Attorney Docket No. Q111559Usos, filed concurrently herewith, entitled “Identifying Regions of Text to Merge In A Natural Image or Video Frame” which is incorporated herein by reference in its entirety, above. Blocks that are identified as candidates for merger are thereafter subject to certain predetermined rules (also called clustering rules) as described below, and when these rules are found to be satisfied the blocks are merged, even though it is not known whether the blocks are text or non-text.
Specifically, an image 107 (e.g. a hand-held camera captured image) received by a processor 1013 of mobile device 200 in certain described embodiments, as per act 211 in
After receipt of image 107, processor 1013 in described embodiments identifies, as per act 212 in
Specifically, MSERs that are identified in act 212 of some embodiments are regions that are geometrically contiguous (with any one pixel in the region being reachable from any other pixel in the region by traversal of one or more pixels that contact one another in the region) with monotonic transformation in property values, and invariant to affine transformations (transformations that preserve straight lines and ratios of distances between points on the straight lines). Boundaries of MSERs may be used as connected components in some embodiments described herein, to identify regions of an image, as candidates for recognition as text.
In several of the described embodiments, regions in image 107 are automatically identified in act 212 based on variation in intensities of pixels by use of a method of the type described by Matas et al., e.g. in an article entitled “Robust Wide Baseline Stereo from Maximally Stable Extremal Regions” Proc. Of British Machine Vision Conference, pages 384-396, published 2002 that is incorporated by reference herein in its entirety. The time taken to identify MSERs in an image can be reduced by use of a method of the type described by Nister, et al., “Linear Time Maximally Stable Extremal Regions”, ECCV, 2008, Part II, LNCS 5303, pp 183-196, published by Springer-Verlag Berlin Heidelberg that is also incorporated by reference herein in its entirety. Another such method is described in, for example, an article entitled “Robust Text Detection In Natural Images With Edge-Enhanced Maximally Stable Extremal Regions” by Chen et al, IEEE International Conference on Image Processing (ICIP), September 2011 that is incorporated by reference herein in its entirety.
The current inventors note that prior art methods of the type described by Chen et al. or by Matas et al. or by Nister et al. identify hundreds of MSERs, and sometimes identify thousands of MSERs in an image 107 that includes details of natural features, such as leaves of a tree or leaves of plants, shrubs, and bushes. Hence, use of MSER methods of the type described above results in identification of regions whose number depends on the content within the image 107. Moreover, a specific manner in which pixels of a region differ from surrounding pixels at the boundary identified by such an MSER method may be predetermined in some embodiments by use of a lookup table in memory. Such a lookup table may supply one or more specific combinations of values for the parameters Δ and Max Variation, which are input to an MSER method (also called MSER input parameters). Such a lookup table may be populated ahead of time, with specific values for Δ and Max Variation, e.g. determined by experimentation to generate contours that are appropriate for recognition of text in a natural image, such as value 8 for Δ and value 0.07 for Max Variation.
In some embodiments, pixels are identified in a set (which may be implemented in a list) that in turn identifies a region Qi which includes a local extrema of intensity (such as local maxima or local minima) in the image 107. Such a region Qi may be identified in act 212 (
Other methods that can be used to identify such regions in act 212 may be similar or identical to methods for identification of connected components, e.g. as described in an article entitled “Application of Floyd-Warshall Labelling Technique: Identification of Connected Pixel Components In Binary Image” by Hyunkyung Shin and Joong Sang Shin published in Kangweon-Kyungki Math. Jour. 14 (2006), No. 1, pp. 47-55 that is incorporated by reference herein in its entirety, or as described in an article entitled “Fast Connected Component Labeling Algorithm Using A Divide and Conquer Technique” by Jung-Me Park, Carl G. Looney and Hui-Chuan Chen believed to be published in Matrix (2000), Volume: 4, Issue: 1, Publisher: Elsevier LTD, pp 4-7 that is also incorporated by reference herein in its entirety.
A specific manner in which regions of an image 107 are identified in act 212 by mobile device 200 in described embodiments can be different, depending on the embodiment. Each region of image 107 that is identified by use of an MSER method of the type described above is represented by act 212 in the form of a list of pixels, with two coordinates for each pixel, namely the x-coordinate and the y-coordinate in two dimensional space (of the image).
After identification of regions, each region is initially included in a single rectangular block which may be automatically identified by mobile device 200 of some embodiments in act 212, e.g. as a minimum bounding rectangle of a region, by identification of a largest x-coordinate, a largest y-coordinate, a smallest x-coordinate and a smallest y-coordinate of all pixels within the region. The just-described four coordinates may be used in act 212, or subsequently when needed, to identify the corners of a rectangular block that tightly fits the region. As discussed below, such a block (and therefore its four corners) may be used in checking whether a predetermined rule is satisfied, e.g. by one or more geometric attributes of the block relative to an adjacent block (such as overlap of projection (“support”) on a common line). Also, a block's four sides may need to be identified, in order to identify all pixels in the block and their binarizable values, followed by generation of a profile of counts of pixels of a common binary value. When needed, four corners of a rectangular block that includes a region may be identified, e.g. as follows:
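Purely as an illustration, and not as the procedure of any particular embodiment, the computation of a tightly fitting rectangular block from a region's list of pixels may be sketched as below. The function names and the tuple layout `(x_min, y_min, x_max, y_max)` are assumptions made for this sketch.

```python
def bounding_block(pixels):
    """Return (x_min, y_min, x_max, y_max) for a non-empty list of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def corners(block):
    """Four corners of the block, starting at (x_min, y_min) and going around."""
    x_min, y_min, x_max, y_max = block
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]
```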
After a set of blocks are identified in act 212, such as block 302 in
Processor 1013 of certain embodiments is programmed, in any deterministic manner that will be apparent to the skilled artisan in view of this detailed description, to determine occurrence of pixels that are binarizable to a value 1 (or alternatively to value 0) along a straight line defined by the equation y=mx+c and that satisfy a specific test that is predetermined. For example, some embodiments of processor 1013 may be programmed to simply enter the x,y coordinates of all pixels of a block 302 into such an equation, to determine for how many pixels in the block (that are binarizable to value 1) is such an equation satisfied (e.g. within preset limits). For example, to check if there is a straight line that is oriented parallel to the x-axis in block 302, processor 1013 may be programmed to set the slope m=0, then check if there are any pixels in block 302 at a common y co-ordinate (with a value of the constant c in the above-identified equation), which can be binarized to the value 1 (and then repeat for the value 0). In this manner, processor 1013 may be programmed to use a series of values (e.g. integer values) of constant “c” in the equation, to check for presence of lines parallel to the x-axis, at different values of y-coordinates of pixels in block 302.
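The brute-force check described above can be sketched as follows, assuming the input is a list of (x, y) positions of pixels already binarized to value 1. The function names and the tolerance `tol` (standing in for the "preset limits") are hypothetical.

```python
def count_on_line(pixels, m, c, tol=0.5):
    """Count pixels (x, y) that satisfy y = m*x + c to within +/- tol."""
    return sum(1 for x, y in pixels if abs(y - (m * x + c)) <= tol)


def horizontal_line_counts(pixels, y_min, y_max):
    """With slope m = 0, try each integer value of c from y_min to y_max."""
    return {c: count_on_line(pixels, 0, c) for c in range(y_min, y_max + 1)}
```

The value of c with the largest count is then a candidate location for a line parallel to the x-axis.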
During operation 220 (of pixel-line-presence detection), processor 1013 of some embodiments performs at least three acts 221-223 as follows. Specifically, in act 221 of several embodiments, processor 1013 is programmed to perform an initial binarization of pixels of a block 302 which is fitted around a region (e.g. the region in
Next, in act 222, processor 1013 is programmed to test for the presence or absence of a straight line passing through positions of the just-described binary-valued pixels (resulting from act 221) in the block. For example, in act 222 of some embodiments, processor 1013 checks whether a test (“pixel-line-presence test” or simply “line-presence test”) is satisfied, for detecting the presence of a line segment 305 (
Along the straight line 304 shown in
In one example, act 222 determines that a pixel line is present in block 302 along straight line 304 when straight line 304 is found to have the maximum number of black pixels (relative to all the lines tested in block 302). In another example, act 222 further checks that the maximum number of black pixels along straight line 304 is larger than a mean of black pixels along the lines being tested by a predetermined amount and if so then block 302 is determined to have a pixel line present therein. The same test or a similar test may be alternatively performed with white pixels in some embodiments of act 222. Moreover, in some embodiments of act 212, the same test or a similar test may be performed on two regions of an image, namely the regions called MSER+ and MSER−, generated by an MSER method (with intensities inverted relative to one another).
In some embodiments, block 302 is subdivided into rows oriented parallel to the longitudinal direction of block 302. Some embodiments of act 222 prepare a histogram of counters, based on pixels identified in a list of positions indicative of a region, with one counter being used for each unit of distance (“bin” or “row”) along a height (in a second direction, which is perpendicular to a first direction (e.g. the longitudinal direction)) of block 302. In some embodiments, block 302 is oriented with its longest side along the x-axis, and act 222 is performed by sorting pixels by their y-coordinates followed by binning (e.g. counting the number of pixels) at each intercept on the y-axis (which forms a bin), followed by identifying a counter which has the largest value among counters. Therefore, the identified counter identifies a peak in the histogram, which is followed by checking whether a relative location of the peak (along the y-axis) happens to be within a predetermined range, e.g. top ⅓rd of block height, and if so the pixel-line-presence test is met. So, a result of act 222 in the just-described example is that a pixel line (of black pixels) has been found to be present in block 302.
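A minimal sketch of the histogram step, assuming the block is oriented with its longest side along the x-axis and the region is given as a list of (x, y) positions; the function names are illustrative only.

```python
from collections import Counter


def row_histogram(pixels):
    """One counter per y-coordinate (bin), counting the pixels in that row."""
    return Counter(y for _, y in pixels)


def peak_row(pixels):
    """Return (row, count) for the bin with the largest counter.

    If several rows share the largest count, the row encountered first in
    the pixel list is returned.
    """
    hist = row_histogram(pixels)
    row = max(hist, key=hist.get)
    return row, hist[row]
```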
In several aspects of the described embodiments, processor 1013 is programmed to perform an act 223 to mark in a storage element 381 of memory 1012 (by setting a flag), based on a result of act 222, e.g. that block 302 has a line of pixels present therein (or has no pixel line present, depending on the result). Instead of setting the flag in storage element 381, block 302 may be identified in some embodiments as having a pixel line present therein, by including an identifier of block 302 in a list 1501 of identifiers (
After performance of act 223 (
After operation 220, processor 1013 of some embodiments performs operation 230 wherein a block 302 which has been marked as pixel-line-present is tested for possible merger with one or more blocks that are adjacent to block 302, e.g. by applying one or more predetermined rules. Processor 1013 of some embodiments is programmed to perform an operation 230 (also called “merger operation”) which includes at least three acts 231, 232 and 233 as follows. In act 231, each block which has no intervening block between itself and a pixel-line-present block, and which is located at less than a specified distance (e.g. half of height of pixel-line-present block), is identified and marked in memory 1012 as “adjacent.”
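One possible reading of the adjacency test of act 231 is sketched below, under simplifying assumptions that are not stated in the source: blocks are tuples `(x_min, y_min, x_max, y_max)`, only neighbors to the left and right are considered, blocks whose x-spans overlap the anchor block are ignored, and "no intervening block" is approximated by taking the nearest block on each side. All names are hypothetical.

```python
def horizontal_gap(a, b):
    """Gap along x between two blocks; 0 if their x-spans touch or overlap."""
    return max(0, max(a[0], b[0]) - min(a[2], b[2]))


def adjacent_blocks(anchor, others, fraction=0.5):
    """Nearest left/right neighbors closer than fraction * anchor height."""
    limit = fraction * (anchor[3] - anchor[1])
    left = [b for b in others if b[2] <= anchor[0]]
    right = [b for b in others if b[0] >= anchor[2]]
    nearest = []
    if left:
        nearest.append(max(left, key=lambda b: b[2]))   # closest on the left
    if right:
        nearest.append(min(right, key=lambda b: b[0]))  # closest on the right
    return [b for b in nearest if horizontal_gap(anchor, b) < limit]
```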
In some embodiments of act 231, mobile device 200 uses each block 302 that has been marked as pixel-line-present in act 222, to start looking for and marking in memory 1012 (e.g. in a list 1502 in
In act 232 of some embodiments, processor 1013 merges a pixel-line-present block with a block adjacent to it, as identified in act 231. On completion of the merger, pixels in the merged block include at least pixels in the pixel-line-present block and pixels in the adjacent block (which may or may not have a pixel line present therein). A specific technique that is used in act 232 to merge two adjacent blocks can be different, depending on the embodiment.
In some embodiments, a first list of positions of pixels of a first region 403R in a block 403 (
After act 233, processor 1013 of some embodiments returns to act 231 to identify an additional block that is located adjacent to the merged block. The additional block which is adjacent to the merged block (e.g. formed by merging a first block and a second block) may be a third block which has a third region therein. Therefore, in act 232 of some embodiments, processor 1013 merges a merged set of positions of the merged block with an additional set of positions of the third region in the third block. Depending on the image, the additional block which is adjacent to a merged block may itself be another merged block (e.g. formed by merging a third block with a third region therein and a fourth block with a fourth region therein). At least one of the third block and the fourth block is marked as pixel-line-present (for these two blocks to have been merged to form the additional block). Hence, act 232 in this example merges two blocks each of which is itself a merged block. Accordingly, the result of act 232 is a block that includes positions of pixels in each of the first block, the second block, the third block and the fourth block.
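The repeated merging of sets of positions can be sketched as a simple union, assuming each region is a list of (x, y) positions. The function name is hypothetical, and a real embodiment may represent regions differently.

```python
def merge_regions(*position_lists):
    """Union of pixel positions from any number of regions, sorted by (x, y)."""
    merged = set()
    for positions in position_lists:
        merged.update(positions)
    return sorted(merged)
```

Because the result is itself a list of positions, a merged block can be passed back in and merged again with a third block, or with another merged block.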
In some embodiments, act 232 is performed conditionally, only when certain predetermined rules are met by two blocks that are adjacent to one another. Specifically, in such embodiments, whether or not two adjacent blocks can be merged is typically decided by application of one or more rules that are predetermined (called “clustering rules”), which may be based on attributes and characteristics of a specific script, such as Devanagari script. The predetermined rules, although based on properties of a predetermined script of a human language, are applied in operation 230 of some embodiments, regardless of whether the two or more blocks being tested for merger are text or non-text. Different embodiments use different rules in deciding whether or not to merge two blocks, and hence specific rules are not critical to several embodiments. The one or more predetermined rules applied in operation 230 either individually or in combination with one another, to determine whether or not to merge a pixel-line-present block with its adjacent block may be different, depending on the embodiment, and specific examples are described below in reference to
As noted above, it is not known to processor 1013, at the time of performance of operation 220, whether any region(s) in a block 403 (
Depending on the content of the image, a block which is marked by operation 220 as pixel-line-present may have a region representing a non-text feature in the image, e.g. a branch of a tree, or a light pole. Another block of the image, similarly marked by operation 220 as pixel-line-present, may have a region representing a text feature in the image, e.g. text with the format strike-through (in which a line is drawn through middle of text), or underlining (in which a line is drawn through bottom of text), or shiro-rekha (a headline in Devanagari script). So, operation 220 is performed prior to classification, as text or non-text, of any pixels in the regions that are being processed in operation 220.
In some embodiments, block 302, which is marked in memory 1012 as “pixel-line-present”, contains an MSER whose boundary may (or may not) form one or more characters of text in certain languages. In some languages, characters of text may contain and/or may be joined to one another by a line segment formed by pixels in contact with one another and spanning across a word formed by the multiple characters, as illustrated in
Specifically, in some embodiments, after operation 230, mobile device 200 performs operation 240 which includes several acts that are performed normally prior to OCR, such as geometric rectification of scale (by converting parallelograms into rectangles, re-sizing blocks to a predetermined height, e.g. 48 pixels) and/or detecting and correcting tilt or skew. Hence, depending on the embodiment, a merged block obtained from operation 230 may be subject to skew correction, with or without user input. For example, skew may be detected and corrected via user input as described in U.S. patent application Ser. No. 13/748,562, Attorney Docket No. Q112726USos, filed concurrently herewith, entitled “Detecting and Correcting Skew In Regions Of Text In Natural Images” which is incorporated herein by reference in its entirety, above.
Operation 240 (for verification) of several embodiments further includes re-doing the binarization in act 241 (see
Pixel intensities that are used in binarization and in pixel-line-presence test in operation 240 (
Accordingly, in several embodiments, binarization and pixel-line-presence test are performed twice, a first time in operation 220 and a second time in operation 240. So, a processor 1013 is programmed with computer instructions in some embodiments, to re-do the binarization and pixel-line-presence test, initially on pixels in at least one block and subsequently on pixels in a merged block (obtained by merging the just-described block and one or more blocks adjacent thereto). Note that at the time of performance of each of operations 220 and 240 it is not known whether or not the pixels (on which the operations are being performed) are text or non-text. This is because classification of pixels as text or non-text in operation 250 is performed after performance of both operations 220 and 240. Performing binarization and pixel-line-presence test twice, while the pixels are not yet classified as text/non-text, as described is believed to improve accuracy subsequently, in operations 250 and 260 (described below).
A merged block that passes the pixel-line-presence test in operation 240 is thereafter subject to classification as text or non-text, in an operation 250. Operation 250 may be performed in the normal manner, e.g. by use of a classifier that may include a neural network. Such a neural network may use learning methods of the type described in, for example, U.S. Pat. No. 7,817,855 that is incorporated by reference herein in its entirety. Alternatively, operation 250 may be performed in a deterministic manner, depending on the embodiment.
After operation 250, a merged block that is classified as text is processed by an operation 260 to perform optical character recognition (OCR) in the normal manner. Therefore, processor 1013 supplies information related to a merged block (such as coordinates of the four corners) to an OCR system, in some embodiments. During OCR, processor(s) 1013 of certain embodiments obtains a sequence of sub-blocks from the merged block in the normal manner, e.g. by subdividing (or slicing). Sub-blocks may be sliced from a merged block using any known method e.g. based on height of the text region, and a predetermined aspect ratio of characters and/or based on occurrence of spaces outside the boundary of pixels identified as forming an MSER region but within the text region. The result of slicing in operation 260 is a sequence of sub-blocks, and each sub-block is then subject to optical character recognition (OCR).
Specifically, in operation 260, processor(s) 1013 of some embodiments form a feature vector for each sub-block and then decode the feature vector, by comparison to corresponding feature vectors of letters of a predetermined alphabet, to identify one or more characters (e.g. alternative characters for each block, with a probability of each character), and use one or more sequences of the identified characters with a repository of character sequences, to identify and store in memory 1012 (and/or display on a touch-sensitive screen 1001 or a normal screen 1002) a word identified as being present in the merged block.
As noted above, it is not known in operation 220, whether or not block 302 (
Note however, that even when text is actually contained in block 302, a line segment 305 of pixels that is detected in operation 220 may be oriented longitudinally relative to a block 302 (
Depending on the font and the script of the text in an image, lines of pixels of a common binary value that traverse a block need not be strictly longitudinal or strictly lateral. Instead, a longitudinally-oriented line of pixels can be but need not be longitudinal. So, a longitudinally-oriented line in a block may be at a small angle (e.g. less than 20° or 10°) relative to a top side (or bottom side) of the block, depending on a pose of (i.e. position and orientation of) camera 1011 relative to a scene. When block 302 has its longitudinal direction oriented parallel (or within the small angle) to the x-axis (e.g. after geometric rectification, scaling and tilt correction), a longitudinal pixel line through block 302 has a constant y coordinate, which is tested in some embodiments by setting slope m to zero and using a series of values of constant “c”, as described above.
A pixel-line-presence test used in act 222 (
In certain illustrative embodiments, the language identified by processor 1013 is Hindi, and the pixel-line-presence test that is selected in act 202 (
On completion of operation 220 (
Accordingly, accuracy in identifying text regions of a natural image (or video frame) is higher when using blocks that have been merged (based on presence of a pixel line in or between multiple characters) than the accuracy of prior art methods known to the inventors. For example, OCR accuracy of block 425 (
As noted above in reference to act 222 of operation 220,
The just-described binarization technique is just one example, and other embodiments may apply other techniques that are readily apparent in view of this disclosure. In a simpler example, the current pixel's intensity may be compared to just a mean intensity across all pixels in block 302, and if the current pixel's intensity exceeds the mean, the current pixel is marked as 1 (in act 314) else the current pixel is marked as 0 (in act 315). Hence, mobile device 200 may be programmed to binarize pixels by 1) using pixels in a block to determine a set of one or more thresholds that depend on the embodiment, 2) comparing each pixel in the block with this set of thresholds, and 3) subsequently setting the binarized value to 1 or 0 based on results of the comparison.
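The simpler, mean-threshold example can be sketched as follows, assuming the block's intensities are supplied as a non-empty list of rows of numbers; the function name is hypothetical.

```python
def binarize_by_mean(intensities):
    """Mark each pixel 1 if its intensity exceeds the block mean, else 0."""
    flat = [v for row in intensities for v in row]
    mean = sum(flat) / len(flat)  # the single threshold used in this sketch
    return [[1 if v > mean else 0 for v in row] for row in intensities]
```

Other embodiments may derive more than one threshold from the block, in which case the comparison step changes but the three-step structure stays the same.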
On completion of acts 314 and 315, control returns to act 312 to select a next pixel for binarizing, and the above-described acts are repeated when not all pixels in the current row have been visited (as per act 316). When act 316 finds that all pixels in a row of the block have been binarized, the number of pixels with value 1 in binary (e.g. black pixels) in each row “J” of block 302 is counted (as per act 317 in
Instead, after projection count N[J] is computed for all rows of a block 302 to form the histogram, the looping ends, and control transfers to operation 320 that computes attributes at the level of blocks, e.g. in acts 321 and/or 322. In act 321, mobile device 200 identifies a row Hp that contains a maximum value Np of all projection counts N[0]-N[450] in block 302, i.e. the value of peak 308 in graph 310 in the form of a histogram of counts of black pixels (alternatively counts of white pixels). At this stage, a row Hp (e.g. counted in bin 130 in
Thereafter, mobile device 200 checks (in act 331) whether the just-computed values Nm and Np satisfy a preset criterion on intensity of a peak 308. An example of such a peak-intensity preset criterion is Nm/Np≧1.75, and if not then the block 302 is marked as “pixel-line-absent” in act 332 and if so then block 302 may be marked as “pixel-line-present” in act 334 (e.g. in a location in memory 1012 shown in
In some embodiments, the additional preset criterion is on a location of peak 308 relative to a span of block 302 in a direction perpendicular to the direction of projection, e.g. relative to height of block 302. Specifically, a peak-location preset criterion may check where a row Hp (containing peak 308) occurs relative to height H of the text in block 302. For example, such peak-location preset criterion may be satisfied when Hp/H≦r wherein r is a predetermined constant, such as 0.3 or 0.4. Accordingly, presence of a line of pixels is tested in some embodiments within a predetermined range, such as 30% from an upper end of a block.
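The peak-location preset criterion can be illustrated with a short sketch. It assumes a binarized block stored as a list of rows of 0/1 values, with row 0 at the upper end of the block and H taken as the number of rows; these representation choices, and the function names, are assumptions made only for illustration.

```python
def row_projection_counts(binary_block):
    """Return the count N[J] of pixels of value 1 in each row J."""
    return [sum(row) for row in binary_block]


def peak_row(counts):
    """Return the index Hp of the first row holding the maximum count."""
    return max(range(len(counts)), key=counts.__getitem__)


def peak_location_ok(binary_block, r=0.3):
    """Check Hp/H <= r, with H assumed to be the number of rows in the block."""
    counts = row_projection_counts(binary_block)
    hp = peak_row(counts)
    h = len(binary_block)
    return hp / h <= r


word = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # a line of pixels near the upper end of the block
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
```

With this sample, the peak falls in row 1 of 6, so the ratio Hp/H is about 0.17 and the criterion is met for r=0.3; the same block flipped upside down places the peak in row 4 and fails the criterion.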
When one or more such preset criteria are satisfied in act 334, mobile device 200 then marks the block as “pixel-line-present” and otherwise goes to act 332 to mark the block as “pixel-line-absent.” Although illustrative preset criteria have been described, other such criteria may be used in other embodiments of act 334. Moreover, although certain values have been described for two preset criteria, other values and/or other preset criteria may be used in other embodiments.
Note that a 0.33 value in the peak-location preset criterion described above results in act 334 testing for presence of a peak in the upper one-third region of a block, wherein a pixel line called a header line (also called shiro-rekha) is normally present in Hindi language text written in the Devanagari script. However, as will be readily apparent in view of this disclosure, specific preset criteria used in act 334 may be different, e.g. depending on the language and script of text to be detected in an image.
Specifically, in some embodiments, blocks of connected components that contain pixels of text in Arabic are marked as “pixel-line-present” or “pixel-line-absent” in the manner described herein, after applying the following two preset criteria. A first preset criterion for Arabic is the same as the above-described peak-intensity preset criterion for Devanagari (namely Nm/Np≧1.75). A second preset criterion for Arabic is a modified form of Devanagari's peak-location preset criterion described above.
For example, the peak-location preset criterion for Arabic may be 0.4≦Hp/H≦0.6, to test for presence of a peak in a middle 20% region of a block, based on profiles for Arabic text shown and described in an article entitled “Techniques for Language Identification for Hybrid Arabic-English Document Images” by Ahmed M. Elgammal and Mohamed A. Ismail, believed to be published 2001 in Proc. of IEEE 6th International Conference on Document Analysis and Recognition, pages 1100-1104, which is incorporated by reference herein in its entirety. Note that although certain criteria are described for Arabic and English (see next paragraph), other similar criteria may be used for text in other languages wherein a horizontal line is used to interconnect letters of a word, e.g. text in the language Bengali (or Bangla).
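One way to picture script-dependent peak-location criteria is as a lookup of an allowed band for the ratio Hp/H. The sketch below is an assumption-laden illustration: the table name, the script keys, and the use of 0.0 as a lower bound for Devanagari are hypothetical, while the values 0.33, 0.4 and 0.6 are the example values given above.

```python
# Hypothetical table of allowed bands (low, high) for the ratio Hp/H.
PEAK_LOCATION_BANDS = {
    "devanagari": (0.0, 0.33),   # peak expected in the upper one-third
    "arabic": (0.4, 0.6),        # peak expected in the middle 20%
}


def peak_location_satisfied(hp, h, script):
    """Return True when Hp/H lies inside the band configured for `script`."""
    low, high = PEAK_LOCATION_BANDS[script]
    return low <= hp / h <= high
```

For example, a peak found in row 10 of a block 100 rows high satisfies the Devanagari band but not the Arabic band, and a peak in row 50 does the opposite. Other scripts could be supported in such a sketch by adding entries to the table.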
Furthermore, other embodiments may test for presence of two peaks (e.g. as shown in
Accordingly, while various examples described herein use Devanagari to illustrate certain concepts, those of skill in the art will appreciate that these concepts may be applied to languages or scripts other than Devanagari. For example, embodiments described herein may be used to identify characters in Korean, Chinese, Japanese, Greek, Hebrew and/or other languages.
After marking a block in one of acts 332 and 334, processor 1013 of some embodiments checks (in act 336 in
Image 401 is processed by performing a method of some embodiments as described above, and as illustrated in
More specifically, in act 212, a block 402 (also called “first block”) is identified in the example of in the image 401 with a first plurality of pixels (identified by a first set of positions) that are contiguous with one another and include a first local extrema of intensity in the image 401. Also in act 212, a block 403 (also called “second block”) is identified in the example of
in the image 401 with a second plurality of pixels (identified by a second set of positions) that are contiguous with one another and include a second local extrema of intensity in the image 401. In this manner, each of blocks 402-405 illustrated in
In several of the described embodiments, blocks 402-405 are thereafter processed for pixel line presence detection, as described above in reference to operation 220 (
Next, image 401 has the polarity (or intensity) of its pixels reversed (as would be readily apparent to a skilled artisan, so that white pixels are changed to black and vice versa), and the reversed-polarity version of image 401 is then processed by act 212 (
Next, as per operation 230, each of blocks 402-404 (also called the pixel-line-present blocks) is checked for presence of any adjacent blocks that can be merged. Specifically, on checking the block 402 which is identified as pixel-line-present, for any adjacent blocks, a block 411 (
Clustering rules 503 to be applied in operation 230 may be pre-selected, e.g. based on external input, such as identification of Devanagari as a script in use in the real world scene wherein the image is taken. The external input may be automatically generated, e.g. based on geographic location of mobile device 200 in a region of India, e.g. by use of an in-built GPS sensor as described above. Alternatively, external input to identify the script and/or the geographic location may be received by manual entry by a user. Hence, the identification of a script in use can be done differently in different embodiments.
Based on an externally-identified script, one or more clustering rules 503 (
Merged blocks that are generated by block merging module 141B as described above may themselves be further processed in the manner described above in operation 230. For example, block 421 (also called “merged” block) is used to identify any adjacent block thereto, and block 422 is found. Then, the block 421 and block 422 are evaluated by use of clustering rule(s) in block merging module 141B, and assuming the rules are met in this example, block 421 (which is a merged block) and block 422 are merged by block merging module 141B to form block 424 (also called “merged” block) in
One or more predetermined rules (“clustering rules”) 503 are used in some embodiments of operation 230 (described above, also called “merger” operation) by block merging module 141B to decide whether or not to merge a block that is known to have a pixel line present therein (such as block 403) in an image, with one or more blocks adjacent to it (such as block 405), by performance of a method illustrated in
Although a specific order of operations is illustrated in
In some embodiments, when any one rule is satisfied, in a corresponding one of the operations 510, 520 and 530, then operation 230 (also called “merger” operation) is performed, regardless of whether or not the blocks have text therein. On completion of operation 230, in some embodiments the merged block is itself marked as pixel-line-present by block merging module 141B, and therefore eligible for selection as the first block in act 501 (followed by act 502 in which an adjacent block is selected as the second block). In several embodiments, operation 230 is performed prior to classification and therefore it is not known to processor 1013 at the time of operations 510, 520 and 530, whether the blocks that are being merged have pixels that represent text or non-text in the image. When no rule is found to be satisfied in any of operations 510, 520 and 530, act 541 is performed to check if all blocks adjacent to the pixel-line-present block (i.e. first block) have been checked, and if not control returns to act 502 and another block that is adjacent to the first block is then selected as the second block.
In act 541, when all blocks that are adjacent to the pixel-line-present block (i.e. first block) have been checked, control transfers to act 542 to check if all pixel-line-present blocks have been checked in the just-described manner and if not, control transfers to act 501 to select another pixel-line-present block as the first block. When all pixel-line-present blocks have been checked (for merger with their respective adjacent blocks), then control transfers to operation 240 (also called “verification” operation) to continue with further processing, such as geometric rectification of scale and/or tilt, followed by binarization of merged blocks, which is then followed by pixel-line-presence test on merged block(s). Operation 240 is followed by operation 250 wherein classification of merged blocks (as well as unmerged blocks) as text or non-text, is performed (as described above), which is then followed by optical character recognition.
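The control flow of acts 501, 502, 541 and 542 around the rule-checking operations can be sketched as a loop. This is a simplified illustration and not the described implementation: the Block class, the representation of a merged block as the bounding box of its two constituents, and the caller-supplied is_adjacent and rules callables are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x1: float  # left
    y1: float  # top
    x2: float  # right
    y2: float  # bottom
    pixel_line_present: bool = False


def merge(a, b):
    """Assumed merge: bounding box of both blocks, marked pixel-line-present."""
    return Block(min(a.x1, b.x1), min(a.y1, b.y1),
                 max(a.x2, b.x2), max(a.y2, b.y2), True)


def merge_blocks(blocks, rules, is_adjacent):
    """Repeat until no pixel-line-present block can merge with a neighbor."""
    blocks = list(blocks)
    merged_any = True
    while merged_any:
        merged_any = False
        for first in blocks:                       # act 501: pick a first block
            if not first.pixel_line_present:
                continue
            for second in blocks:                  # act 502: pick a second block
                if second is first or not is_adjacent(first, second):
                    continue
                # Any one satisfied rule is enough to trigger the merger.
                if any(rule(first, second) for rule in rules):
                    blocks = [b for b in blocks
                              if b is not first and b is not second]
                    blocks.append(merge(first, second))
                    merged_any = True
                    break
            if merged_any:
                break                              # restart with the merged block
    return blocks


a = Block(0, 0, 10, 10, pixel_line_present=True)
b = Block(12, 0, 20, 10)
c = Block(100, 0, 110, 10)
near = lambda p, q: max(p.x1, q.x1) - min(p.x2, q.x2) <= 5   # hypothetical adjacency
result = merge_blocks([a, b, c], rules=[lambda p, q: True], is_adjacent=near)
```

Because every merger reduces the number of blocks by one, the loop terminates. Restarting the scan after each merger is one simple way to make a merged block eligible for selection as the first block; other traversal orders are equally possible.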
Operations 510, 520 and 530 of some embodiments check for overlap between projections (see projection overlap rules 503P in
As noted above, at this stage, during performance of operations 510, 520 and 530 prior to operation 240, it is not known to processor 1013 whether or not the blocks include text or non-text regions of the image. Applying clustering rules 503 to blocks that happen to be adjacent, one of which has a pixel line present, but neither of which has yet been classified as text/non-text, enables processor 1013 to generate merged blocks on which verification is performed, followed by classification and OCR which is found to be more successful than in the prior art, as described below.
In a first example of applying a clustering rule (e.g. projection overlap rule 503P in
Hence, processor 1013 is programmed to use such clustering rules 503 that are predetermined, as described more completely below, to select two blocks to merge in block merging module 141B, when the two blocks do not overlap one another, regardless of whether the blocks contain text or non-text. Merger of two or more non-overlapping blocks by block merging module 141B, when the predetermined rules are met as just described, results in a merged block on completion of operation 230 (
Specifically, in one illustrative example, operation 510 includes acts 611-617, described next. In act 611 (
A 100% horizontal projection overlap condition is tested by block merging module 141B in one example of act 611 by use of x-coordinates x1 and x2 of the bottom left and bottom right corners of block 621 that is marked pixel-line-present (which identify the horizontal projection of block 621), and x-coordinates x3 and x4 of the bottom left and bottom right corners of block 622 that is adjacent (which identify the horizontal projection of block 622). Specifically, the test checks whether the x-coordinates of the corners of the two blocks meet the following condition: x1<x3<x4<x2. The just-described condition on overlap of projections is based on geometric attributes of the two blocks subject to the test, namely two specified coordinates (on a coordinate axis, e.g. x-axis) of two specified corners of one block with two specified coordinates of two specified corners of the other block (on the same coordinate axis).
The just-described horizontal projection overlap condition of 100% can be satisfied in some situations (as illustrated in
A left-partial horizontal projection overlap condition is tested by block merging module 141B in one example of act 611 when x3<x1<x4<x2, and the ratio (x4−x1)/(x4−x3) is greater than a predetermined fraction, e.g. 0.5. A right-partial horizontal projection overlap condition is tested by block merging module 141B in another example of act 611, when x1<x3<x2<x4, and the ratio (x2−x3)/(x4−x3) is greater than a predetermined fraction, e.g. also 0.5. The just-described two conditions are also based on geometric attributes of the two blocks, as noted above.
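The three horizontal projection overlap conditions described for act 611 can be sketched as a single classification function. The function name, the returned labels and the default fraction are illustrative assumptions; x1 and x2 denote the left and right x-coordinates of the pixel-line-present block and x3 and x4 those of the adjacent block, as in the description above.

```python
def horizontal_overlap_kind(x1, x2, x3, x4, min_fraction=0.5):
    """Classify the horizontal projection overlap of two blocks.

    Returns "full", "left-partial", "right-partial", or None when none of
    the three described conditions is met.
    """
    if x1 < x3 < x4 < x2:
        return "full"                      # 100% overlap: adjacent block inside
    if x3 < x1 < x4 < x2:
        # Fraction of the adjacent block's width overlapping on the left side.
        if (x4 - x1) / (x4 - x3) > min_fraction:
            return "left-partial"
    if x1 < x3 < x2 < x4:
        # Fraction of the adjacent block's width overlapping on the right side.
        if (x2 - x3) / (x4 - x3) > min_fraction:
            return "right-partial"
    return None
```

For instance, with x1=0 and x2=100, an adjacent block spanning 20 to 60 yields "full", one spanning -10 to 30 yields "left-partial" (ratio 0.75), and one spanning -40 to 10 yields None (ratio 0.2).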
In act 612 (
A 0% vertical projection overlap condition is tested by block merging module 141B in one example of act 612 (see
Alternatively, if the above-described conditions are not met in act 612 mobile device 200 may check a similar condition under the assumption that the second block is located in the image below the first block, by block merging module 141B using the bottom left y-coordinate yl of block 621 (also called “first” block,
Another predetermined test in such a clustering rule, e.g. aspect ratio rule 503A (
Yet another predetermined test in a clustering rule, such as a relative heights rule 503R (
Still another first type of clustering rule, such as spacing rule 503S may cause block merging module 141B to check, as per act 615 performed by mobile device 200 of some embodiments, that the location of the adjacent (second) block (e.g. block 622 in
When a result of act 615 is that the adjacent (second) block is located above the pixel-line-present (first) block, mobile device 200 performs act 617 and else performs act 616. In act 617, block merging module 141B in mobile device 200 checks if the distance between the adjacent (second) block and the pixel-line-present (first) block is less than Thresh5*Word Height, wherein Thresh5 is an above-block limit (also called “first predetermined limit”) that is predetermined empirically, and Word Height is the height of the pixel-line-present (first) block (e.g. vertical projection 621V in
In act 616, block merging module 141B in mobile device 200 checks if the distance between the adjacent (second) block and the pixel-line-present (first) block is less than Thresh6*Word Height, wherein Thresh6 is a below-block limit (also called “second predetermined limit”) that is also predetermined empirically. The just-described condition is once again based on geometric attributes of the two blocks, namely vertical separation (or gap) below the pixel-line-present block. If the answer is yes in either of acts 616 and 617, control transfers to operation 230, and otherwise control transfers to operation 520. In some embodiments, the acts 614 and 615 may further check that the adjacent (second) block is marked as pixel-line-absent.
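The distance checks of acts 616 and 617 can be illustrated as follows. The function and parameter names are hypothetical, and no values for Thresh5 and Thresh6 are assumed, since these limits are described as being predetermined empirically; the caller supplies them.

```python
def spacing_rule_satisfied(gap, word_height, second_is_above,
                           above_limit, below_limit):
    """Check the vertical gap between two blocks against a scaled limit.

    `above_limit` plays the role of Thresh5 (act 617) and `below_limit`
    the role of Thresh6 (act 616); `word_height` is the height of the
    pixel-line-present (first) block.
    """
    limit = above_limit if second_is_above else below_limit
    return gap < limit * word_height
```

As a usage illustration with made-up limits of 0.5 above and 0.25 below, a gap of 5 next to a first block of height 20 satisfies the check when the second block is above (5 < 10) but not when it is below (5 is not less than 5).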
In operation 520, as noted above, it is not known to processor 1013 at this stage whether the blocks have text or non-text. Even so, a second clustering rule (
For example, one test in the second clustering rule, such as projection overlap rule 503P may cause block merging module 141B to check for 0% horizontal projection overlap and 95% vertical projection overlap between the two pixel-line-present blocks (e.g. see acts 711 and 712 in
In the example of
A third clustering rule (
Specifically, as illustrated in
Furthermore, another test in the third clustering rule may cause block merging module 141B to check, as per act 814 that the aspect ratio (i.e. the ratio Length/Breadth) of the second block is between 0.7 and 0.9 (denoting a half-character of smaller width than a single character) while the aspect ratio of the first (pixel-line-present) block is greater than 2 (denoting multiple characters of greater width than a single character). In the example of
Moreover, in some aspects of the described embodiments, classification of blocks into text or non-text is performed by use of a neural network in operation 250 using parameters 911-915 illustrated in
An example of another attribute of a merged block is indicative of another mean of another number of transitions in a predetermined direction (e.g. longitudinal direction), from the second binary value to the first binary value (e.g. from value 1 to value 0), in a row in a set of rows. Specifically, in some embodiments, during classification, two numbers are counted, namely white-to-black transitions and black-to-white transitions in a predetermined direction, with each number being another attribute of the merged block. Some embodiments use an attribute of the merged block that is indicative of a ratio of (A) a mean of a number of transitions in a predetermined direction (e.g. horizontal direction) from a first binary value (e.g. value 1 for a black colored pixel) to a second binary value (e.g. value 0 for a white colored pixel), in each row in a set of rows in the merged block and (B) a width of the merged block.
In one illustrative example, two numbers of transition are counted for a subset of rows in the merged block that are located at specified position(s) relative to a position of a peak in the block (e.g. at which the header line of a word of text in Hindi occurs, if present in a pixel-line-present block), as follows. In the illustrative example, a peak's position (relative to a vertical span of the block) may be used to identify rows in the block that are located below the peak by at least a predetermined distance (e.g. specified as a percentage of the block height) as belonging to the subset. In a subset of rows that are identified by use of the block's pixel line, in some embodiments, two types of transitions are averaged (namely a number of transitions from value 0 in binary to value 1 in binary and another number of transitions from value 1 in binary to value 0 in binary), and the resulting means (i.e. averages) are used as parameters 914 and 915 which are input to a neural network classifier, e.g. implemented by a processor executing the classifier software 552 used in operation 250. As noted below, a neural network classifier is just one example of the type of classifier that can be programmed to use one or more of parameters 911-915 in different aspects of the described embodiments.
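The averaging of transition counts over a subset of rows below the peak can be sketched as shown next. The sketch assumes a binarized block stored as a list of rows of 0/1 values with row 0 at the top, and it rounds the predetermined distance up to a whole number of rows; the function names and these choices are illustrative assumptions.

```python
import math


def count_transitions(row, from_value, to_value):
    """Count adjacent pixel pairs in `row` that go from_value -> to_value."""
    return sum(1 for a, b in zip(row, row[1:])
               if a == from_value and b == to_value)


def mean_transitions_below_peak(binary_block, peak_row, min_offset_fraction):
    """Average 0->1 and 1->0 transition counts over rows below the peak.

    Rows are selected when they lie below `peak_row` by at least
    `min_offset_fraction` of the block height. Returns a pair of means,
    or (0.0, 0.0) when no row qualifies.
    """
    height = len(binary_block)
    first = peak_row + math.ceil(min_offset_fraction * height)
    rows = binary_block[first:]
    if not rows:
        return (0.0, 0.0)
    mean_01 = sum(count_transitions(r, 0, 1) for r in rows) / len(rows)
    mean_10 = sum(count_transitions(r, 1, 0) for r in rows) / len(rows)
    return (mean_01, mean_10)


glyphs = [
    [1, 1, 1, 1, 1, 1],   # row containing the peak
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
features = mean_transitions_below_peak(glyphs, peak_row=0, min_offset_fraction=0.25)
```

In this sample every selected row has two transitions of each type, so both means equal 2.0. Dividing such a mean by the block width would give a width-normalized attribute of the kind mentioned above.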
In some embodiments, the operation 220 (for pixel line presence detection) and operation 230 are performed assuming that a longitudinal direction of a connected component of text is well-aligned (e.g. within angular range of +5° and −5°) relative to the longitudinal direction of the block containing that connected component. Accordingly, in such embodiments, blocks in which the respective connected components are misaligned may not be marked as “pixel-line-present” and therefore not be merged with their adjacent “pixel-line-absent” blocks.
Accordingly, in some embodiments, skew of one or more connected components relative to blocks that contain them may be identified by performing geometric rectification (e.g. re-sizing blocks), and skew correction (of the type performed in operation 240). Specifically, an operation 270 to detect and correct skew is performed in some embodiments as illustrated in
In some embodiments, processor 1013 is programmed to select blocks based on variance in stroke width and automatically detect skew of selected blocks as follows. Processor 1013 checks whether at a candidate angle, one or more attributes of projection profiles meet at least one test for presence of a straight line of pixels, e.g. test for presence of straight line 304 (
Classification of the type described herein in operation 250 may be implemented using machine learning methods (e.g. neural networks) as described in a webpage at http://en.wikipedia.org/wiki/Machine_learning. Other methods of classification in operation 250 that can also be used are described in, for example, the following, each of which is incorporated by reference herein in its entirety:
Several operations and acts of the type described herein are implemented by a processor 1013 (
Also, mobile device 200 may additionally include a graphics engine 1004, an image processor 1005, and a position processor. In addition to memory 1012, mobile device 200 may include one or more other types of memory such as flash memory (or SD card) 1008 and/or a hard disk and/or an optical disk (also called “secondary memory”) to store data and/or software for loading into memory 1012 (also called “main memory”) and/or for use by processor(s) 1013.
Mobile device 200 may further include a circuit 1010 (e.g. with wireless transmitter and receiver circuitry therein) and/or any other communication interfaces 1009. A transmitter in circuit 1010 may be an IR or RF transmitter, or a wireless transmitter enabled to transmit one or more signals over one or more types of wireless communication networks such as the Internet, WiFi, cellular wireless network or other network.
It should be understood that mobile device 200 may be any portable electronic device such as a cellular or other wireless communication device, personal communication system (PCS) device, personal navigation device (PND), Personal Information Manager (PIM), Personal Digital Assistant (PDA), laptop, camera, smartphone, tablet (such as iPad available from Apple Inc) or other suitable mobile platform that is capable of creating an augmented reality (AR) environment.
Note that input to mobile device 200 can be in video mode, where each frame in the video is equivalent to the image input which is used to identify connected components, and to compute a skew metric as described herein. Also, the image used to compute a skew metric as described herein can be fetched from a pre-stored file in a memory 1012 of mobile device 200.
A mobile device 200 of the type described above may include an optical character recognition (OCR) system as well as software that uses “computer vision” techniques. The mobile device 200 may further include, in a user interface, a microphone and a speaker (not labeled) in addition to touch-sensitive screen 1001 or normal screen 1002 for displaying captured images and any text/graphics to augment the images. Of course, mobile device 200 may include other elements unrelated to the present disclosure, such as a read-only-memory 1007 which may be used to store firmware for use by processor 1013.
Mobile device 200 of some embodiments includes, in memory 1012 (
Furthermore, a pixel line presence tester 141T (
Moreover, a pixel line presence marker 141M (
Furthermore, an adjacent block identifier 141A (
In some embodiments, software 141 may include a classifier software 552 that when executed by processor 1013 classifies unmerged blocks and/or merged blocks as text or non-text (after binarization based on pixel values in image 107 to identify connected components therein), and any block classified as text is supplied to OCR software 551.
Although various aspects are illustrated in connection with specific embodiments for instructional purposes, the described embodiments are not limited thereto. For example, although mobile device 200 shown in
As noted above, in some embodiments, when a limit on time spent in processing an image as per the method of
Moreover, in certain embodiments, processor 1013 may check for presence of a line of pixels oriented differently (e.g. located in a column in the block) depending on the characteristics of the language of text that may be included in the image.
Although a test for pixels arranged in a straight line has been described in some embodiments, as will be readily apparent in view of this detailed description, such a line need not be straight in other embodiments (e.g. a portion of the line inside a block may be wavy, or form an arc of a circle or ellipse).
Note that input to mobile device 200 can be in video mode, where each frame in the video is equivalent to the image input which is used to identify blocks of connected components and to check for overlap as described herein. Also, the image used to compute a skew metric as described herein can be fetched from a pre-stored file in a memory 1012 of mobile device 200.
Depending on the embodiment, various functions of the type described herein may be implemented in software (executed by one or more processors or processor cores) or in dedicated hardware circuitry or in firmware, or in any combination thereof. Accordingly, depending on the embodiment, any one or more of pixel line presence tester 141T, pixel line presence marker 141M, adjacent block identifier 141A, block merging module 141B (including computer instructions to implement the clustering rules 503, such as projection overlap rules 503P, relative heights rules 503R, aspect ratio rules 503A and spacing rules 503S), storage module 141S and verification module 141V illustrated in
Accordingly, in some embodiments, block merging module 141B (including computer instructions to implement the clustering rules 503) implements means for checking whether a first block and a second block that are adjacent to one another and do not overlap are such that a first projection of the first block on a straight line and a second projection of the second block on the straight line satisfy a test for overlap. Moreover, block merging module 141B of several such embodiments additionally implements means for merging the first block and the second block to obtain a merged block, based at least on an outcome of the test for overlap. In certain embodiments, storage module 141S implements means for storing in at least one memory, information related to the merged block, which information is received from block merging module 141B.
Hence, methodologies described herein may be implemented by various means depending upon the application. For example, these methodologies may be implemented in firmware in read-only-memory 1007 (
Any machine-readable medium tangibly embodying computer instructions may be used in implementing the methodologies described herein. For example, software 141 (
One or more non-transitory computer readable media include physical computer storage media. A computer readable medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, non-transitory computer readable storage media can comprise RAM, ROM, Flash Memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store program code in the form of software instructions (also called “processor instructions” or “computer instructions”) or data structures and that can be accessed by a computer; disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of one or more non-transitory computer readable storage media.
Although certain aspects are illustrated in connection with specific embodiments for instructional purposes, the described embodiments are not limited thereto. Hence, although mobile device 200 shown in
Various adaptations and modifications may be made without departing from the scope of the described embodiments. Therefore, the spirit and scope of the appended claims should not be limited to the foregoing description.
This application claims priority under 35 USC §119 (e) from U.S. Provisional Application No. 61/590,966 filed on Jan. 26, 2012 and entitled “Identifying Regions Of Text To Merge In A Natural Image or Video Frame”, which is assigned to the assignee hereof and which is incorporated herein by reference in its entirety. This application claims priority under 35 USC §119 (e) from U.S. Provisional Application No. 61/590,983 filed on Jan. 26, 2012 and entitled “Detecting and Correcting Skew In Regions Of Text In Natural Images”, which is assigned to the assignee hereof and which is incorporated herein by reference in its entirety. This application claims priority under 35 USC §119 (e) from U.S. Provisional Application No. 61/590,973 filed on Jan. 26, 2012 and entitled “Rules For Merging Blocks Of Connected Components In Natural Images”, which is assigned to the assignee hereof and which is incorporated herein by reference in its entirety. This application claims priority under 35 USC §119 (e) from U.S. Provisional Application No. 61/673,703 filed on Jul. 19, 2012 and entitled “Automatic Correction of Skew In Natural Images and Video”, which is assigned to the assignee hereof and which is incorporated herein by reference in its entirety. This application is also related to U.S. application Ser. No. 13/748,562, Attorney Docket No. Q112726USos, filed concurrently herewith, entitled “Detecting and Correcting Skew In Regions Of Text In Natural Images” which is assigned to the assignee hereof and which is incorporated herein by reference in its entirety. This application is also related to U.S. application Ser. No. 13/748,539, Attorney Docket No. Q111559USos, filed concurrently herewith, entitled “Identifying Regions of Text to Merge In A Natural Image or Video Frame” which is assigned to the assignee hereof and which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
61590966 | Jan 2012 | US
61590983 | Jan 2012 | US
61590973 | Jan 2012 | US
61673703 | Jul 2012 | US