The present application claims the benefit of priority of SG application No. 201204779-1 filed on Jun. 27, 2012, the contents of which are incorporated herein by reference for all purposes.
Embodiments relate generally to text detection devices and text detection methods.
Detecting text from scene images is an important task for a number of computer vision applications. By recognizing the detected scene text, much of which is related to the names of roads, buildings, and other landmarks, users may get to know a new environment quickly. In addition, scene text may be related to certain navigation instructions that may be helpful for autonomous navigation applications such as unmanned vehicle navigation and robotic navigation in urban environments. Furthermore, semantic information may be derived from the detected scene text, which may be useful for content-based image retrieval. Thus, there may be a need for reliable and efficient text detection from scene images.
According to various embodiments, a text detection device may be provided. The text detection device may include: an image input circuit configured to receive an image; an edge property determination circuit configured to determine a plurality of edge properties for each of a plurality of scales of the image; and a text location determination circuit configured to determine a text location in the image based on the plurality of edge properties for the plurality of scales of the image.
According to various embodiments, a text detection method may be provided. The text detection method may include: receiving an image; determining a plurality of edge properties for each of a plurality of scales of the image; and determining a text location in the image based on the plurality of edge properties for the plurality of scales of the image.
In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments are described with reference to the following drawings, in which:
Embodiments described below in context of the devices are analogously valid for the respective methods, and vice versa. Furthermore, it will be understood that the embodiments described below may be combined, for example, a part of one embodiment may be combined with a part of another embodiment.
In this context, the text detection device as described in this description may include a memory, which is for example used in the processing carried out in the text detection device. A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
In an embodiment, a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A “circuit” may also be a processor executing software, e.g. any kind of computer program, e.g. a computer program using a virtual machine code such as Java. Any other kind of implementation of the respective functions, which will be described in more detail below, may also be understood as a “circuit” in accordance with an alternative embodiment.
Text may convey high-level semantics unique to humans in communication with others and the environment. Although there may be good solutions for OCR (optical character recognition) on localized text, unconstrained text detection is a unique human intelligent function, which is still very hard for machines.
According to various embodiments, an accurate scene text detection technique may be provided that may make use of image edges within a blackboard (or whiteboard) architectural model. According to various embodiments, various edge features (which may also be referred to as edge properties), for example six edge features, may first be extracted as knowledge sources from each color component image at each specific scale, each of which may capture one text-specific image/shape characteristic. The extracted edge features may then be combined into a text probability map by several integration strategies, where edges of scene text may be enhanced whereas those of non-text objects may be suppressed consistently. Finally, scene text may be located within the constructed text probability map through the incorporation of knowledge of text layout. The devices and methods according to various embodiments have been evaluated over a public benchmarking dataset and good performance has been achieved. The devices and methods according to various embodiments may be used in different applications such as human computer interaction, autonomous robot navigation and business intelligence.
According to various embodiments, devices and methods for accurate scene text detection through structural image edge analysis may be provided.
In other words, an image may be input to the text detection device. Then, for a plurality of scales of the input image, the text detection device may determine a plurality of edge properties (for example, a plurality of edge properties may be determined for a first scale of the image, and a plurality of edge properties may be determined for a second scale of the image, and so on). For each scale, the plurality of edge properties may be the same or may be different. Then, based on the plurality of edge properties for the plurality of scales, a location of a text in the image may be determined.
According to various embodiments, the plurality of edge properties may include or may be an edge gradient property and/or an edge linearity property and/or an edge openness property and/or an edge aspect ratio property and/or an edge enclosing property and/or an edge count property.
According to various embodiments, the plurality of scales may include or may be a reduced scale and/or an original scale and/or an enlarged scale.
According to various embodiments, the image input circuit 102 may be configured to receive an image including a plurality of color components. The edge property determination circuit 104 may further be configured to determine the plurality of edge properties for each of the plurality of scales of the image for the plurality of color components of the image.
According to various embodiments, the text location determination circuit 106 may further be configured to determine the text location in the image based on a knowledge of text format and layout.
The knowledge of text format and layout may include or may be: a threshold on a projection profile and/or a threshold on a ratio between text line height and image height and/or a threshold on a ratio between text line length and the maximum text line length within the same scene image and/or a threshold on a ratio between the maximum variation and the mean of the projection profile of a text line and/or a threshold on a ratio between character height and the corresponding text line height and/or a ratio between inter-character distance within a word and the corresponding text line height.
According to various embodiments, the image input circuit 102 may be configured to receive an image including a plurality of pixels. Each edge property of the plurality of edge properties may include or may be, for each pixel of the plurality of pixels, a probability of text at a position of the pixel in the image. In other words, the edge properties may define a plurality of edge feature images for each color and each scale. Combining the edge features for one color and one scale may define a feature image for the one color and the one scale.
According to various embodiments, the text location determination circuit may be configured to determine for each pixel of the plurality of pixels a probability of text at a position of the pixel in the image based on the plurality of edge properties for the plurality of scales of the image. In other words: a probability map may be determined based on the edge properties, for example based on the feature images.
According to various embodiments the edge determination circuit 112 may be configured to determine edges in the image. The edge property determination circuit 104 may be configured to determine the plurality of edge properties based on the determined edges.
According to various embodiments, the projection profile determination circuit 114 may be configured to determine a projection profile based on the plurality of edge properties.
According to various embodiments, the text location determination circuit 106 may further be configured to determine the text location in the image based on the projection profile.
According to various embodiments the plurality of edge properties may include or may be an edge gradient property and/or an edge linearity property and/or an edge openness property and/or an edge aspect ratio property and/or an edge enclosing property and/or an edge count property.
According to various embodiments, the plurality of scales may include or may be a reduced scale and/or an original scale and/or an enlarged scale.
According to various embodiments, an image including a plurality of color components may be received. The plurality of edge properties may be determined for each of the plurality of scales of the image for the plurality of color components of the image.
According to various embodiments, the text location in the image may be determined based on a knowledge of text format and layout.
The knowledge of text format and layout may include or may be: a threshold on a projection profile and/or a threshold on a ratio between text line height and image height and/or a threshold on a ratio between text line length and the maximum text line length within the same scene image and/or a threshold on a ratio between the maximum variation and the mean of the projection profile of a text line and/or a threshold on a ratio between character height and the corresponding text line height and/or a ratio between inter-character distance within a word and the corresponding text line height.
According to various embodiments, an image including a plurality of pixels may be received. Each edge property of the plurality of edge properties may for each pixel of the plurality of pixels include or be a probability of text at a position of the pixel in the image.
According to various embodiments, for each pixel of the plurality of pixels a probability of text at a position of the pixel in the image may be determined based on the plurality of edge properties for the plurality of scales of the image.
According to various embodiments, the text detection method may further include: determining edges in the image. The plurality of edge properties may be determined based on the determined edges.
According to various embodiments, the text detection method may further include: determining a projection profile based on the plurality of edge properties.
According to various embodiments, the text location in the image may be determined based on the projection profile.
Detecting text from scene images may be an important task for a number of computer vision applications. By recognizing the detected scene text many of which may be related to the names of roads, buildings, and other landmarks, as illustrated in
Commonly used scene text detection methods may be broadly classified into three categories, namely, texture-based methods, region-based methods, and stroke-based methods. Texture-based methods may classify image pixels based on different text properties such as high edge density and high intensity variation. Region-based methods may first group image pixels into regions based on specific image properties such as constant color and then classify the grouped regions into text and non-text. Stroke-based methods may make use of character strokes that usually have little stroke width variation. Though scene text detection has been studied extensively, it is still an unsolved problem due to the large variation of scene text in terms of text sizes, orientations, image contrast, scene contexts, etc. Two competitions have been held to record advances in scene text detection. The competitions are based on a benchmarking dataset that consists of 509 natural images with text. The low performance achieved (top recall at 67% and top precision at 62%) also suggests that there is still much room for improvement, especially compared with another closely related area that deals with the detection and recognition of scanned document text.
According to various embodiments, devices and methods may be provided for scene text detection technique which may make use of knowledge of text layout and several discriminative edge features. For example, the devices and methods according to various embodiments may implement a multi-scale detection architecture that may be suitable for the text detection from natural images. Furthermore, according to various embodiments, six discriminative edge features may be designed that can be integrated to differentiate edges of text and non-text objects consistently. Compared with pixel-level texture or region features, the edge features according to various embodiments may be more capable of capturing the prominent shape characteristics associated with the text. In addition, the combination of the six edge features may be more discriminative than the usage of the stroke width feature alone. The devices and methods according to various embodiments may outperform most commonly used methods and may achieve a superior detection precision and recall of 81% and 66%, respectively, for a widely used public benchmarking dataset.
According to various embodiments, devices and methods may be based on structural edge features, and image edges may be first detected. The edges may be detected by using any commonly known edge detector, for example Canny's edge detector, which may be robust to uneven illumination and capable of connecting edge pixels of the same object. The detected edges may then be pre-processed to facilitate the ensuing edge feature extraction. First, edge pixels, for example all edge pixels, may be removed if they are connected to more than two edge pixels within a 3×3 8-connectivity neighborhood window. This may break edges at the edge pixels that have more than two branches, which may be detected from noisy background or touching characters. Next, image edges may be labeled through connected component analysis and those with a small size may be removed. For example, the threshold size may be set at 20, as text edges may usually consist of more than 20 pixels.
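The pre-processing described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: it assumes the edge detector has already produced a set of (row, column) edge pixel coordinates, and the function names `break_branches`, `label_components` and `preprocess_edges` are hypothetical. Treating components of exactly 20 pixels as kept is also an assumption.

```python
from collections import deque

# Offsets of the eight neighbors inside a 3x3 window.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def break_branches(edge_pixels):
    """Drop every edge pixel that touches more than two other edge pixels."""
    pixels = set(edge_pixels)
    return {
        (r, c)
        for (r, c) in pixels
        if sum((r + dr, c + dc) in pixels for dr, dc in NEIGHBORS) <= 2
    }


def label_components(pixels):
    """Group pixels into 8-connected components with a breadth-first search."""
    remaining = set(pixels)
    components = []
    while remaining:
        seed = remaining.pop()
        queue, component = deque([seed]), [seed]
        while queue:
            r, c = queue.popleft()
            for dr, dc in NEIGHBORS:
                neighbor = (r + dr, c + dc)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    queue.append(neighbor)
                    component.append(neighbor)
        components.append(component)
    return components


def preprocess_edges(edge_pixels, min_size=20):
    """Break branching edges, then keep components with at least min_size pixels."""
    broken = break_branches(edge_pixels)
    return [c for c in label_components(broken) if len(c) >= min_size]
```

For example, a T-shaped edge is split at its junction, and only the pieces that still reach the size threshold survive.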
One or more edge features (for example six edge features) may then be derived from edges, for example from edges of each color component image at each image scale. Each derived edge feature may give the probability of whether the edge is a text edge or non-text edge which may later be integrated to build a text probability map. It will be understood that not all of the six edge features need to be present, but rather at least one of them may be present. However, any number of edge features may be present, even all six edge features, or further edge features not described below may be present.
The first (edge) feature E1, which may also be referred to as an edge gradient property, may capture the image gradient as follows:

$$E_1 = \frac{\mu(G_e)}{\sigma(G_e)}$$
where Ge may be a vector that may store the gradient of all edge pixels, μ(Ge) may denote the mean of Ge, and σ(Ge) may denote the standard deviation of Ge. Compared with non-text edges, text edges may often have a larger value of E1, because text edges may usually have higher but more consistent image gradient (and hence a larger numerator and a smaller denominator in E1).
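As a rough illustration of the edge gradient property, the sketch below assumes that E1 is the ratio of the mean to the standard deviation of the edge gradients, which is what the numerator and denominator remark suggests. The function name `edge_gradient_feature`, the use of the population standard deviation, and the handling of a zero deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def edge_gradient_feature(gradients):
    """Assumed form of E1: mean gradient divided by its standard deviation."""
    mu = mean(gradients)
    sigma = pstdev(gradients)  # population standard deviation (an assumption)
    if sigma == 0:
        # Perfectly uniform gradient; treated here as maximally text-like.
        return float("inf")
    return mu / sigma


# Gradient magnitudes along two hypothetical edges with the same mean.
text_like = [10, 10, 12, 12]   # strong and consistent
clutter_like = [2, 20, 5, 17]  # same mean, far less consistent
print(edge_gradient_feature(text_like) > edge_gradient_feature(clutter_like))
```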
The second (edge) feature E2, which may also be referred to as an edge linearity property, may capture the edge linearity that may be estimated by the distance between an edge pixel and its counterpart. For each edge pixel E(xi, yi) of an edge E, its counterpart pixel E(x′i, y′i) may be detected by the nearest intersection between E and a straight line L that passes through E(xi, yi) and has the same orientation as that of the image gradient at E(xi, yi). It should be noted that E(x′i, y′i) may be determined by the nearest intersection to E(xi, yi) as more than one intersection may be detected between E and L. The second feature is defined as follows:
where H(d) may be the histogram of the distance d between an edge pixel and its counterpart. The H(d) of an edge is determined as follows. For each edge pixel p, a straight line l is determined that passes through p along the orientation of the image gradient at p. The distance between p and the first edge pixel probed by l in either direction, if one exists, is counted as one stroke width candidate and used to update H(d). The H(d) of the edge is complete once all edge pixels have been examined in this way. Max(H(d)) may return the peak frequency of d and argmax(H(d)) may return the d with the peak frequency. Ew may denote the width of the edge, and Eh may denote the height of the edge. Compared with non-text edges, text edges may usually have a much larger value of E2, due to the small variation of the character stroke width and a small ratio between the stroke width and the edge size.
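The construction of H(d) may be sketched as below. The sketch only builds the histogram and reads off its peak; it does not reproduce the expression for E2 itself. It assumes that each edge pixel is given as an (x, y) tuple with a known gradient direction, that probing proceeds in unit steps with rounding to the pixel grid, and that distances are rounded to integers. The names `probe`, `stroke_width_histogram` and `max_steps` are hypothetical.

```python
import math
from collections import Counter


def probe(pixel, direction, edge, max_steps):
    """Walk from pixel along direction; return the distance to the first other
    edge pixel that is hit, or None if nothing is hit within max_steps."""
    x, y = pixel
    dx, dy = direction
    norm = math.hypot(dx, dy)
    if norm == 0:
        return None
    dx, dy = dx / norm, dy / norm
    for step in range(1, max_steps + 1):
        q = (round(x + step * dx), round(y + step * dy))
        if q != pixel and q in edge:
            return round(math.hypot(q[0] - x, q[1] - y))
    return None


def stroke_width_histogram(edge, gradients, max_steps=50):
    """Build H(d) from one stroke width candidate per edge pixel, if any."""
    hist = Counter()
    for p in edge:
        gx, gy = gradients[p]
        hits = [probe(p, (gx, gy), edge, max_steps),
                probe(p, (-gx, -gy), edge, max_steps)]
        hits = [d for d in hits if d is not None]
        if hits:
            hist[min(hits)] += 1  # nearest counterpart in either direction
    return hist


# Two parallel vertical strokes five pixels apart, gradient along x.
edge = {(x, y) for x in (0, 5) for y in range(10)}
gradients = {p: (1, 0) for p in edge}
hist = stroke_width_histogram(edge, gradients)
d_peak, peak_frequency = hist.most_common(1)[0]  # argmax and Max of H(d)
```

On curved or diagonal edges a unit-step probe can land on an adjacent pixel of the same contour, so a practical implementation would likely need a more careful line traversal.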
The third (edge) feature E3, which may also be referred to as an edge openness property, may capture the edge openness. As described above, each edge may have a pair of ending pixels if it is not closed and otherwise zero (for example zero ending pixels) after the edge breaking. The edge openness may be evaluated based on the Euclidean distance between the ending pixels of an edge component at (x1, y1) and (x2, y2) as follows:
where MXL may denote the major axis length of the edge component (for normalization). Compared with non-text edges, text edges may usually have a larger value of E3 as text edges may often be closed or their ending pixels are close.
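One form that is consistent with this description is sketched below. The exact expression for E3 is not reproduced here, so the sketch assumes that the endpoint distance, normalized by MXL, is subtracted from one, and that a closed edge receives the maximum value. The function name `edge_openness_feature`, the use of `None` for a closed edge and the clamping at zero are all assumptions.

```python
import math


def edge_openness_feature(endpoints, major_axis_length):
    """Assumed form of E3: 1 - (distance between ending pixels) / MXL.

    endpoints is ((x1, y1), (x2, y2)) for an open edge, or None for a
    closed edge that has no ending pixels.
    """
    if endpoints is None:
        return 1.0
    (x1, y1), (x2, y2) = endpoints
    gap = math.hypot(x1 - x2, y1 - y2)
    return max(0.0, 1.0 - gap / major_axis_length)


print(edge_openness_feature(None, 10.0))              # closed edge
print(edge_openness_feature(((0, 0), (3, 4)), 10.0))  # endpoints 5 pixels apart
```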
The fourth (edge) feature E4, which may also be referred to as an edge aspect ratio property, may be defined by the edge aspect ratio. As scene text may be captured in arbitrary orientations, E4 may be defined by the ratio between the minor axis length and major axis length of the image edge as follows:

$$E_4 = \frac{MNL}{MXL}$$
where MXL may denote the major axis length of the edge, and MNL may denote the minor axis length of the edge. Compared with non-text edges, text edges may usually have a larger value of E4 because their MNL and MXL may usually be close to each other.
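A short sketch of the aspect ratio property follows. How the axis lengths are measured is not specified above, so the sketch assumes the common convention of taking the axes of the ellipse that has the same second moments as the edge pixels; the names `axis_lengths` and `edge_aspect_ratio_feature` are hypothetical.

```python
import math


def axis_lengths(pixels):
    """Major and minor axis lengths of the ellipse with the same second
    moments as the pixel set (an assumed way of measuring MXL and MNL)."""
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    # Eigenvalues of the 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    return 4 * math.sqrt(lam_major), 4 * math.sqrt(lam_minor)


def edge_aspect_ratio_feature(pixels):
    """E4 as the ratio of minor to major axis length."""
    mxl, mnl = axis_lengths(pixels)
    return mnl / mxl if mxl > 0 else 0.0


square = [(x, y) for x in range(11) for y in range(11)
          if x in (0, 10) or y in (0, 10)]
straight = [(i, 0) for i in range(20)]
```

Because the ratio is taken between the two axes rather than between width and height, the value does not change when the edge is rotated.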
The fifth (edge) feature E5, which may also be referred to as an edge enclosing property, may capture the edge enclosing property that each text component usually does not enclose too many other isolated text components. It may be defined as follows:
where t may denote the number of the edge components enclosed by the edge component under study. T may be a number threshold that may for example be set at 4 (as each text edge may seldom enclose more than 4 other text edges).
The sixth (edge) feature E6, which may also be referred to as an edge count property, may be based on the observation that each character may usually have more than one stroke (and hence two edge counts) in either horizontal or vertical direction. E6 may be evaluated based on the number of rows and columns of the edge that have more than two edge counts as follows:
where the function f(cn) may be defined as follows:
where cni may denote edge counts of the i-th edge row, and cnj may denote edge counts of the j-th edge column. The edge count along one edge row (or edge column) is the number of intersections between the edge pixels and a horizontal (or vertical) scan line along that edge row (or edge column). Note that only one intersection is counted when multiple connected and continuous horizontal (or vertical) edge pixels intersect with the horizontal (or vertical) scan line. Compared with non-text edges, text edges may often have a larger value of E6 as they usually have a larger number of edge counts.
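The edge count itself can be illustrated as follows. The sketch assumes the edge is given as a binary mask over its bounding box (a list of rows of 0/1 values) and only counts the rows and columns that have more than two edge counts; the normalization used in E6 and the exact form of f(cn) are not reproduced. The function names are hypothetical.

```python
def edge_count(line):
    """Number of intersections along one scan line: each run of consecutive
    edge pixels is counted once."""
    count, previous = 0, 0
    for value in line:
        if value and not previous:
            count += 1
        previous = value
    return count


def rows_and_columns_with_many_counts(mask, min_counts=2):
    """Count rows and columns whose edge count exceeds min_counts."""
    rows = sum(edge_count(row) > min_counts for row in mask)
    cols = sum(edge_count(col) > min_counts for col in zip(*mask))
    return rows, cols


mask = [
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
]
print(edge_count([0, 1, 1, 0, 1, 0, 1, 1, 1]))
print(rows_and_columns_with_many_counts(mask))
```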
Several integration strategies may be implemented to combine the derived (edge) features into a text probability map. Instead of using edge features from the grayscale image, edge features from three color component images may be combined, i.e., ER1, . . . , ER6 (representing the six features related to the red color component), EG1, . . . , EG6 (representing the six features related to the green color component), and EB1, . . . , EB6 (representing the six features related to the blue color component), so as to obtain a feature image for each scale and each color as illustrated in
As each edge feature may give the probability of being text edges, a feature image may first be determined through the multiplication of the six edge features from each color component image at one specific image scale as follows:
$$F_{i,j} = \prod_{k=1}^{6} E_{i,j,k} \qquad (7)$$
where Ei,j,k, i=1, . . . 6, j=1, . . . , 3, k=1, . . . , 6 may denote the k-th edge feature that is derived from edges of the j-th color component image at the i-th image scale. For each color scene image at one specific image scale, three feature images, i.e., FR (for red), FG (for green), and FB (for blue) as illustrated in
Once the feature image is determined, each edge may further be smoothed by its neighboring edges that are detected based on knowledge of text layout. For example, for each edge E, its neighboring edges En may be detected based on three layout criteria: 1) the centroid distances between E and En in both horizontal and vertical directions are smaller than half of the sum of their major axis lengths; 2) the centroid of E/En must be higher/lower than the lowest/highest pixel of En/E in both horizontal and vertical directions; 3) the width/height ratio of E and En should lie within a certain range (for example [1/8, 8]). Once En is determined, the value of E may be replaced by the maximum value of En if it is larger than the maximum value of En, and may otherwise remain unchanged. The smoothing may help to suppress isolated non-text edges that have a high feature value. It may have little effect on edges of scene text, as characters often appear close to each other and their edges usually have a high probability value.
For example, finally, the feature images of different color component images at different scales may be integrated into a text probability map by max-pooling and averaging as follows:
where S may denote the number of image scales and Fi,j may be the feature image in Equation (7). As Equation (8) shows, the three feature images at each image scale may first be combined through max-pooling denoted by fMAX( ) that may return the maximum of the three feature images at each edge pixel. The max-pooling may ensure that the edge features that best capture the text-specific shape characteristics may be preserved. In addition, an averaging may be implemented to make sure that the edge features with a prominent feature value at different scales can be preserved as well.
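The two integration steps, the per-color product and the max-pooling followed by averaging over scales, may be sketched as below. The sketch assumes that every feature map is a two-dimensional list of probabilities and that the maps of all scales have already been resampled to a common size, a step that is not shown. The function names `feature_image` and `text_probability_map` are hypothetical.

```python
from math import prod


def feature_image(edge_features):
    """Pixel-wise product of the edge feature maps of one color component
    at one scale, as in Equation (7)."""
    return [[prod(values) for values in zip(*rows)]
            for rows in zip(*edge_features)]


def text_probability_map(feature_images):
    """feature_images[i][j] is the feature image of scale i and color j.

    The color feature images of each scale are max-pooled pixel by pixel,
    and the pooled images are then averaged over the S scales.
    """
    num_scales = len(feature_images)
    pooled = [
        [[max(values) for values in zip(*rows)] for rows in zip(*colors)]
        for colors in feature_images
    ]
    return [[sum(values) / num_scales for values in zip(*rows)]
            for rows in zip(*pooled)]


# Two scales, three colors, 1x2 maps with made-up values.
scale_1 = [[[0.2, 0.4]], [[0.6, 0.1]], [[0.3, 0.3]]]
scale_2 = [[[0.0, 0.8]], [[0.2, 0.2]], [[0.1, 0.1]]]
probability = text_probability_map([scale_1, scale_2])
```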
With the determined text probability map, scene text may be located based on a set of predefined text layout rules including:
To integrate knowledge of text layout, multiple projection profiles P at a step-angle of 1 degree are first determined. The orientation of text lines may be determined by the projection profile P1 with the maximum variance as specified in Rule 1. Multiple text line candidates are then determined by sections within P1 whose values are larger than the mean of P1. The projection profile of an image is an array that stores the accumulated image value along one specific direction. Take the projection profile along the horizontal direction as an example. The projection profile will be an array (whose element number is equal to the image height) where each array element stores the accumulated image value along one image row.
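For the horizontal direction, the projection profile and the extraction of text line candidates may be sketched as follows. The sketch assumes the text probability map is a two-dimensional list of values and leaves out the search over rotation angles; the function names are hypothetical.

```python
def horizontal_projection_profile(prob_map):
    """One accumulated value per image row."""
    return [sum(row) for row in prob_map]


def candidate_sections(profile):
    """Index ranges (start, end), end exclusive, where the profile is
    larger than its mean."""
    mean = sum(profile) / len(profile)
    sections, start = [], None
    for i, value in enumerate(profile):
        if value > mean and start is None:
            start = i
        elif value <= mean and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(profile)))
    return sections


prob_map = [
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
profile = horizontal_projection_profile(prob_map)
print(profile, candidate_sections(profile))
```

Each returned section is a candidate text line whose height is the section length.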
The true text lines may then further be identified based on Rules 2, 3, and 4. First, sections with an ultra-small length may be removed with a ratio threshold of 1/200, as text line height is much larger than 1/200 of image height. Next, sections with an ultra-small section mean may be removed with a ratio threshold of 1/20, as text line length is much larger than 1/20 of the maximum text line length. Last, sections with no sharp variation may be removed with a threshold of 1/10, as the maximum variation for a text line is much larger than 1/10 of the mean of the corresponding candidate section.
The detected text lines may then be binarized to locate words. The threshold for each pixel within the detected text lines may be estimated by the larger between a global threshold T1 and a local threshold T2(x, y) that may be estimated as follows:
where T1 may be the mean of all edge pixels with a positive value that usually lies between the probability values of text and non-text edges. It may be used to exclude most non-text edges within the detected text lines. T2(x, y) may be estimated, for example by Niblack's adaptive thresholding method within a neighborhood window.
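The combination of the two thresholds may be sketched as below. Niblack's method sets the local threshold to the local mean plus k times the local standard deviation; the window size of 3 and k = -0.2 used here are assumptions, as are the function names. The input is assumed to be the part of the text probability map that lies inside one detected text line.

```python
from statistics import mean, pstdev


def niblack_threshold(img, x, y, window=3, k=-0.2):
    """Local threshold T2(x, y): local mean + k * local standard deviation."""
    half = window // 2
    values = [
        img[r][c]
        for r in range(max(0, y - half), min(len(img), y + half + 1))
        for c in range(max(0, x - half), min(len(img[0]), x + half + 1))
    ]
    return mean(values) + k * pstdev(values)


def binarize(img, window=3, k=-0.2):
    """Mark a pixel as text when it exceeds the larger of T1 and T2(x, y)."""
    positives = [v for row in img for v in row if v > 0]
    t1 = mean(positives) if positives else 0.0  # global threshold T1
    return [
        [1 if img[y][x] > max(t1, niblack_threshold(img, x, y, window, k)) else 0
         for x in range(len(img[0]))]
        for y in range(len(img))
    ]


line_region = [
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
print(binarize(line_region))
```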
Words may finally be located based on Rules 5 and 6. First, the binary edges with an extra-small height may be removed with a ratio threshold at 0.4 because character height is usually much larger than 0.4 of text line height. Next, the binary edges with an extra-small distance to their nearest neighbor may be removed with a ratio threshold at 0.2 because inter-character distance is usually smaller than 0.2 of text line height. Finally, words may be located by grouping the remaining binary edge components whose distance to the nearest neighbor is larger than 0.2 of the text line height.
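The final grouping step may be illustrated as follows. The sketch reads the rule as starting a new word wherever the gap between consecutive components exceeds 0.2 of the text line height, which is one interpretation of the description above. Components are assumed to be given by their horizontal extents (left, right) within a horizontal text line, and the function name `group_into_words` is hypothetical.

```python
def group_into_words(boxes, line_height, gap_ratio=0.2):
    """Split components into words at gaps wider than gap_ratio * line_height."""
    words = []
    for left, right in sorted(boxes):
        if words and left - words[-1][-1][1] <= gap_ratio * line_height:
            words[-1].append((left, right))   # small gap: same word
        else:
            words.append([(left, right)])     # large gap: new word
    return words


boxes = [(0, 8), (10, 18), (20, 28), (45, 53), (55, 63)]
words = group_into_words(boxes, line_height=20)
print([len(word) for word in words])
```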
The devices and methods according to various embodiments may be evaluated over a public dataset that was widely used for scene text detection benchmarking and has also been used in the two established text detection contests.
Devices and methods according to various embodiments may be used in different applications such as robotic navigation, unmanned vehicle navigation, business intelligence, surveillance, and augmented reality. For example, the devices and methods according to various embodiments may be used in detecting and recognizing numerals or numbers printed or inscribed on an article, for example, a container, a box or a card.
While the invention has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.
While the preferred embodiments of the devices and methods have been described in reference to the environment in which they were developed, they are merely illustrative of the principles of the inventions. The elements of the various embodiments may be incorporated into each of the other species to obtain the benefits of those elements in combination with such other species, and the various beneficial features may be employed in embodiments alone or in combination with each other. Other embodiments and configurations may be devised without departing from the spirit of the inventions and the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
SG201204779-1 | Jun 2012 | SG | national |