Embodiments generally relate to computer vision systems. More particularly, embodiments relate to a scene text detector for unconstrained environments.
Text in unconstrained environments may appear in a variety of places such as guide posts, product names, street numbers, etc., and may convey useful information about the environment in a person's daily life. Understanding scene text with a computer vision system may involve two important steps: scene text detection (e.g., localizing the text) and scene text recognition. Due to complex backgrounds, variations of text font, size, color, and orientation, and variations of the environment, scene text detection for computer vision systems has much room for improvement.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
Turning now to
Embodiments of each of the above processor 11, memory 12, logic 13, and other system components may be implemented in hardware, software, or any suitable combination thereof. For example, hardware implementations may include configurable logic such as, for example, programmable logic arrays (PLAs), field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), or fixed-functionality logic hardware using circuit technology such as, for example, application specific integrated circuit (ASIC), complementary metal oxide semiconductor (CMOS) or transistor-transistor logic (TTL) technology, or any combination thereof.
Alternatively, or additionally, all or portions of these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more operating system (OS) applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. For example, the memory 12, persistent storage media, or other system memory may store a set of instructions which, when executed by the processor 11, cause the system 10 to implement one or more components, features, or aspects of the system 10 (e.g., the logic 13, identifying a core text region, a supportive text region, and a background region of an image based on semantic segmentation, detecting text in the image based on the identified core text region and supportive text region, etc.).
Turning now to
Embodiments of logic 22, and other components of the apparatus 20, may be implemented in hardware, software, or any combination thereof including at least a partial implementation in hardware. For example, hardware implementations may include configurable logic such as, for example, PLAs, FPGAs, CPLDs, or fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS, or TTL technology, or any combination thereof. Additionally, portions of these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more OS applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
Turning now to
Embodiments of the method 30 may be implemented in a system, apparatus, computer, device, etc., for example, such as those described herein. More particularly, hardware implementations of the method 30 may include configurable logic such as, for example, PLAs, FPGAs, CPLDs, or fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS, or TTL technology, or any combination thereof. Alternatively, or additionally, the method 30 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more OS applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
For example, the method 30 may be implemented on a computer readable medium as described in connection with Examples 19 to 24 below. Embodiments or portions of the method 30 may be implemented in firmware, applications (e.g., through an application programming interface (API)), or driver software running on an operating system (OS).
Turning now to
Embodiments of the scene text detection network 41, the post-processor 42, and other components of the scene text detector 40, may be implemented in hardware, software, or any combination thereof. For example, hardware implementations may include configurable logic such as, for example, PLAs, FPGAs, CPLDs, or fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS, or TTL technology, or any combination thereof. Additionally, portions of these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more OS applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
Some embodiments may advantageously provide scene text detection in unconstrained environments. For example, some embodiments may divide text regions into two or more parts including a core text region and supportive regions. Identifying a core text region may be particularly useful for scene text detection because the more central or core part of text regions may be more likely to be pure text, while the supportive part may be mixed with different background information. Some embodiments may utilize a fully convolutional neural network (FCN) framework for unconstrained scene text detection. For example, a suitably trained FCN may create a scene text detection network. An input image may be provided to the scene text detection network to separate core text regions and supportive text regions from the background region of the image in a semantic segmentation formulation.
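To make the segmentation formulation concrete, the following is a minimal sketch, assuming a PyTorch-style FCN with three output classes (background, supportive text, core text); the TinyTextFCN module, its layer sizes, and the class indices are illustrative assumptions and do not reproduce the actual network topology (e.g., the dense features, reverse connections, and stage losses described below).

```python
# Minimal sketch of three-class semantic segmentation for scene text,
# assuming PyTorch; TinyTextFCN is a hypothetical stand-in, not the
# scene text detection network of the embodiments.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 3  # assumed indices: 0 = background, 1 = supportive, 2 = core

class TinyTextFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(64, NUM_CLASSES, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.classifier(self.encoder(x))
        # Upsample back to input resolution so every pixel gets a class score.
        return F.interpolate(logits, size=(h, w), mode="bilinear",
                             align_corners=False)

net = TinyTextFCN()
image = torch.randn(1, 3, 480, 640)   # e.g., a VGA input frame
labels = net(image).argmax(dim=1)     # per-pixel class map
core_mask = labels == 2               # candidate core text pixels
supportive_mask = labels == 1         # candidate supportive text pixels
```

During training, a per-pixel cross-entropy loss over the three classes would drive the separation; post-processing may then turn the core and supportive masks into word boxes.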
Some embodiments may provide an end-to-end learning framework which may separate core text regions and supportive text regions from background image information based on semantic segmentation. Advantageously, providing an additional segmentation class for the supportive text region may improve the scene text detection results. As compared to some other scene text detection techniques, some embodiments may involve much simpler post-processing which may be readily implemented in an end-to-end framework (e.g., on a mobile or edge device). For example, some embodiments may provide better performance with near real-time processing speed. Some embodiments may provide a software stack for text-based visual analytics which may be utilized for a variety of applications including autonomous driving cars, IoT (e.g., a self-service store), player identification in sports, exercise searching for online children's education, text translation for travelers, image retrieval, a daily assistant for blind or visually impaired persons, and so on.
Turning now to
Turning now to
Turning now to
Turning now to
Turning now to
Turning now to
Post-Processing Examples
Turning now to
Turning now to
For example, some embodiments may expand a word's borders (e.g., left, right, up, down) based on the following rules: a) the distance of border expansion may be proportional to the core-region size (e.g., the distance of left and right border expansion may be proportional to the width of the core text region, while the distance of up and down border expansion may be proportional to the height of the core text region); b) if the border meets with any border of another word after expanding, then the current expansion is canceled, and no further expansion is made in that direction; c) if the word rectangle does not introduce additional supportive text pixels after expanding, then the current expansion is canceled, and no further expansion is made in that direction. The method 110 may determine whether the bounding boxes have stopped expanding for all words of the connected cluster at block 117 and, if not, may return to block 116 to continue the expansion process for every word until all borders stop expanding. When the expansion has completed at block 117, the method 110 may then rotate each word rectangle back to its original direction at block 118.
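The following is a minimal sketch of rules a) through c), assuming axis-aligned boxes and a NumPy mask of supportive-text pixels; the RATIO step-size parameter and the helper names are illustrative assumptions, and the rotation of each word rectangle back to its original direction (block 118) is omitted for brevity.

```python
# Hedged sketch of the border-expansion post-processing; boxes are
# (x0, y0, x1, y1) in pixels and supportive_mask is a boolean HxW array.
import numpy as np

RATIO = 0.1  # assumed: per-step expansion as a fraction of core-region size

def overlaps(a, b):
    """True if boxes a and b (x0, y0, x1, y1) intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def expand_words(boxes, supportive_mask):
    """Grow each word box until every border stops, per rules a)-c)."""
    boxes = [list(map(int, b)) for b in boxes]
    h, w = supportive_mask.shape
    active = [[True] * 4 for _ in boxes]  # per box: left, up, right, down
    while any(any(sides) for sides in active):
        for i in range(len(boxes)):
            for side in range(4):
                if not active[i][side]:
                    continue
                x0, y0, x1, y1 = boxes[i]
                # Rule a): step size proportional to core-region width/height.
                dx = max(1, int(RATIO * (x1 - x0)))
                dy = max(1, int(RATIO * (y1 - y0)))
                step = [(-dx, 0, 0, 0), (0, -dy, 0, 0),
                        (0, 0, dx, 0), (0, 0, 0, dy)][side]
                cand = [max(0, x0 + step[0]), max(0, y0 + step[1]),
                        min(w, x1 + step[2]), min(h, y1 + step[3])]
                # Rule b): cancel if the new border meets another word's box.
                if any(j != i and overlaps(cand, boxes[j])
                       for j in range(len(boxes))):
                    active[i][side] = False
                    continue
                # Rule c): cancel if no new supportive-text pixels are gained.
                old = supportive_mask[y0:y1, x0:x1].sum()
                new = supportive_mask[cand[1]:cand[3], cand[0]:cand[2]].sum()
                if new <= old:
                    active[i][side] = False
                else:
                    boxes[i] = cand
    return boxes

mask = np.zeros((480, 640), dtype=bool)
mask[100:140, 50:300] = True                   # fake supportive-text pixels
print(expand_words([(120, 110, 180, 130)], mask))
```

Each border either gains at least one supportive pixel per step or is deactivated, so the loop terminates; this mirrors the stop-on-all-borders test of block 117.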
Some embodiments of a scene text detector (e.g., a scene text detection network together with a post-processor) may provide good performance on the COCO-Text challenge dataset. For example, some embodiments of a scene text detector may achieve an F-score of about 47.15% on the validation set with near real-time processing speed (e.g., about 15 fps on VGA input).
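For reference, the F-score quoted above is conventionally the harmonic mean of precision and recall; the precision and recall values behind the approximately 47.15% figure are not given here, so the inputs in the sketch below are placeholders.

```python
# F-score as the harmonic mean of precision and recall; the inputs are
# hypothetical, not the measured COCO-Text values.
def f_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f_score(0.50, 0.45))  # hypothetical inputs -> ~0.4737
```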
Turning now to
The processor core 200 is shown including execution logic 250 having a set of execution units 255-1 through 255-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 250 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic 260 retires the instructions of the code 213. In one embodiment, the processor core 200 allows out of order execution but requires in order retirement of instructions. Retirement logic 265 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 200 is transformed during execution of the code 213, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 225, and any registers (not shown) modified by the execution logic 250.
Although not illustrated in
Referring now to
The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in
As shown in
Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b (e.g., static random access memory/SRAM). The shared cache 1896a, 1896b may store data (e.g., objects, instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared cache 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.
While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of the processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as the first processing element 1070, additional processor(s) that are heterogeneous or asymmetric to the first processing element 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 1070, 1080 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the various processing elements 1070, 1080 may reside in the same die package.
The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088. As shown in
The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively. As shown in
In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
As shown in
Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of
Example 1 may include an electronic processing system, comprising a processor, memory communicatively coupled to the processor, and logic communicatively coupled to the processor to apply a trained scene text detection network to an image to identify a core text region, a supportive text region, and a background region of the image, and detect text in the image based on the identified core text region and supportive text region.
Example 2 may include the system of Example 1, wherein the logic is further to split connected words into one or more word regions based on the identified core text region and supportive text region.
Example 3 may include the system of Example 1, wherein the logic is further to remove a word region in response to a lack of core text region pixels in the word region.
Example 4 may include the system of Example 1, wherein the logic is further to train the scene text detection network with a plurality of image training samples, the scene text detection network including a dense features portion, a reverse connections portion communicatively coupled to the dense features portion, and a stage losses portion communicatively coupled to the reverse connections portion.
Example 5 may include the system of Example 4, wherein the logic is further to support large receptive field features for the dense features portion.
Example 6 may include the system of any of Examples 4 to 5, wherein the logic is further to train the scene text detection network with a plurality of online hard examples mining training samples.
Example 7 may include a semiconductor package apparatus, comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is at least partly implemented in one or more of configurable logic and fixed-functionality hardware logic, the logic coupled to the one or more substrates to apply a trained scene text detection network to an image to identify a core text region, a supportive text region, and a background region of the image, and detect text in the image based on the identified core text region and supportive text region.
Example 8 may include the apparatus of Example 7, wherein the logic is further to split connected words into one or more word regions based on the identified core text region and supportive text region.
Example 9 may include the apparatus of Example 7, wherein the logic is further to remove a word region in response to a lack of core text region pixels in the word region.
Example 10 may include the apparatus of Example 7, wherein the logic is further to train the scene text detection network with a plurality of image training samples, the scene text detection network including a dense features portion, a reverse connections portion communicatively coupled to the dense features portion, and a stage losses portion communicatively coupled to the reverse connections portion.
Example 11 may include the apparatus of Example 10, wherein the logic is further to support large receptive field features for the dense features portion.
Example 12 may include the apparatus of any of Examples 10 to 11, wherein the logic is further to train the scene text detection network with a plurality of online hard examples mining training samples.
Example 13 may include a method of detecting text, comprising applying a trained scene text detection network to an image to identify a core text region, a supportive text region, and a background region of the image, and detecting text in the image based on the identified core text region and supportive text region.
Example 14 may include the method of Example 13, further comprising splitting connected words into one or more word regions based on the identified core text region and supportive text region.
Example 15 may include the method of Example 13, further comprising removing a word region in response to a lack of core text region pixels in the word region.
Example 16 may include the method of Example 13, further comprising training the scene text detection network with a plurality of image training samples, the scene text detection network including a dense features portion, a reverse connections portion communicatively coupled to the dense features portion, and a stage losses portion communicatively coupled to the reverse connections portion.
Example 17 may include the method of Example 16, further comprising supporting large receptive field features for the dense features portion.
Example 18 may include the method of any of Examples 16 to 17, further comprising training the scene text detection network with a plurality of online hard examples mining training samples.
Example 19 may include at least one computer readable medium, comprising a set of instructions, which when executed by a computing device, cause the computing device to apply a trained scene text detection network to an image to identify a core text region, a supportive text region, and a background region of the image, and detect text in the image based on the identified core text region and supportive text region.
Example 20 may include the at least one computer readable medium of Example 19, comprising a further set of instructions, which when executed by the computing device, cause the computing device to split connected words into one or more word regions based on the identified core text region and supportive text region.
Example 21 may include the at least one computer readable medium of Example 19, comprising a further set of instructions, which when executed by the computing device, cause the computing device to remove a word region in response to a lack of core text region pixels in the word region.
Example 22 may include the at least one computer readable medium of Example 19, comprising a further set of instructions, which when executed by the computing device, cause the computing device to train the scene text detection network with a plurality of image training samples, the scene text detection network including a dense features portion, a reverse connections portion communicatively coupled to the dense features portion, and a stage losses portion communicatively coupled to the reverse connections portion.
Example 23 may include the at least one computer readable medium of Example 22, comprising a further set of instructions, which when executed by the computing device, cause the computing device to support large receptive field features for the dense features portion.
Example 24 may include the at least one computer readable medium of any of Examples 22 to 23, comprising a further set of instructions, which when executed by the computing device, cause the computing device to train the scene text detection network with a plurality of online hard examples mining training samples.
Example 25 may include a text detector apparatus, comprising means for applying a trained scene text detection network to an image to identify a core text region, a supportive text region, and a background region of the image, and means for detecting text in the image based on the identified core text region and supportive text region.
Example 26 may include the apparatus of Example 25, further comprising means for splitting connected words into one or more word regions based on the identified core text region and supportive text region.
Example 27 may include the apparatus of Example 25, further comprising means for removing a word region in response to a lack of core text region pixels in the word region.
Example 28 may include the apparatus of Example 25, further comprising means for training the scene text detection network with a plurality of image training samples, the scene text detection network including a dense features portion, a reverse connections portion communicatively coupled to the dense features portion, and a stage losses portion communicatively coupled to the reverse connections portion.
Example 29 may include the apparatus of Example 28, further comprising means for supporting large receptive field features for the dense features portion.
Example 30 may include the apparatus of any of Examples 28 to 29, further comprising means for training the scene text detection network with a plurality of online hard examples mining training samples.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrase “one or more of A, B, and C” and the phrase “one or more of A, B, or C” both may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.