Text is frequently received electronically in a form that is not textually editable. For instance, data representing an image of text may be received. The data may have been generated by scanning a hardcopy of the text using a scanning device. The text is not textually editable because the data represents an image of the text, as opposed to representing the text itself in a textually editable and non-image form, and thus the text cannot be edited using a word processing computer program, a text editing computer program, and so on. To convert the data to a textually editable and non-image form, optical character recognition (OCR) may be performed on the image, which generates data representing the text in a textually editable and non-image form, so that the data can be edited using a word processing computer program, a text editing computer program, and so on.
As noted in the background section, data can represent an image of text, as opposed to representing the text itself in a textually editable and non-image form that can be edited using a word processing computer program, a text editing computer program, and so on. To convert the data to a textually editable and non-image form, optical character recognition (OCR) may be performed on the image. Performing OCR on the image generates data representing the text in a textually editable and non-image form, so that the data can be edited using a computer program like a word processing computer program or a text editing computer program.
However, OCR is not perfect. That is, even the best OCR techniques do not yield 100% accuracy in converting an image of text to a non-image form of the text. An OCR technique is said to be performed by an OCR engine, such as an OCR computer program. Different OCR engines may perform different types of OCR techniques. Furthermore, different OCR engines may be able to perform OCR more accurately on images of text of different text types.
For example, one OCR engine may be able to perform OCR more accurately on images of text in one font, whereas another OCR engine may be able to perform OCR more accurately on images of text in another font. As another example, one OCR engine may be able to perform OCR more accurately on images of text that is underlined, whereas another OCR engine may be able to perform OCR more accurately on images of the same text that is not underlined. A given image of text, however, may include text of different text types. As such, it is difficult to select one OCR engine that is able to perform OCR most accurately on each different text type.
Disclosed herein are approaches to compensate for these drawbacks of OCR techniques. Specifically, each of a number of different OCR engines has a confidence value for each of a number of different text types. When an image of unknown text having a given text type is received, the image is input into each OCR engine, and output text corresponding to this image is received from each OCR engine. If the output text received from each OCR engine is not identical, then the output text from one OCR engine is selected as at least provisionally correct for the unknown text, based on the confidence values of the OCR engines for the given text type of the unknown text.
The confidence values for an OCR engine can be determined for a particular text type by generating an image of a known text sample having this text type. The image is input into the OCR engine to receive output text corresponding to the known text sample from the OCR engine. The output text received from the OCR engine is compared to the known text sample to determine the confidence value of the OCR engine for this text type. This process is repeated for known text samples of different text types, and for different OCR engines, to determine the confidence values for each OCR engine for each different text type.
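As a concrete illustration, the confidence values can be organized as a table keyed by OCR engine and text type. The following is a minimal Python sketch of such a table; the engine names, text-type keys, and values are hypothetical placeholders rather than anything taken from the approaches themselves.

```python
# A minimal sketch of the per-engine, per-text-type confidence table.
# Engine names, text-type keys, and values are hypothetical examples.
confidence_values: dict[str, dict[tuple, float]] = {
    "engine_a": {("Times", 12, "regular"): 0.95,
                 ("Times", 12, "underlined"): 0.88},
    "engine_b": {("Times", 12, "regular"): 0.91,
                 ("Times", 12, "underlined"): 0.93},
}

def confidence(engine: str, text_type: tuple) -> float:
    """Look up an engine's confidence value for a given text type."""
    return confidence_values[engine][text_type]
```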
In the example of FIG. 1, the image 102 is input into three OCR engines 104A, 104B, and 104C, which are collectively referred to as the OCR engines 104.
The OCR engines 104 convert the image 102 to data representing text 108A, 108B, and 108C, respectively, which are collectively referred to as the text 108. The data representing each text 108 may be formatted in accordance with the ASCII or Unicode standard, for instance, and may be stored in a TXT, DOC, or RTF file format, among other text-oriented file formats. The data representing each text 108 can include one or more bytes for each character of the text, in accordance with a standard like the ASCII or Unicode standard, among other standards commonly used to represent such characters.
For example, consider the letter “q” in the text. A collection of pixels corresponds to the location of this letter within the image 102. If the image is a black-and-white image, each pixel is on or off, such that the collection of on-and-off pixels forms an image of the letter “q.” Note that this collection of pixels may differ depending on how the image was generated. For instance, one scanning device may scan a hardcopy of the text such that there are few or no artifacts (i.e., extraneous pixels) within the part of the image corresponding to the letter “q.” By comparison, another scanning device may scan the hardcopy such that there are more artifacts within the part of the image corresponding to this letter.
From the perspective of a user, the user is easily able to recognize the part of each image as corresponding to the letter “q.” However, the portions of the images corresponding to the letter “q” are not identical to one another, and do not conform to any standard. As such, without performing a process like OCR, a computing device is unable to discern that the portion of each image corresponds to the letter “q.”
By comparison, consider the letter “q” within the text 108 representing the text in a non-image form that may be textually editable. The letter is represented in accordance with a standard, like the ASCII or Unicode standard, by which different computing devices know that this letter is in fact the letter “q.” From the perspective of a computing device, the computing device is able to discern that the portion of the data representing this letter within the text 108 indeed represents the letter “q.”
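As a brief illustration of such a standards-based representation, the following Python snippet shows that the letter “q” has a single agreed-upon code under the ASCII and Unicode standards, which is why any computing device can identify it unambiguously:

```python
# The letter "q" has a fixed code point under the ASCII and Unicode
# standards, so any computing device can identify it unambiguously.
print(ord("q"))             # 113: the ASCII/Unicode code point for "q"
print("q".encode("ascii"))  # b'q': one byte per character under ASCII
print("q".encode("utf-8"))  # b'q': ASCII characters are also one byte in UTF-8
```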
In the example of FIG. 1, the text 108A, 108B, and 108C output by the OCR engines 104 are not identical to one another.
Therefore, which of the text 108 at least provisionally correctly corresponds to the text represented within the image 102 is selected based on the confidence values 106 of the OCR engines 104 for the text type of this text. As depicted in FIG. 1, the OCR engines 104 have respective confidence values 106 for each of a number of different text types.
An image of unknown text having a text type is received (202). The text is unknown in that there is not a corresponding version of the text in non-image form. The text type of the text may be one or more of the following: a particular type of font; a particular font size; whether or not the text is italicized; whether or not the text is bolded; whether or not the text is underlined; and whether or not the text has been struck through. These text types can be respectively considered as font type, font size, presence of italics, presence of bold, presence of underlining, and presence of strikethrough.
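One way to model such a text type in code is as a small immutable record combining the attributes just listed. The following Python sketch is illustrative only; the field names are assumptions, not terms taken from the method 200.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextType:
    """A text type as described above: font attributes plus styling flags.

    The field names are illustrative; any equivalent record would do.
    """
    font: str            # font type, e.g. "Times New Roman"
    size: int            # font size in points
    italic: bool         # presence of italics
    bold: bool           # presence of bold
    underlined: bool     # presence of underlining
    strikethrough: bool  # presence of strikethrough
```

Because the record is frozen, it is hashable, and so can serve directly as a key when associating confidence values with text types.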
The text type of the unknown text in the image is known a priori, or is otherwise determined. For example, existing OCR techniques can be employed to determine the text type of the unknown text. Furthermore, for the purposes of the method 200, it is assumed that the unknown text is a single word. More generally, however, the method 200 is applicable to each word of multiple words represented within the image.
The definition of a word herein can be one or more characters between a leading space or a leading punctuation mark and a lagging space or a lagging punctuation mark. Examples of punctuation marks include periods, commas, semicolons, colons, and so on. As such, a word can include non-letter characters, such as numbers, as well as other non-letter characters, such as various symbols. Furthermore, a hyphenated word (i.e., a word containing a hyphen) can be considered as one word, including both parts of the word on either side of the hyphen, or each part of the word may be considered as a different word.
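Under this definition, words could be extracted as in the following Python sketch, which treats runs of characters between spaces and punctuation marks as words. The exact punctuation set and the helper name are assumptions for illustration.

```python
import re

# Punctuation marks that delimit words, per the definition above.
# The exact set chosen here is an assumption for illustration.
_PUNCTUATION = r".,;:!?"

def split_words(text: str, split_hyphenated: bool = False) -> list[str]:
    """Split text into words: runs of characters between spaces and
    punctuation marks. Numbers and symbols count as word characters.

    A hyphenated word is kept whole by default; pass split_hyphenated=True
    to treat each part of the word around the hyphen as a separate word.
    """
    words = re.findall(rf"[^\s{_PUNCTUATION}]+", text)
    if split_hyphenated:
        words = [part for word in words for part in word.split("-") if part]
    return words

# For example, numbers count as words, and punctuation is dropped:
assert split_words("Order 66, now!") == ["Order", "66", "now"]
```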
The image of the unknown text is input into each of a number of OCR engines that use different OCR techniques to convert the image into text in a non-image form (204). As such, output text corresponding to the image of the unknown text is received from each OCR engine (206), in non-image form. The output text received from each OCR engine may be identical, or may be different. For example, for an image of unknown text A, one OCR engine may output text A1 as corresponding to this image, whereas two other OCR engines may output text A2 as corresponding to this image, where A2 represents different text from A1, and it is not known a priori which of A1 and A2 correctly corresponds to A, since the text A within the image is unknown.
If the output text received from each OCR engine is identical, then the method 200 is finished (210). In this case, it is concluded that the output text received from each OCR engine at least provisionally corresponds correctly to the unknown text within the image. For example, for an image of unknown text A, if each OCR engine outputs text A1, then it is concluded that A1 at least provisionally corresponds correctly to A.
However, if the output text received from each OCR engine is not identical, then the output text from one of the OCR engines is selected as at least provisionally correct for the unknown text within the image, based on the confidence values of the OCR engines for the text type of the unknown text (212). How this output text is selected based on the confidence values of the OCR engines for the text type of the unknown text can vary. For example, the output text may be selected as the output text received from the OCR engine having the highest confidence value for the text type of the unknown text (214), as described above in relation to
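A minimal Python sketch of the selection rule of part 214, assuming the per-engine outputs and the confidence values for the relevant text type are available as plain dictionaries:

```python
def select_by_highest_confidence(outputs: dict[str, str],
                                 confidences: dict[str, float]) -> str:
    """Select the output text received from the OCR engine having the
    highest confidence value for the text type of the unknown text
    (part 214).

    outputs maps engine name -> output text; confidences maps engine
    name -> that engine's confidence value for the text type in question.
    """
    best_engine = max(outputs, key=lambda engine: confidences[engine])
    return outputs[best_engine]
```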
As another example, for each OCR engine, a weight for the output text received from this OCR engine may be set equal to the confidence value of the OCR engine for the text type of the unknown text (216). Where the output text received from two or more OCR engines is identical, the sum of the weights for these OCR engines is set as the weight for the output text received from these OCR engines (218). The output text that is selected as at least provisionally correct for the unknown text within the image is the output text that has the highest weight (220).
For example, there may be four OCR engines that have confidence values of 0.9, 0.8, 0.8, and 0.7 for the text type of the unknown text of the image. The OCR engine having the confidence value of 0.9 may output text A1 as corresponding to the unknown text of the image. The other three OCR engines may, by comparison, output text A2 as corresponding to the unknown text of the image. In conjunction with parts 216, 218, and 220, then, the weight for the text A1 is 0.9, whereas the weight for the text A2 is 0.8+0.8+0.7=2.3. Therefore, the output text selected as at least provisionally correct for the unknown text within the image is the text A2, even though the OCR engine outputting the text A1 has the highest confidence value for the text type of the unknown text.
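The weighted selection of parts 216, 218, and 220 can be sketched in Python as follows; running the four-engine example above through it reproduces the weights 0.9 and 2.3 and selects the text A2.

```python
from collections import defaultdict

def select_by_weighted_vote(outputs: dict[str, str],
                            confidences: dict[str, float]) -> str:
    """Select output text by summing, for each distinct output text, the
    confidence values of the engines that produced it (parts 216-220)."""
    weights: dict[str, float] = defaultdict(float)
    for engine, text in outputs.items():
        weights[text] += confidences[engine]  # parts 216 and 218
    return max(weights, key=weights.get)      # part 220

# The four-engine example above: A2 wins with 0.8 + 0.8 + 0.7 = 2.3 > 0.9.
outputs = {"e1": "A1", "e2": "A2", "e3": "A2", "e4": "A2"}
confidences = {"e1": 0.9, "e2": 0.8, "e3": 0.8, "e4": 0.7}
assert select_by_weighted_vote(outputs, confidences) == "A2"
```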
The method 200 can be performed on a word-by-word basis for multiple words within the unknown text of an image. The text type of each such word may further be different. For example, different words within the text may have different fonts, different font sizes, some words may be underlined whereas other words may not be underlined, some words may be italicized whereas other words may not be italicized, and so on. Each OCR engine has a confidence value for each different text type.
For example, one OCR engine may have a confidence value for a particular font regardless of size, another OCR engine may have a confidence value for different sizes of a particular font, and so on. As another example, one OCR engine may have a confidence value for underlined text regardless of font, whereas another OCR engine may have a confidence value that is the same for a particular font regardless of whether the text is underlined. As a third example, a given OCR engine may have one confidence value for underlined text that is not italicized, another confidence value for italicized text that is not underlined, and a third confidence value for text that is both underlined and italicized.
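Because different engines may thus define their confidence values at different granularities of text type, looking up a value might fall back from a specific text-type key to progressively more general ones. The key scheme and fallback order in the following Python sketch are purely assumptions for illustration:

```python
def lookup_confidence(table: dict, font: str, size: int,
                      underlined: bool, default: float = 0.0) -> float:
    """Return an engine's confidence value, trying progressively more
    general text-type keys. The key scheme and fallback order here are
    illustrative assumptions only.
    """
    candidate_keys = [
        (font, size, underlined),   # most specific: font, size, underlining
        (font, size),               # a particular font at a particular size
        (font,),                    # a particular font regardless of size
        ("underlined",) if underlined else ("regular",),  # styling only
    ]
    for key in candidate_keys:
        if key in table:
            return table[key]
    return default
```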
An image of the known text sample is generated (302). For example, the image may be generated by printing the known text sample using a printing device, and then scanning the printed hardcopy using a scanning device to generate data representing the image. As another example, a type of printer driver that generates images from text can be used to generate the image of the known text sample.
As a third example, the image can be generated by obtaining an image of text, where the text is not known a priori. A user can then manually input the text within the image, resulting in the text being known, such that the text of the image becomes the known text sample. As a related example, OCR may be performed on such an image of text, and the results of the OCR manually verified and corrected if there are any errors, resulting in the text being known, such that the text of the image becomes the known text sample.
For each OCR engine, the following is then performed (304). The image of the known text sample is input into the OCR engine (306). Output text corresponding to the image of the known text sample is subsequently received from the OCR engine (308). The output text received from the OCR engine is compared with the known text sample, to determine the confidence value of the OCR engine for the text type of the known text sample (310).
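Parts 304 through 310 amount to the following loop, sketched in Python. The callables run_ocr and accuracy are hypothetical stand-ins for the engine invocation and for one of the comparison methods described below.

```python
def determine_confidence_values(engines, image, known_text, run_ocr, accuracy):
    """For each OCR engine (part 304): input the image of the known text
    sample into the engine (306), receive the output text (308), and
    compare it against the known text sample to determine the engine's
    confidence value for this text type (310).

    run_ocr(engine, image) -> str and accuracy(output, known) -> float
    are hypothetical callables, not part of the described methods.
    """
    confidences = {}
    for engine in engines:                                       # part 304
        output_text = run_ocr(engine, image)                     # parts 306, 308
        confidences[engine] = accuracy(output_text, known_text)  # part 310
    return confidences
```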
The confidence values for each OCR engine of one or more of the OCR engines can be periodically redetermined (312). For instance, when an OCR engine is upgraded to improve its OCR technique, the confidence values for that OCR engine can be responsively updated. As another example, new known text samples may be added, such that the confidence values for the OCR engines are responsively redetermined. As such, the process of the method 300 can be dynamic, in that the confidence values for the OCR engines may be periodically redetermined as desired.
The number of characters of the output text that are identical to corresponding characters of the known text sample is then divided by the total number of characters of the output text to yield the confidence value of the OCR engine for the text type of the known text sample (404). For instance, the number of characters of the words of the output text that are identical to corresponding characters of the known text sample may be divided by the total number of characters of the words of the output text to yield this confidence value. The method 400 is repeated for each OCR engine, to determine the confidence value of each OCR engine for the text type of the known text sample.
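A Python sketch of this character-level computation follows. How output characters are paired with characters of the known text sample is not spelled out above, so simple position-by-position comparison is assumed here:

```python
def character_accuracy(output_text: str, known_text: str) -> float:
    """Method 400: the number of output characters identical to the
    corresponding characters of the known text sample, divided by the
    total number of output characters (part 404). Position-by-position
    pairing of characters is an assumption for illustration."""
    if not output_text:
        return 0.0
    matches = sum(1 for out_ch, known_ch in zip(output_text, known_text)
                  if out_ch == known_ch)
    return matches / len(output_text)
```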
In the method 410 of FIG. 4B, the number of words of the output text that are identical to corresponding words of the known text sample is divided by the total number of words of the output text to yield the confidence value of the OCR engine for the text type of the known text sample.
In the method 420 of FIG. 4C, the number of characters of the words of the output text that are identical to corresponding words of the known text sample is divided by the total number of characters of the words of the output text to yield the confidence value of the OCR engine for the text type of the known text sample.
The methods 400, 410, and 420 thus vary in how the confidence value of an OCR engine for the text type of the known text sample is determined. The method 400 determines the confidence value by inspecting individual characters for accuracy, whereas the method 410 determines the confidence value by inspecting each word as a whole for accuracy. The method 420 is similar to the method 410, but effectively weights words correctly recognized by the OCR engine by the number of characters within the words. For instance, a correctly recognized word that has twelve characters affects the confidence value of the OCR engine for the text type of the known text sample more than a correctly recognized word that has four characters does.
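Python sketches of the two word-oriented computations follow; the word lists could be produced by a helper like the split_words sketch above, and position-by-position pairing of words is again an assumption:

```python
def word_accuracy(output_words: list[str], known_words: list[str]) -> float:
    """Method 410: the number of output words identical to the
    corresponding words of the known text sample, divided by the total
    number of output words."""
    if not output_words:
        return 0.0
    matches = sum(1 for out, known in zip(output_words, known_words)
                  if out == known)
    return matches / len(output_words)

def char_weighted_word_accuracy(output_words: list[str],
                                known_words: list[str]) -> float:
    """Method 420: like method 410, but each correctly recognized word
    counts in proportion to its number of characters, so a correctly
    recognized twelve-character word affects the result more than a
    correctly recognized four-character word."""
    total_chars = sum(len(word) for word in output_words)
    if total_chars == 0:
        return 0.0
    matched_chars = sum(len(out) for out, known in zip(output_words, known_words)
                        if out == known)
    return matched_chars / total_chars
```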
Variations and extensions can be made to the methods that have been described above. For instance, in the methods 410 and 420 of
In conclusion, FIG. 5 shows an example system 500, which includes a processor 502, confidence value determination logic 504, word selection logic 506, and a computer-readable medium 508, as well as the OCR engines 104.
The computer-readable medium 508 stores data representing an image 102 of unknown text, as well as output text 108 corresponding to this image 102. The computer-readable medium 508 also stores known text samples 510, images 512 of the known text samples 510, and output text 514 corresponding to the images 512. The OCR engines 104 generate the output text 108 after receiving the image 102 as input, and generate the output text 514 after receiving the images 512 as input.
The confidence value determination logic 504 is executed by the processor 502, and thus may be implemented as one or more computer programs stored on the computer-readable medium 508, or another computer-readable medium. The logic 504 determines the confidence value of each OCR engine 104 by performing the method 300 of FIG. 3.
The word selection logic 506 is also executed by the processor 502, and thus may also be implemented as one or more computer programs stored on the computer-readable medium 508, or another computer-readable medium. The logic 506 selects which of the output text 108 to use as at least provisionally correct for the unknown text of the image 102 by performing the method 200 of FIG. 2.
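As one way to picture how these components fit together, the following Python sketch wires calibration and selection into a single class. All names are illustrative, and the weighted-vote variant of the selection is assumed:

```python
class OcrSelectionSystem:
    """Illustrative wiring of the system 500: determination of per-text-type
    confidence values, followed by selection among the engines' outputs."""

    def __init__(self, engines, run_ocr):
        self.engines = engines
        self.run_ocr = run_ocr  # hypothetical callable: (engine, image) -> text
        self.confidences = {}   # maps (engine, text_type) -> confidence value

    def calibrate(self, image, known_text, text_type, accuracy):
        # Confidence value determination logic 504, per the method 300.
        for engine in self.engines:
            output = self.run_ocr(engine, image)
            self.confidences[(engine, text_type)] = accuracy(output, known_text)

    def select(self, image, text_type):
        # Word selection logic 506, per the method 200 (weighted-vote variant).
        outputs = {engine: self.run_ocr(engine, image) for engine in self.engines}
        if len(set(outputs.values())) == 1:  # all engines agree
            return next(iter(outputs.values()))
        weights = {}
        for engine, text in outputs.items():
            weights[text] = (weights.get(text, 0.0)
                             + self.confidences[(engine, text_type)])
        return max(weights, key=weights.get)
```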