1. Field of the Invention
The present invention relates to a character recognition apparatus, a character recognition method, and a recording medium in which a character recognition program is stored. In particular, the present invention relates to a character recognition apparatus, a character recognition method, and a recording medium in which a character recognition program is stored, which enable the digitalization of documents in which printed characters and handwritten characters are mixed.
2. Description of the Related Art
In recent years, documents are increasingly being circulated using electronic means such as e-mail, but there are also many instances where documents are outputted on paper. One reason for this is that it is easy to add notes by hand to paper documents.
Printed characters, in which electronic information such as character codes has been outputted on paper, can be returned with high probability to digitalized electronic information by using optical character reader (OCR) software. Conventionally, however, a practical recognition rate cannot be obtained for character information written by hand unless strict conditions are imposed, such as requiring entry in designated grids or restricting entries to numerals only, and this hinders the exchange of information between online and offline media.
The present invention has been made in view of the above circumstances and provides a character recognition apparatus, a character recognition method, and a recording medium in which a character recognition program is stored, which enable the digitalization of documents in which printed and handwritten characters are mixed.
The character recognition apparatus of an aspect of the invention includes: a separation processing unit that separates, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; a printed character portion recognition processing unit that character-recognizes the printed character portions; and a handwritten character portion recognition processing unit that utilizes the character recognition result of the printed character portions to character-recognize the handwritten character portions.
Embodiments of the invention will be described in detail on the basis of the following drawings, wherein:
The printed character portion/handwritten character portion separation processing unit 12 generates a histogram on the basis of the contrast of pixels in the image data and the character colors, and on the basis of this separates the image data into image data comprising a printed character portion and image data comprising a handwritten character portion. If the image data comprising the printed character portion can be identified, then image portions present at other places may be regarded as the handwritten character portion.
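The histogram-based separation described above might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: it assumes each character region has already been reduced to a mean gray level (0 to 255), and that the most populated histogram bin corresponds to the printed ink. The function name `separate_regions` and the bin width are hypothetical.

```python
from collections import Counter

def separate_regions(regions, bin_width=32):
    """Split regions into (printed, handwritten) using a gray-level histogram.

    regions: list of (region_id, mean_gray) pairs, mean_gray in 0..255.
    Assumption: the most populated histogram bin is the printed ink.
    """
    if not regions:
        return [], []
    # Histogram of region gray levels, quantised into coarse bins.
    histogram = Counter(gray // bin_width for _, gray in regions)
    printed_bin, _ = histogram.most_common(1)[0]
    printed = [rid for rid, gray in regions if gray // bin_width == printed_bin]
    # Regions not identified as printed are regarded as handwritten.
    handwritten = [rid for rid, gray in regions if gray // bin_width != printed_bin]
    return printed, handwritten

# Hypothetical regions: four dark (printed) ones and one lighter pen stroke.
regions = [("r1", 10), ("r2", 14), ("r3", 20), ("r4", 120), ("r5", 12)]
printed, handwritten = separate_regions(regions)
```

As in the description, only the printed regions are identified positively; whatever remains is treated as the handwritten character portion.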
The printed character portion OCR processing unit 13 uses pattern matching to compare the character patterns of the cut-out printed characters with printed character patterns registered in the printed character OCR dictionary 14, and outputs the registered patterns with the highest similarity as the recognition result of the printed character portion.
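As a minimal sketch of such pattern matching, the example below compares a cut-out glyph with registered patterns and returns the one with the highest similarity. The 3x3 bitmaps, the pixel-agreement similarity measure, and the names `similarity` and `recognize_printed` are assumptions made for illustration; the description does not specify the similarity measure.

```python
def similarity(a, b):
    """Fraction of pixels that agree between two equal-sized binary bitmaps."""
    cells = [(pa, pb) for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b)]
    return sum(pa == pb for pa, pb in cells) / len(cells)

def recognize_printed(glyph, dictionary):
    """Return (character, similarity) for the best-matching registered pattern."""
    scored = ((ch, similarity(glyph, pattern)) for ch, pattern in dictionary.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical stand-in for the printed character OCR dictionary 14.
DICTIONARY = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}

# An "L" with one missing pixel in the bottom row.
noisy_l = ["100", "100", "110"]
best_char, best_score = recognize_printed(noisy_l, DICTIONARY)
```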
The printed character OCR dictionary 14, the related word/synonym/antonym dictionary 16, the registration dictionary 17, the handwritten character OCR dictionary 19, the OCR result storage unit 20 and the final OCR result storage unit 23 may be configured by securing regions in one or plural hard disks.
Individual characters/words (nouns/proper nouns) in the printed character portion, and synonyms (words that are similar in meaning), related words, and terms corresponding to fields of the words in the printed character portion, are registered in the registration dictionary 17 as registration dictionary information. Examples of dictionaries of terms corresponding to fields include a business terminology dictionary with respect to phrases such as “your company” and “our company”, a name dictionary with respect to words such as names, and a computer terminology dictionary with respect to “memory” and “CPU”.
The handwritten character portion OCR processing unit 18 includes: a pre-processing unit 180 that conducts pre-processing such as orientation correction and cutting out rectangular regions including characters from the image data one character at a time; an individual character recognition unit 181 that uses the handwritten character OCR dictionary 19 to conduct character recognition processing one character at a time in regard to the rectangular regions cut out by the pre-processing unit 180; and a post-processing unit 182 that uses the registration dictionary 17 to conduct language processing with strings such as word units.
The individual character recognition unit 181 compares the feature data extracted from the cut-out handwritten characters with the feature data of the characters registered in the handwritten character OCR dictionary 19, and outputs the data with the highest similarity as the recognition result of the handwritten characters.
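The feature comparison performed by the individual character recognition unit 181 might look like the following sketch. The four-element feature vectors, the use of cosine similarity, and the name `rank_candidates` are assumptions for illustration only; the description does not state which features or similarity measure are used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_candidates(features, dictionary, top_n=3):
    """Return up to top_n (character, similarity) pairs, best match first."""
    scored = [(ch, cosine(features, ref)) for ch, ref in dictionary.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical stand-in for the handwritten character OCR dictionary 19.
HANDWRITTEN_DICTIONARY = {
    "U": [0.9, 0.1, 0.8, 0.2],
    "V": [0.8, 0.0, 0.6, 0.5],
    "N": [0.2, 0.9, 0.3, 0.7],
}

candidates = rank_candidates([0.85, 0.1, 0.75, 0.25], HANDWRITTEN_DICTIONARY)
```

Returning a ranked list rather than a single character keeps the lower-ranked candidates available for later language processing in word units.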
The handwritten character portion OCR processing unit 18 uses the result of the recognition of the printed character portion by the printed character portion OCR processing unit 13 to conduct character recognition of the handwritten character portion. The following are conceivable for the processing and range of the printed characters used.
Next, the operation of the first embodiment will be described with reference to FIGS. 2 to 5.
The scan document 25 shown in
When the scan document 25 is read by the image input unit 11, the scan document 25 is converted to digital signals and outputted to the printed character portion/handwritten character portion separation processing unit 12.
The printed character portion/handwritten character portion separation processing unit 12 separates the image data of the inputted scan document 25 into printed character image data 26 including the printed character portion 250, as shown in
Next, the printed character portion OCR processing unit 13 references the printed character OCR dictionary 14, conducts character recognition processing with respect to the printed character portion 250 of
Next, as shown in
Next, the handwritten character portion OCR processing unit 18 conducts OCR processing with respect to the handwritten character portion 251 shown in
Table 1 shows a case where plural recognition candidates are indicated with respect to the content of the handwritten character portion 251. Here, “AUTOMATICALLY”, “AVTOMATICALLY”, “AUTOMATICALY” and “AUTONATICALLY” are indicated as candidate words with respect to the characters of the handwritten character portion 251. In this case, the reliability of the OCR processing is calculated for each candidate word. Here, three words have the same reliability of 30%.
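A tie of this kind can be broken by using words recognized in the printed character portion. The following is a minimal sketch under stated assumptions: the reliability values assigned to the individual candidates, the word counts, the additive `boost` weight and the name `rerank` are all hypothetical, and only occurrence frequency is modelled.

```python
def rerank(candidates, printed_word_counts, boost=0.1):
    """Pick the best candidate after rewarding words seen in the printed portion.

    candidates: list of (word, ocr_reliability) pairs.
    printed_word_counts: dict mapping word -> occurrences in the printed portion.
    boost: illustrative weight added per occurrence (an assumption).
    """
    rescored = [
        (word, reliability + boost * printed_word_counts.get(word, 0))
        for word, reliability in candidates
    ]
    return max(rescored, key=lambda pair: pair[1])

# Illustrative reliabilities: three candidates tied at 30%.
candidates = [
    ("AUTOMATICALLY", 0.30),
    ("AVTOMATICALLY", 0.30),
    ("AUTOMATICALY", 0.30),
    ("AUTONATICALLY", 0.10),
]
# Hypothetical counts of words recognized in the printed character portion.
printed_word_counts = {"INSTALLATION": 2, "AUTOMATICALLY": 1, "CD-ROM": 1}

best_word, best_score = rerank(candidates, printed_word_counts)
```

Only a candidate that actually occurs among the registered printed words receives a higher score, so misspelled candidates stay at their original OCR reliability.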
The post-processing unit 182 references the registration dictionary 17 to determine which of “AUTOMATICALLY”, “AVTOMATICALLY”, “AUTOMATICALY” and “AUTONATICALLY” should be selected. The post-processing unit 182 uses the occurrence frequencies of the printed characters and the closeness of the positions with respect to “AUTOMATICALLY” on the scan document 25 to calculate the reliability of each of the plural words. As shown in
Next, when the processing of the handwritten character portion OCR processing unit 18 ends, the OCR result synthesis processing unit 21 reads the OCR processing result with respect to the printed character portion 250 and the OCR processing result with respect to the handwritten character portion 251 from the OCR result storage unit 20, and synthesizes the printed character portion 250 with a printed character portion 252 as shown in
In response to an input operation by the user, the attribute definition unit 31 registers, as attribute definitions in the printed character OCR dictionary 14, item names corresponding to the attributes that the user wants to extract from a document serving as a reading target, such as a fax cover sheet (for example, the destination, the sender and the number of pages), together with heading word groups such as synonyms of the item names.
In the present embodiment, the printed character portion OCR processing unit 13 is configured to also output heading word groups as a word recognition result.
The matching processing unit 32 conducts matching processing of the OCR results resulting from the printed character portion OCR processing unit 13 and the handwritten character portion OCR processing unit 18.
Operation of the Second Embodiment
Next, the operation of the second embodiment will be described with reference to
The user registers, as attribute definitions in the printed character OCR dictionary 14, the attributes the user wants to get out of the fax cover sheet 33 shown in
Next, the fax cover sheet 33 is scanned with a scanner and inputted by the image input unit 11. The printed character portion/handwritten character portion separation processing unit 12 separates the inputted image data of the fax cover sheet 33 into the printed character portions 330 and the handwritten character portions 331 as described in the first embodiment. The printed character portion OCR processing unit 13 references the printed character OCR dictionary 14 and conducts OCR processing of the printed character portions 330, and the handwritten character portion OCR processing unit 18 references the handwritten character OCR dictionary 19 and conducts OCR processing of the handwritten character portions 331.
The matching processing unit 32 conducts matching processing of the OCR results resulting from the printed character portion OCR processing unit 13 and the handwritten character portion OCR processing unit 18. In this processing, the OCR result resulting from the handwritten character portion OCR processing unit 18 is matched with the registered heading word group, and the attribute closest to the entry position is allocated to the OCR result resulting from the handwritten character portion OCR processing unit 18. The position information of the handwritten character portions 331 on the fax cover sheet 33 is also saved. Next, the positions of the printed character portions 330 and the handwritten character portions 331 are matched from the positional relations between the printed character portions 330 and the handwritten character portions 331. In the fax cover sheet 33 of
Finally, the OCR result output unit 22 saves, in the final OCR result storage unit 23, the attributes that have become a group (TO, FROM, etc.), the attribute values (OVERSEAS DIVISION CHIEF, YAMADA, CENTRAL BRANCH OFFICE, COMPANY A, etc.), and the electronic information in which the attributes and attribute values have been printed as the printed character portions 330 and 331.
In the present embodiment, the printed character portion OCR processing unit 13 counts the extracted words, and registers the words with the highest frequency as attributes in the attribute/attribute value extraction result storage unit 41.
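The frequency-based registration of attributes might be sketched as follows. This is an illustrative sketch only: it assumes frequency is measured as the share of documents in which a word appears, and the threshold `min_ratio`, the sample words and the name `extract_attributes` are hypothetical.

```python
from collections import Counter

def extract_attributes(documents, min_ratio=0.8):
    """Return words that appear in at least min_ratio of the documents.

    documents: list of word lists, one per scanned document (printed
    character portion only). min_ratio is an illustrative threshold.
    """
    if not documents:
        return []
    doc_count = Counter()
    for words in documents:
        doc_count.update(set(words))  # count each word once per document
    total = len(documents)
    return sorted(w for w, n in doc_count.items() if n / total >= min_ratio)

# Hypothetical printed words recognized on three sheets of the same form.
docs = [
    ["NAME", "ADDRESS", "AGE", "MEMBERSHIP"],
    ["NAME", "ADDRESS", "AGE"],
    ["NAME", "ADDRESS", "NANE"],  # "NANE" models an occasional OCR misread
]
attributes = extract_attributes(docs)
```

Words that appear on only some of the sheets, including occasional misreads, fall below the threshold and are not registered as attributes.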
Operation of the Third Embodiment
Next, the operation of the third embodiment will be described with reference to FIGS. 9 to 11.
In the membership application 42, a specific printing form is formed by ruled lines with printed character portions 420 resulting from printed characters, and a name and address are entered by hand as handwritten character portions 421 in the printing form. A plural number of sheets in which the names are different are prepared as the membership applications 42.
First, the plural membership applications 42 are inputted to the image input unit 11 by being successively scanned with a scanner. Next, the printed character portion/handwritten character portion separation processing unit 12 separates the image data into the printed character portions 420 and the handwritten character portions 421 as described in the first embodiment. The printed character portion OCR processing unit 13 references the printed character OCR dictionary 14 and conducts OCR processing of the printed character portions 420, and the handwritten character portion OCR processing unit 18 references the handwritten character OCR dictionary 19 and conducts OCR processing of the handwritten character portions 421.
In the processing of the printed character portion OCR processing unit 13, the extracted words are counted, and the words whose ratio with respect to the total number of membership applications 42 is large, i.e., the words whose frequency is high, are registered as attributes in the attribute/attribute value extraction result storage unit 41 as registration content 43, as shown in
Next, the printed character portions 420 and the handwritten character portions 421 are matched by the matching processing unit 32 from the distance between the printed character portions 420 and the handwritten character portions 421 and the positional relations between the printed character portions 420 above, below, right and left of the handwritten character portions 421. Here, the matching follows a rule in which the printed character portions 420 and the handwritten character portions 421 in the same ruled lines, frames and base colors are matched. In order to avoid double association, the printed character portions 420 that have been associated once are excluded from the list. Finally, the attributes and attribute values that have become a group are saved as registration content 44 in the form shown in
In the third embodiment, the membership applications 42 were described as examples of documents, but the present invention is not limited to the membership applications 42 and can also be applied to all documents having the same form and having printed character portions and handwritten character portions.
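The matching of the third embodiment, in which each handwritten character portion is associated with a nearby printed character portion and an associated printed character portion is then excluded from the list, might be sketched as follows. The coordinates, the attribute names, the greedy nearest-neighbour strategy and the name `match_attributes` are assumptions for illustration; the rule concerning shared ruled lines, frames and base colors is not modelled here.

```python
import math

def match_attributes(printed, handwritten):
    """Greedily pair each handwritten entry with the nearest unused attribute.

    printed: list of (attribute, (x, y)) pairs.
    handwritten: list of (text, (x, y)) pairs.
    Returns a dict mapping attribute -> handwritten text.
    """
    available = list(printed)
    pairs = {}
    for text, pos in handwritten:
        if not available:
            break
        nearest = min(available, key=lambda item: math.dist(item[1], pos))
        pairs[nearest[0]] = text
        available.remove(nearest)  # exclude to avoid double association
    return pairs

# Hypothetical layout: attributes on the left, handwritten values to the right.
printed = [("NAME", (10, 10)), ("ADDRESS", (10, 30)), ("AGE", (10, 50))]
handwritten = [
    ("JOHN DOE", (40, 10)),
    ("ANY TOWN, ANY STATE", (40, 30)),
    ("40", (40, 50)),
]
pairs = match_attributes(printed, handwritten)
```

A greedy pass is order-dependent; a globally optimal assignment would be an alternative design, but the greedy form mirrors the exclusion step most directly.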
The present invention is not limited to the preceding embodiments, and may be altered within a range that does not change the gist of the invention. The constituent elements of the various embodiments may also be optionally combined.
Some embodiments of the invention are outlined below.
In one embodiment of the invention, the character recognition apparatus comprises: a separation processing unit that separates, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; a printed character portion recognition processing unit that character-recognizes the printed character portions; and a handwritten character portion recognition processing unit that utilizes the character recognition result of the printed character portions to character-recognize the handwritten character portions.
In another embodiment of the invention, the character recognition apparatus comprises: a separation processing unit that separates, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; a printed character portion recognition processing unit that character-recognizes the printed character portions; a handwritten character portion recognition processing unit that utilizes the character recognition result of the printed character portions to character-recognize the handwritten character portions; and a synthesis processing unit that synthesizes the character recognition result of the printed character portions and the character recognition result of the handwritten character portions.
By synthesizing and outputting the character recognition result of the printed character portions and the character recognition result of the handwritten character portions, data of a document in which printed characters and handwritten characters are mixed can be converted to electronic data.
In another embodiment of the invention, the character recognition apparatus comprises: a separation processing unit that separates, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; a printed character portion recognition processing unit that references a dictionary relating to attributes to character-recognize the printed character portions; a handwritten character portion recognition processing unit that character-recognizes the handwritten character portions; and a matching processing unit that correlates strings in the handwritten character portions corresponding to the attributes of the character recognition result of the printed character portions.
By referencing the dictionary relating to attributes, attributes included in the printed character portions in the data of the document can be recognized, and the handwritten character portions corresponding to the attributes can be matched.
In still another embodiment of the invention, the character recognition apparatus comprises: a separation processing unit that separates, into printed character portions and handwritten character portions, data of plural documents in which printed characters and handwritten characters are mixed; a printed character portion recognition processing unit that character-recognizes the printed character portions of the data of the plural documents and stores, as attributes, strings whose frequency is high; a handwritten character portion recognition processing unit that character-recognizes the handwritten character portions; and a matching processing unit that correlates strings in the handwritten character portions corresponding to the attributes of the character recognition result of the printed character portions.
Even without using a dictionary relating to attributes, strings whose frequency is high in the data of the plural documents may be used as attributes, whereby the handwritten character portions corresponding to the attributes can be matched.
In still another embodiment of the invention, the character recognition method comprises: separating, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; character-recognizing the printed character portions; and utilizing the character recognition result of the printed character portions to character-recognize the handwritten character portions.
In still yet another embodiment of the invention, the character recognition method comprises: separating, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; referencing a dictionary relating to attributes to character-recognize the printed character portions; character-recognizing the handwritten character portions; and correlating strings in the handwritten character portions corresponding to the attributes of the character recognition result of the printed character portions.
In another embodiment of the invention, the character recognition method comprises: separating, into printed character portions and handwritten character portions, data of plural documents in which printed characters and handwritten characters are mixed; character-recognizing the printed character portions of the data of the plural documents and storing, as attributes, strings whose frequency is high; character-recognizing the handwritten character portions; and correlating strings in the handwritten character portions corresponding to the attributes of the character recognition result of the printed character portions.
In another embodiment of the invention, there is provided a recording medium readable by a computer, the recording medium storing a character recognition program executable by the computer to perform a function for recognizing characters, the function comprising: separating, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; character-recognizing the printed character portions; and utilizing the character recognition result of the printed character portions to character-recognize the handwritten character portions.
In yet another embodiment of the invention, there is provided a recording medium readable by a computer, the recording medium storing a character recognition program executable by the computer to perform a function for recognizing characters, the function comprising: separating, into printed character portions and handwritten character portions, data of a document in which printed characters and handwritten characters are mixed; referencing a dictionary relating to attributes to character-recognize the printed character portions; character-recognizing the handwritten character portions; and correlating strings in the handwritten character portions corresponding to the attributes of the character recognition result of the printed character portions.
In still another embodiment of the invention, there is provided a recording medium readable by a computer, the recording medium storing a character recognition program executable by the computer to perform a function for recognizing characters, the function comprising: separating, into printed character portions and handwritten character portions, data of plural documents in which printed characters and handwritten characters are mixed; character-recognizing the printed character portions of the data of the plural documents and storing, as attributes, strings whose frequency is high; character-recognizing the handwritten character portions; and correlating strings in the handwritten character portions corresponding to the attributes of the character recognition result of the printed character portions.
The foregoing description of the embodiments of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to understand the invention for various embodiments and with the various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.
The entire disclosure of Japanese Patent Application No. 2004-273932 filed on Sep. 21, 2004 including specification, claims, drawings and abstract is incorporated herein by reference in its entirety.
Number | Date | Country | Kind |
---|---|---|---|
2004-273932 | Sep 2004 | JP | national |