The proposed technical solution relates to pattern recognition and, in particular, to preprocessing of a document in electronic form, performed before text recognition (or instead of recognition).
The proposed technical solution makes it possible to extract, from a vector/raster image of a document (for example, a file in PDF format), information about the content and formatting that is sufficient to restore the document later in its original or near-original form in any known editable format.
A method of extracting text information from an electronic image file in vector/raster format is known in the art. This method is used by the manufacturer of tools for producing documents in vector/raster format (PDF format): “Acrobat and PDF Library API Reference”, Jan. 7, 2005, Adobe Solutions Network, 3603 p.
The disadvantage of this method is that it extracts only text information and does not retain information about the formatting of the document.
The above method is taken as a prototype.
The technical result consists in broadening the capabilities of recognizing a document from an electronic image file in vector/raster format, increasing the reliability of obtaining text, raster, and vector objects, extracting information about the formatting of the document, and accelerating the processing.
The known method does not allow achieving the described technical result.
The stated technical result is achieved by performing the following sequence of steps: fragmenting the image in order to obtain regions containing non-separable, logically connected fragments of text of the maximum possible size; processing text objects; processing vector objects; processing raster objects; discarding redundant and excessive information; processing objects other than text, raster, or vector objects using the methods of raster object processing; analyzing each object with the help of all available information that has been obtained as a result of the processing of other objects.
Acceleration of the processing is achieved, among other things, by excluding or reducing some commonly performed operations.
For example, in many cases the need to recognize raster text is partially or completely eliminated.
The essence of the method of preprocessing text information on the basis of the information about a vector/raster image in electronic form is as follows.
During the preprocessing (prior to character recognition), the following operations are performed using the attributes of the file formatting which are available in the vector-raster image file.
The step of analyzing and uniting (assembling) character groups into lines includes at least the following steps:
a) determining the text orientation;
b) detecting text written as a superscript;
c) detecting text written as a subscript;
d) detecting text of dropped capitals.
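Steps b) and c) can be illustrated with a minimal sketch. It assumes that each character is available with its baseline position and font size, and that the y axis points upward as in PDF user space. The `Glyph` type, the function name, and both thresholds are hypothetical and chosen only for illustration; the source does not specify how superscripts and subscripts are detected.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    char: str
    baseline: float  # y coordinate of the baseline (y grows upward)
    size: float      # font size in points


def classify_scripts(glyphs, size_ratio=0.8, shift_ratio=0.15):
    """Label each glyph of one line as normal, superscript, or subscript.

    Assumed heuristic: a glyph is a script if it is noticeably smaller than
    the dominant font size of the line and its baseline is shifted away from
    the dominant baseline.
    """
    line_baseline = median(g.baseline for g in glyphs)
    line_size = median(g.size for g in glyphs)
    labels = []
    for g in glyphs:
        shift = g.baseline - line_baseline
        is_small = g.size <= line_size * size_ratio
        if is_small and shift >= line_size * shift_ratio:
            labels.append("superscript")
        elif is_small and shift <= -line_size * shift_ratio:
            labels.append("subscript")
        else:
            labels.append("normal")
    return labels
```

Using the median makes the estimate of the dominant baseline and size insensitive to a few raised or lowered characters in the line.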
After assembly, each line is divided into words on the basis of the positions of blank spaces, if any, and, where there are no blank spaces, on the basis of an analysis of inter-character intervals.
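The division into words can be sketched as follows. The sketch assumes that characters arrive sorted from left to right as `(char, x0, x1)` tuples, and that a gap larger than a fraction of the median character width marks a word boundary. The function name and the `gap_ratio` value are assumptions made for illustration, not values taken from the source.

```python
from statistics import median


def split_into_words(glyphs, gap_ratio=0.3):
    """Split one line into words.

    glyphs: list of (char, x0, x1) tuples sorted left to right.
    Explicit space characters always end a word; where none are present,
    a horizontal gap wider than gap_ratio * median glyph width does.
    """
    visible = [g for g in glyphs if not g[0].isspace()]
    if not visible:
        return []
    threshold = gap_ratio * median(x1 - x0 for _, x0, x1 in visible)

    words, current, prev_x1 = [], "", None
    for char, x0, x1 in glyphs:
        if char.isspace():
            if current:
                words.append(current)
                current = ""
            prev_x1 = None
            continue
        # No explicit space: fall back to the inter-character interval.
        if current and prev_x1 is not None and x0 - prev_x1 > threshold:
            words.append(current)
            current = ""
        current += char
        prev_x1 = x1
    if current:
        words.append(current)
    return words
```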
Vector objects are processed. Processing of vector objects includes at least the step of identifying separators, background, and substrates of blocks.
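One possible way to tell separators, background, and block substrates apart is sketched below for rectangular vector objects. All thresholds, the function names, and the restriction to rectangles are assumptions; the source names only the three categories, not the criteria.

```python
def contains(outer, inner):
    """True if rectangle inner lies entirely inside rectangle outer."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def classify_vector_rect(rect, page_size, text_boxes, thin=2.0, page_cover=0.9):
    """Classify a rectangle (x0, y0, x1, y1) drawn on a page.

    Assumed rules:
      separator  - very thin and comparatively long (a rule or line)
      background - covers almost the whole page
      substrate  - a filled area lying under at least one text block
    """
    x0, y0, x1, y1 = rect
    width, height = x1 - x0, y1 - y0
    page_w, page_h = page_size
    if min(width, height) <= thin and max(width, height) > 10 * thin:
        return "separator"
    if width * height >= page_cover * page_w * page_h:
        return "background"
    if any(contains(rect, box) for box in text_boxes):
        return "substrate"
    return "other"
```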
Raster objects are processed. Processing of raster objects includes at least the steps of analyzing non-text objects in order to detect text images within them, and detecting vector objects other than separators, including those partially located outside the borders of the object.
Redundant and excessive information is discarded. Discarded redundant and excessive information includes at least the information about the shading of characters, about unnecessary attributes, and some other information depending on the peculiarities of the document.
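As an illustration of discarding character shading, the sketch below assumes that a shaded character appears in the file as the same character drawn twice with a small offset, and keeps only the first copy. This representation of shading, the function name, and the `max_offset` value are assumptions, not details given in the source.

```python
def drop_shadow_glyphs(glyphs, max_offset=1.5):
    """Remove duplicate glyphs that only render a shadow.

    glyphs: list of (char, x, y) tuples in drawing order.
    A glyph is treated as a shadow copy if an identical character has
    already been kept within max_offset units in both x and y.
    """
    kept = []
    for char, x, y in glyphs:
        is_shadow = any(
            c == char and abs(x - kx) <= max_offset and abs(y - ky) <= max_offset
            for c, kx, ky in kept
        )
        if not is_shadow:
            kept.append((char, x, y))
    return kept
```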
The program processes objects other than text, raster, or vector objects using the methods of raster objects processing.
Each object is additionally analyzed with the help of all available information that has been obtained as a result of the processing of other objects. If, according to the results of the primary processing of an object, the program has obtained some information which can affect other objects, repeated analysis of these other objects is performed.
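The repeated analysis can be sketched as a simple worklist loop. The sketch assumes that analyzing an object returns a dictionary of facts, that facts are shared through a common context, and that a pass limit guards against endless re-analysis. The function name, the `analyze` callback, and `max_passes` are hypothetical.

```python
from collections import deque


def analyze_until_stable(objects, analyze, max_passes=10):
    """Re-analyze objects until no analysis yields new information.

    objects: list of object names.
    analyze: callable (name, context) -> dict of facts about the document.
    Returns the final context and the number of passes made per object.
    """
    context = {}
    queue = deque(objects)
    passes = {name: 0 for name in objects}
    while queue:
        name = queue.popleft()
        if passes[name] >= max_passes:
            continue
        passes[name] += 1
        facts = analyze(name, context)
        changed = {k: v for k, v in facts.items() if context.get(k) != v}
        if changed:
            context.update(changed)
            # New information may affect other objects: schedule them again.
            for other in objects:
                if other != name and other not in queue:
                    queue.append(other)
    return context, passes
```

For instance, a caption block might only be recognized as a caption once a neighbouring table region has been identified, so it is analyzed a second time after the table.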
After dividing an object into lines and words, the program checks whether the encoding of the characters is correct and corrects it if necessary. To determine whether the encoding is correct, the text is analyzed and the following are checked: the correspondence of the letters of the text to the alphabet of the given language, and the correspondence of the words of the text to the dictionary of the given language.
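The two checks can be sketched as a scoring function combined with a trial re-decoding. The example assumes Russian text whose Windows-1251 bytes were mistakenly read as Latin-1; the candidate encoding pair, the equal weighting of the two checks, the threshold, and the function names are all assumptions made for illustration.

```python
RUSSIAN_ALPHABET = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")


def plausibility(text, alphabet, dictionary):
    """Average of two shares: letters in the alphabet, words in the dictionary."""
    lowered = text.lower()
    letters = [c for c in lowered if c.isalpha()]
    words = lowered.split()
    if not letters or not words:
        return 0.0
    in_alphabet = sum(c in alphabet for c in letters) / len(letters)
    in_dictionary = sum(w in dictionary for w in words) / len(words)
    return (in_alphabet + in_dictionary) / 2


def fix_encoding(text, alphabet, dictionary,
                 candidates=(("latin-1", "cp1251"),), threshold=0.5):
    """Return text unchanged if it looks correct, else the best re-decoding."""
    best, best_score = text, plausibility(text, alphabet, dictionary)
    if best_score >= threshold:
        return text
    for wrong, right in candidates:
        try:
            attempt = text.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the observed characters
        score = plausibility(attempt, alphabet, dictionary)
        if score > best_score:
            best, best_score = attempt, score
    return best
```

A real dictionary would be far larger than the one used in a quick check, and punctuation would need to be stripped from words before the lookup.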
If the program fails to extract the text with the help of other known methods, the text block is passed on for recognition.
Number | Date | Country | Kind |
---|---|---|---|
2005138164 A1 | Dec 2005 | RU | national |