a shows an example for a layout area;
b shows an example for a layout document;
The present invention may be implemented by a computer system as shown in
The present invention will hereinafter be described in connection with the extraction of a date of birth from a curriculum vitae as shown in
The curriculum vitae is stored in a computer or on a data carrier in electronic form. It may be the result of editing with a word processor, or the electronic document may be the result of a scanning process and a subsequent optical character recognition process. Instead of a curriculum vitae, any document may be used from which an element having a specific meaning or falling into a certain category is to be extracted.
At first the electronic document is analyzed to obtain the individual elements of which it is composed. “Element” here means any sequence of characters which is separated from other elements by a delimiter, such as a blank, a tabulator, an underscore, or any other data element which is to be interpreted as delimiting one element from another. The simplest way of splitting a text into individual elements is to identify as elements those textual parts which are separated from each other by an empty space (a blank). Depending on the purpose of the analysis, however, further criteria may also be taken into account, such as the already mentioned underscore, a hyphen, a carriage return, or other elements of the electronic document which may be regarded as delimiting one element from another. Another criterion which could be taken into account when identifying individual elements is the geometrical distance between individual characters. For example, a threshold value could be defined beyond which the distance between two characters is interpreted to mean that the two characters belong to different elements. In the present example we assume that an element is any single character or sequence of characters separated from other “elements” by a blank.
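By way of illustration only, the splitting of a text into elements at delimiters may be sketched as follows. The function name `split_into_elements` and the default delimiter set (blank, tabulator, underscore) are assumptions chosen for this sketch and are not prescribed by the method.

```python
import re

def split_into_elements(text, delimiters=" \t_"):
    """Split text into elements at any run of the given delimiter characters."""
    # Build a character class from the delimiters; re.escape guards special characters.
    pattern = "[" + re.escape(delimiters) + "]+"
    # Drop empty strings that arise from leading or trailing delimiters.
    return [element for element in re.split(pattern, text) if element]

elements = split_into_elements("date of birth: May 5, 1960")
print(elements)
```

With only a blank as delimiter, as assumed in the present example, punctuation such as “:” or “,” remains attached to the neighbouring element.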
In the present example of a text document as shown in
Apart from obtaining the elements themselves there is obtained their corresponding position in the document, e.g. by calculating the x- and y-coordinates where each element is located in the document. The position will later be used for generating the layout document.
After having identified the individual elements of the electronic text document, those elements are stored in a so-called “working document”. In the working document each element which has been identified is stored together with information about its position in the electronic document. For example, the element “curriculum” may be stored together with its x- and y-coordinates identifying its position in the electronic document. The working document is a convenient tool for storing all elements which have been identified together with their corresponding position so that for the generation of the layout document which is explained later in detail reference can be made to the working document. An example of a working document generated from any text document is shown in
The position of an element may represent for example the center of gravity of an element calculated based on its individual pixel values, or it may represent any other geometrical information representing the location of the element. For example, a box may be constructed surrounding the element, and the average between the maximum and minimum x-coordinates of the box may be taken as the x-coordinate for the position, and the average of the maximum and the minimum y-position of the box may be used as the y-coordinate of the element when representing its position in the text through a corresponding tag in the working document.
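The box-based variant described above may be sketched as follows. The `Box` type and its field names are hypothetical and serve only to make the averaging of the minimum and maximum coordinates explicit.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Hypothetical bounding box surrounding one element."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def box_center(box):
    """Return the element position as the midpoint of its bounding box."""
    x = (box.x_min + box.x_max) / 2
    y = (box.y_min + box.y_max) / 2
    return x, y

print(box_center(Box(x_min=10, y_min=20, x_max=30, y_max=60)))
```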
The working document contains a list of the identified elements together with tags indicating their respective positions and possibly also further information as mentioned before, such as the fonts of the elements, their style, whether they are underlined or not, etcetera.
In this way the working document is created as a list of the individual elements of the electronic text document together with their corresponding positions and possibly also other information. Non-textual elements may also be incorporated into the working document, such as horizontal or vertical lines or grids contained in the electronic document, which are then also stored in the working document in a form representing their position and their shape (horizontal, vertical, line, grid, or the like) according to a coding scheme. E.g. a horizontal line may be represented in a working document by the character sequence AAAA and a vertical line by the character sequence BBBB, each then followed by a tag indicating the position of the line.
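One possible textual form of such a working document is sketched below. The codes AAAA and BBBB are taken from the description; the tag syntax `<x=.. y=..>` and the function name `make_working_document` are assumptions made for illustration, since the actual tag format is not specified here.

```python
def make_working_document(text_elements, line_elements=()):
    """Return the working document as a list of text lines.

    text_elements: iterable of (text, x, y)
    line_elements: iterable of (orientation, x, y), where orientation is
                   "horizontal" or "vertical"
    The "<x=.. y=..>" position tag is a hypothetical placeholder syntax.
    """
    shape_codes = {"horizontal": "AAAA", "vertical": "BBBB"}
    lines = [f"{text} <x={x} y={y}>" for text, x, y in text_elements]
    lines += [f"{shape_codes[orientation]} <x={x} y={y}>"
              for orientation, x, y in line_elements]
    return lines

doc = make_working_document(
    [("curriculum", 120, 40), ("vitae", 210, 40)],
    [("horizontal", 300, 60)],
)
print("\n".join(doc))
```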
The so created working document may then be used for identifying candidate elements which could possibly be the element to be extracted. For that purpose the working document (or possibly also the “source document” based on which the working document is generated) is parsed to identify those elements which meet a certain search criterion, such as a format criterion. In this step of extracting a candidate all elements are analyzed to find possible candidates for the desired elements to be extracted.
Preferably not only individual elements are searched but also combinations of elements so that the method can cope with spaces between the individual elements. For example, when searching for a banking account number which is presumed to have eight digits, a search may be carried out for a number which has eight digits which may either be represented as “99999999” or as “999 999 99” or as “9 9 9 9 9 9 9 9”, or any other combination. Searching for such a banking account number may therefore for example be carried out by searching for a number having eight digits. Depending on the informational content which the element to be extracted should have, another format may be used as the search criterion. Possible search criteria are searching for regular expressions (such as a format search searching for a certain format, like a character string, a sequence of numbers, possibly also requiring a certain total number of digits), or the like. Another search criterion could be that a search is performed for a simple predefined element, by carrying out a string comparison. For example a search may be performed for the word “birth”, and each element meeting that search criterion would then show up as a candidate.
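A format search for the eight-digit account number of the above example, tolerant of single blanks between digits, may be sketched with a regular expression as follows. The pattern and the function name `find_account_candidates` are illustrative assumptions. The sketch rejects an unbroken run of more than eight digits, but it does not exclude an eight-digit group that is followed by a blank and further digits.

```python
import re

# One digit followed by seven more digits, each optionally preceded by one blank.
# The lookarounds prevent matching inside a longer unbroken run of digits.
EIGHT_DIGITS = re.compile(r"(?<!\d)\d(?: ?\d){7}(?!\d)")

def find_account_candidates(text):
    """Return (start offset, matched text) for every eight-digit candidate."""
    return [(m.start(), m.group()) for m in EIGHT_DIGITS.finditer(text)]

sample = "99999999 / 999 999 99 / 9 9 9 9 9 9 9 9"
for offset, candidate in find_account_candidates(sample):
    print(offset, repr(candidate))
```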
Another possible search criterion is a so-called designator search, in which an element is searched for which is at a certain position (left/right/above/below) with respect to a candidate found by another search criterion. For example, if one search criterion is to search for the word “birth”, then a designator search could be performed for the element located to the right of the element “birth”, and in this case the resulting candidate would be that element. In the example of
Another search criterion could be to carry out a search for all elements which are also present in a database.
The search for candidates preferably is fault tolerant in the way that prefixes/suffixes can be ignored, in order to ignore typical errors from optical character recognition, or to be able to ignore such elements like “,” and “:”. For example, in the case of
Depending on the manner in which the candidate search is performed, more or fewer candidates for the elements to be extracted are identified.
Other search methods could for example include a trigram search, which means that combinations of three characters are searched for. This is also a method of carrying out a fault tolerant search: if for example a misspelling occurs in a candidate, a trigram search could nevertheless find that candidate, since several character sequences contained in the candidate would still be recognized as correct trigrams. Another fault tolerant search method would be to use the Levenshtein distance, which is the minimum number of single-character insertions, deletions, or substitutions (loosely, the number of key strokes on a keyboard) necessary to change one character sequence into another one. A fault tolerant search could also be performed based on the Levenshtein distance.
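Both fault tolerant measures may be sketched as follows. The trigram comparison is shown here as a set overlap (Jaccard) score, which is one common choice and an assumption of this sketch; the Levenshtein distance is computed with the standard dynamic programming recurrence. The function names are illustrative.

```python
def trigrams(s):
    """Return the set of all three-character substrings of s (lower-cased)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    """Share of trigrams common to a and b (Jaccard index), between 0 and 1."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(trigram_similarity("curriculum", "curiculum"))
print(levenshtein("birth", "birht"))
```

A candidate could then be accepted if its similarity exceeds, or its distance falls below, a threshold chosen for the application.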
Preferably the candidate search is performed by searching the working document for elements which match the search criterion used. Thereby the analysis of the document into elements which has already been carried out can be reused. In principle, however, a search for candidates can also be carried out directly on the text document.
The search is directed to obtaining candidate elements which could possibly contain the information which is searched for. It is readily apparent that the search criteria have to be adapted to the information which is searched for. If an account number is searched for, then preferably a format criterion is used which makes use of the possibly known number format of the account number. If, on the contrary, a place of birth is searched for, then searching for character strings is more promising than searching for numbers. The adaptation of the search criteria (format search, word search, database search, designator search, etc., or a combination of them) to the particular piece of information which is searched for can be chosen by the skilled person depending on the particular circumstances.
If the found candidates are to be used in a training procedure for a classifying apparatus, as will be described later in more detail, then it is preferable that they are indicated or displayed to the user and that the user is then able to confirm whether the found candidates match the searched information or not. The classifying apparatus can then be trained as will be explained later. The candidates can be displayed e.g. by highlighting them in the searched text document, and the user can then confirm or discard them e.g. by a mouse click.
The format search or fault tolerant element search provides candidates for elements to be extracted. The result of the candidate search is already quite good in terms of correctness since it is based on inherent properties of the elements which are searched, such as their format or their actual informational content. The candidates then can however be further evaluated with respect to whether they belong to a certain category or not by taking into account elements other than the candidates as well, as will be explained in the following.
For each of the candidates then there is created a so-called layout document containing not only a representation of the candidate and its position in the electronic document, but also of other elements surrounding said candidate element and their corresponding position. Therefore the layout document is an electronic representation of the candidate and its position in the electronic document itself, as well as of other elements in the electronic document and their corresponding position. Preferably a layout document generated for a certain candidate is generated for a certain area surrounding said candidate. This area (or a corresponding plurality of areas) can either be predefined or they may be defined by a user.
An example for the definition of such a surrounding area through a user interface is shown in
For generating the layout document all the elements which with respect to their position in the electronic document fall into the boxes defining the area of the layout document are taken into account for generating the layout document. For that purpose reference can be made to the working document in which all elements are stored together with their corresponding positions.
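The selection of the elements falling into the boxes of the layout area may be sketched as follows. It is assumed here, for illustration only, that each working-document entry is a tuple `(text, x, y)` and that each box is given as `(x_min, y_min, x_max, y_max)`; the function name `elements_in_area` is likewise hypothetical.

```python
def elements_in_area(elements, boxes):
    """Return the elements whose position lies inside at least one box.

    elements: iterable of (text, x, y) as stored in the working document
    boxes:    iterable of (x_min, y_min, x_max, y_max) defining the layout area
    """
    selected = []
    for text, x, y in elements:
        if any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes):
            selected.append((text, x, y))
    return selected

working = [("date", 50, 200), ("of", 90, 200), ("birth:", 130, 200),
           ("May 5, 1960", 260, 200), ("Munich", 260, 600)]
area = [(0, 150, 400, 250)]
print(elements_in_area(working, area))
```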
In the following it is assumed that the process of obtaining candidate elements has returned the element May 5, 1960 of the document of
After one or more candidates have been obtained through a search procedure as explained above, a layout document is created for each of the candidates which is a representation of the candidate as well as of its surrounding area. To create the layout document, the elements which lie within the area to be used for the generation of the layout document are first identified, and the layout document is then created based on these elements. It contains a representation of the candidate as well as of the elements lying within this area, together with the corresponding positions of those elements.
a shows an example for a layout area in case of the text document of
An example for the layout document generated for the candidate “May 5, 1960” and the corresponding layout area as shown in the example of
The first line of the layout document shown in
To further explain how the position of the candidate element in the electronic document is represented in the layout document through the character sequence “MXMYWLHM”, reference is made to
Therefore, as can be seen from the first line of the layout document shown in
Similarly to the width also the height of the candidate box is coded by one of the sequences “HS”, “HN”, “HL”, or by “HX”. For the present case of
The position of the candidate box in X- and Y-direction is coded as schematically illustrated in the lefthand part of
In the present case of
It is to be understood that the coding shown in
Similarly, the coding sequences used herein are just arbitrary, here “LL” means just “to the very left”, “MX” means “rather in the middle in X-direction”, and “RR” means “at the very right of the document (in X-position)”. Similarly, “TT” means “at the very top”, “MY” means “rather in the middle”, and “BB” means “at the very bottom of the document with respect to Y-direction”. However, other coding sequences could be used as well as will be recognized by the skilled person. Also, instead of DDMMYY other character sequences could be used to represent the recognized format of a “date”.
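A position coding of this kind may be sketched as follows. The codes “LL”, “MX”, “RR”, “TT”, “MY”, and “BB” are those named above; the partition of the page into three equal zones per direction, the assumption that the y-coordinate grows downwards from the top of the page, and the function name `position_code` are simplifications made for this sketch only, and the scheme actually used may distinguish more zones.

```python
def position_code(x, y, page_width, page_height):
    """Map a coordinate pair to a character code such as "MXMY".

    Assumes three equal zones per direction and y measured from the top.
    """
    def zone(value, extent, codes):
        third = extent / 3
        if value < third:
            return codes[0]
        if value < 2 * third:
            return codes[1]
        return codes[2]

    return (zone(x, page_width, ("LL", "MX", "RR"))
            + zone(y, page_height, ("TT", "MY", "BB")))

print(position_code(300, 400, page_width=600, page_height=800))
```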
After having coded the candidate box as explained above the other elements which fall into the area of the layout document as explained with respect to
The layout document as shown in
The second line of the layout document of
The second and the third line of the layout document shown in
For coding the relative position any coding scheme can be used, the particular one used herein is schematically illustrated in
From
Since the number 8125 is horizontally equal but near above the candidate this leads to the third line in
The remaining three elements “date”, “of”, and “birth:” falling into the layout area are represented in the last three lines of the layout document of
It will be readily apparent that instead of the relative position coding also absolute positions of the elements within the layout area could be used for generating the layout document.
Furthermore, when generating the layout document it is also possible to code other elements whose format is recognizable, and not only elements having the format of a “date”, by a corresponding coding sequence in the layout document. While this is shown here only for the date in the first line and the integers in the second and third lines of the layout document, such a replacement can also be made for other recognizable elements such as postal ZIP codes, which could be represented by a certain character sequence such as ZZZ, or the like. The corresponding recognition can be based either on a format recognition or on a database query (where e.g. all postal ZIP codes are stored).
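The replacement of recognizable elements by coding sequences may be sketched as follows. The codes DDMMYY and ZZZ are those mentioned in the description. The date pattern (English month name, day, four-digit year), the use of an in-memory set in place of a database of ZIP codes, and the function name `format_code` are assumptions of this sketch.

```python
import re

# Hypothetical date format: "May 5, 1960", "Feb. 12, 2000", "January 1, 1999".
DATE = re.compile(
    r"^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}$"
)

def format_code(element, zip_codes=frozenset()):
    """Return a coding sequence for a recognized format, else the element itself."""
    if DATE.match(element):
        return "DDMMYY"   # recognized by format
    if element in zip_codes:
        return "ZZZ"      # recognized by lookup (stands in for a database query)
    return element

print(format_code("May 5, 1960"))
print(format_code("81925", zip_codes={"81925"}))
print(format_code("birth:"))
```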
As explained above, a layout document is generated which contains information about the candidate itself, its position in the document, and furthermore information about other elements of the document and their position in the document. In the present example the position information is represented by replacing coordinate values with character sequences which represent a position according to a certain coding scheme. The coding scheme defines the locations or areas into which the electronic document is partitioned for coding purposes and assigns a corresponding character code to each of them. Similarly, number codes can be used as well for coding the positions of the elements of said electronic document. Any coding scheme which represents the position and/or the format of the elements can be used for the generation of the layout document.
The layout document may also contain additional information about non-textual elements of the document to be analyzed, such as lines, or grids in the document. This information also can be easily obtained through a geometrical analysis of the document, and then lines or grids present in a document can be coded in the layout document through corresponding coding sequences, preferably also by representing their corresponding position, possibly also their style and further information.
Preferably the coding scheme used for the generation of the layout document contains a position coding based on having assigned discrete areas of location corresponding position codes as explained before. Further preferably style or format information which can be recognized such as the format or style of elements also is represented in the layout document through corresponding coding sequences. It is, however, possible to use only some of those elements of a coding scheme to generate a layout document.
The position indicated in the layout document may be a representation of the geometrical position based on coordinate values, such as the x and y coordinate values explained before. It is, however, also possible that the position information for an element in the layout document represents the relative position between the candidate and this element, such as the number of elements occurring between this element and the candidate. Thereby it also becomes possible to code the relative position between the candidate and other elements in the layout area through the distance between them through the number of words occurring between them. Such a coding scheme could e.g. be useful if the text document to be processed actually has not much of an own layout, such as an e mail message. Alternatively, however, for an e-mail a virtual layout could be calculated and used for the further processing instead of the relative position of the elements as explained before.
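The word-count variant of the relative position may be sketched as follows, assuming that the elements are available in reading order as a simple list and that the candidate is identified by its index; both are assumptions of this sketch, as is the function name `elements_between`.

```python
def elements_between(elements, candidate_index):
    """For every other element, count the elements lying between it and the candidate."""
    return [(text, abs(i - candidate_index) - 1)
            for i, text in enumerate(elements)
            if i != candidate_index]

words = ["date", "of", "birth:", "May 5, 1960", "place"]
print(elements_between(words, candidate_index=3))
```

A count of 0 denotes a direct neighbour of the candidate. The sign of the index difference could additionally be kept if the direction (before or after the candidate) is to be coded.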
The more information about the candidate and its surrounding elements is present in the layout document, the more accurate the layout document and the subsequent processing result can be. However, the more sophisticated the layout document is, the more processing power is needed to create it and to further process it to make the decision. Therefore, depending on the desired accuracy of the decision procedure, the user or a programmer may choose the area for generating the layout document as well as the information to be used when generating it.
Hereinbefore the obtaining of candidates and the subsequent generation of a layout document for a candidate have been explained. If it is now desired for example that a certain piece of information, namely the date of birth is to be extracted from the document of
Such a recognition becomes possible since the layout document generated from a date of birth contains further hints which make it possible to recognize them as the layout documents from dates of birth rather than any other dates. It is e.g. often the case that the word “birth” occurs in the neighbourhood of the date of birth, and by having a layout document where this term is included there is a further hint that this is the layout document generated from a date of birth. Similarly, other elements occurring in the neighbourhood of the date of birth may also be interpreted as a hint, like the term “place” or the term “of” as is the case in the example of
Of course, the layout document can also directly be generated for all elements of a text document and then each element can be evaluated based on the so generated layout document as to whether it belongs to a certain desired category or not. However using a candidate search first reduces the computational costs which would arise if a layout document would have to be generated for each element of the text document.
In the following the extraction process and the training process using a classifying apparatus will be explained in more detail.
After the layout document has been generated, it may be used for training a neural network or any other computerized system which can decide whether a certain document belongs to a certain category or class or not. For that purpose, the layout documents of candidates are input to the neural network or any other decision apparatus (classifying apparatus) together with the information whether the layout document corresponds to a correct candidate or not, which means whether the candidate has the desired informational content or not.
A training of such a neural network is schematically illustrated in
An electronic document is analyzed as explained above to obtain the elements of a text document and their corresponding positions. Preferably a text-based document, the working document, is then created therefrom. Then a filtering is performed to obtain a set of candidates which could possibly match a desired category. Preferably the obtained set is corrected, either based on a manual input by the user or automatically, e.g. by checking whether an obtained candidate has a probability of correctness beyond a certain threshold. For a manual correction in the training phase the candidates can be highlighted in the document, and the user can then confirm for some or all of them whether they are correct or not. The aforementioned manual or automatic selection of correct results then leads to a set of correct results and to a set of wrong results. Layout documents are then generated for each of the elements of the set of correct results and for each of the elements of the set of wrong results. Thereafter the layout documents generated for the set of wrong results and the ones generated for the set of correct results are used to train the neural network. If no candidate is recognized at all, then the user may also choose a candidate himself, highlight it (e.g. by the mouse), and use it as a training input.
An extraction process using a network which has been trained as shown in
An output of the network may consist in the correctly extracted candidates, or e.g. also in a weight weighing the probability of correctness for each candidate. The extracted candidates may also directly be imported or exported into another electronic document, such as a database, an MS-Excel file, a table, a Word document or any other document suitable for further electronic processing, or the like.
The extraction process including the identification of the candidates and the generation of the layout document can be carried out as explained in detail above. For all found candidates then the corresponding generated layout document is input to a classifying or decision apparatus not necessarily though preferably being a neural network, and then for each candidate a decision is made whether it belongs to the correct category or not.
A particularly suitable apparatus for classifying the generated layout documents as to whether they belong to the desired category or not is disclosed in European patent application 99108354.4, the whole content of which is incorporated herein by reference. The apparatus disclosed therein is able to classify text documents by representing them as vectors, where the values of the vector components correspond to the frequency with which a certain word or term occurs in the document. Such a vector representing a document lies in an n-dimensional vector space, and several documents together also span a certain vector space. The classification is performed by calculating a hyperplane which separates the vector space into at least two sub spaces; thereby a classification into as many classes as there are sub spaces can be performed. A learning or training process consists in building up the vector space and the corresponding separating hyperplane for a set of training documents, and an unknown document can then be classified by calculating whether the corresponding vector falls into one or the other sub space. Since with the method described hereinbefore in detail it is possible to represent elements of a text document through a layout document which gives hints about their surrounding areas, and since the layout document itself again is a text document, the classifying apparatus disclosed in the aforementioned European patent application can be used for classifying purposes. A preferable implementation of the apparatus for classification disclosed in the patent application consists in a neural network, such as a Perceptron. Further details as to how the decision apparatus may be implemented can be taken from this application and will therefore not be outlined in further detail herein.
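The principle of classifying layout documents as term-frequency vectors separated by a hyperplane may be sketched as follows. This is a textbook perceptron and is not the apparatus of the cited European patent application; the layout-document strings, the labels (+1 for a correct candidate, -1 for a wrong one), and all function names are invented for illustration.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Represent a text document by the frequency of each vocabulary term."""
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]

def train_perceptron(vectors, labels, epochs=20):
    """Learn weights and bias of a separating hyperplane (labels are +1 or -1)."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            activation = sum(w * x for w, x in zip(weights, vector)) + bias
            if label * activation <= 0:  # misclassified or on the hyperplane
                weights = [w + label * x for w, x in zip(weights, vector)]
                bias += label
    return weights, bias

def classify(vector, weights, bias):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(w * x for w, x in zip(weights, vector)) + bias > 0 else -1

# Hypothetical layout documents, reduced to plain term sequences.
training = [("DDMMYY date of birth:", 1), ("DDMMYY birth: date", 1),
            ("DDMMYY from to", -1), ("DDMMYY since", -1)]
vocabulary = sorted({term for doc, _ in training for term in doc.split()})
vectors = [term_vector(doc, vocabulary) for doc, _ in training]
labels = [label for _, label in training]
weights, bias = train_perceptron(vectors, labels)

unknown = term_vector("DDMMYY date of birth: place", vocabulary)
print(classify(unknown, weights, bias))
```

Terms of an unknown document which do not occur in the training vocabulary, such as “place” above, are simply ignored by this sketch.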
However, it is to be understood that any other neural network or any computer method or apparatus which is capable of evaluating (classifying) documents with respect to whether they belong to a certain category or not can be trained with layout documents and then used for making the decision whether a candidate (or its corresponding layout document) has to be regarded as correctly extracted or not. It should further be understood that any other layout document representation can also be used in connection with the present invention, not only those layout documents where the positions are represented by character sequences. It is for example also very well possible that the positions are coded by absolute numbers representing the positions (coordinates) or by angles and distances (polar coordinates).
It will be understood by the skilled person that the aforementioned detailed description only illustrates an exemplary embodiment of the present invention, other embodiments being well within the reach of the general knowledge of the skilled person. It is further readily apparent to the skilled person that the method of the present invention may be implemented by any computer system, by any general purpose computer, or by any other dedicated hardware carrying out a method as explained before. An apparatus according to the present invention therefore may consist in any computer system carrying out the method of the present invention, whereas the apparatus may for example consist in a computer system as shown in
Moreover, a data structure representing the structure of a layout document as described can also form an embodiment of the present invention, independent whether it is incorporated or embodied on a storage medium, a data carrier, a transmission line, a memory such as a ROM, a RAM, or the like.
Furthermore, the present invention may be used in a client server architecture, which means that parts of a computer program implementing the present invention may be executed at a server and other parts may be executed at a client.
As far as apparatus components are mentioned in the description before or in the appended claims, they may be realized either by a computer carrying out a computer program or certain program instructions, or they may be implemented by any dedicated hardware performing the function of that component, such as an electronic circuit, a special purpose computer, or the like.
Further modifications and applications of the present invention will be apparent to the skilled reader, and it will be understood that the present application has been explained by means of exemplary embodiments which are not to be understood as limiting the scope of the present invention.
In particular, it is to be understood that the extraction of a date of birth is merely an example, and the method explained hereinbefore can be used for extracting from a text document any information element which belongs to a certain category, as will be readily apparent to the skilled reader.
| Number | Date | Country | Kind |
|---|---|---|---|
| 00103810.8 | Feb 2000 | EP | regional |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/EP01/01132 | 2/2/2001 | WO | 00 | 8/27/2007 |