Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to being prior art by inclusion in this section.
The subject matter in general relates to generating text features. More particularly, but not exclusively, the subject matter relates to classifying text in a document by generating text features.
Millions of documents are produced every day that are reviewed, processed, stored, audited, and transformed into computer-readable data. Examples include educational forms, financial statements, government documents, human resource records, insurance claims, and legal papers, among many others. Documents typically comprise text segments, such as headers, footers, headings, sub-headings and topics, among others. Such documents may be processed to identify the text segments and classify them.
Typically, each text segment may be encapsulated by a bounding block. Features for use by classifiers may then be generated based on the font, size, and context of tokens relative to other tokens within the segment.
Such a conventional approach to feature generation has been observed to produce outcomes that may not be as desired in several scenarios.
In view of the foregoing discussion, there is a need for an improved technical solution for generating features from a document.
In an aspect, a method of generating text features from a document is provided. The method may be carried out by one or more processors. The method comprises grouping text in the document into multiple logical text blocks comprising one or more tokens. The processor may then select one of the logical text blocks for generating features and may further identify the logical text blocks neighbouring the selected logical block. The processor may qualify one or more of the neighbouring logical text blocks for generating features. Features are generated for the tokens in the selected logical block using the qualified logical text blocks.
This disclosure is illustrated by way of example and not limitation in the accompanying figures. Elements illustrated in the figures are not necessarily drawn to scale, in which like references indicate similar elements and in which:
The following detailed description includes references to the accompanying drawings, which form part of the detailed description. The drawings show illustrations in accordance with example embodiments. These example embodiments are described in enough detail to enable those skilled in the art to practice the present subject matter. However, it may be apparent to one with ordinary skill in the art that the present invention may be practised without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. The embodiments can be combined, other embodiments can be utilized, or structural and logical changes can be made without departing from the scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense.
In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one. In this document, the term “or” is used to refer to a non-exclusive “or”, such that “A or B” includes “A but not B”, “B but not A”, and “A and B”, unless otherwise indicated.
Referring to the figures, a system 100 for generating features from documents is provided. The steps of
At step 302, the system 100 may process the document 300 to group text into multiple logical text blocks 304a-304i, wherein one logical block may be separated from the other by whitespace. Each of the logical text blocks 304a-304i may encapsulate a text segment comprising one or more tokens. As an example, the logical text block 304a comprises the tokens “floating”, “amounts” and “:”. A logical text block, in other words a text segment, may capture a concept such as a topic, paragraph, section, table cell or list.
Techniques of creating such logical text blocks are known. One such technique is taught by Cartic Ramakrishnan et al. in “Layout-aware text extraction from full-text PDF of scientific articles” Source Code Biol. Med., 2012; 7, 7. As an example, the system 100 may create logical text blocks by identifying neighbouring tokens. Referring to
In an embodiment, the threshold distance may be preset by the processor 102. The threshold distance may be different for different directions. As an example, the threshold distance for the tokens disposed in the upward direction may be different compared to the threshold distance for the tokens disposed in the leftward direction.
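By way of illustration only, the grouping of neighbouring tokens into logical text blocks may be sketched as follows. The sketch assumes that each token carries a bounding box, and it simplifies the direction-specific thresholds into one horizontal and one vertical threshold. The names Token, group_tokens, max_dx and max_dy, as well as the default threshold values, are hypothetical and are not taken from the cited technique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def gap(a_lo, a_hi, b_lo, b_hi):
    """Gap between two 1-D intervals; 0.0 when they overlap."""
    return max(0.0, max(a_lo, b_lo) - min(a_hi, b_hi))


def are_neighbours(a, b, max_dx, max_dy):
    """True when both the horizontal and vertical gaps are within threshold."""
    return (gap(a.x0, a.x1, b.x0, b.x1) <= max_dx
            and gap(a.y0, a.y1, b.y0, b.y1) <= max_dy)


def group_tokens(tokens, max_dx=10.0, max_dy=4.0):
    """Group tokens into blocks by merging chains of neighbouring tokens."""
    parent = list(range(len(tokens)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if are_neighbours(tokens[i], tokens[j], max_dx, max_dy):
                parent[find(i)] = find(j)

    blocks = {}
    for i, tok in enumerate(tokens):
        blocks.setdefault(find(i), []).append(tok)
    return list(blocks.values())
```

In this sketch a larger horizontal threshold than vertical threshold keeps words on one line together while keeping separate lines apart; other choices are equally possible.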
As a result of the process discussed above, the system 100 may generate multiple logical text blocks 304a-304i using the document 300. At step 204, the system 100 may select a logical text block for generating features, which may then be used for classification. In conventional methods, the text segments may be classified based on the contextual meaning of tokens relative to other tokens within a text segment. On the other hand, the system 100 may classify each of the logical text blocks 304a-304i by also considering the contextual meaning of tokens in the selected logical text block relative to tokens in qualified neighbouring logical text blocks, which has been observed to lead to improved results.
At step 206, the system 100 identifies logical text blocks neighbouring a logical text block that has been selected for generating features. It may be noted that the system 100 may carry out the discussed steps for all or at least some of the logical text blocks 304a-304i of the document 300. As an example, the system 100 may select the logical text block 304d comprising a single token “Period” and identify logical text blocks neighbouring the selected logical text block 304d. The system 100 may identify the neighbouring logical text blocks disposed along multiple directions from the selected logical text block 304d. As an example, the system 100 may identify the neighbouring logical text blocks disposed in any of upward, downward, leftward, rightward, and diagonal directions from the selected logical text block 304d.
At step 208, the system 100 may qualify one or more neighbouring blocks for generating the features for the tokens in the selected logical text block 304d. For greater certainty, neighbouring text blocks are not limited to a single closest block, and may include multiple neighbouring text blocks in each direction.
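As a non-limiting sketch, the direction in which a neighbouring logical text block is disposed may be derived from bounding boxes as shown below. The Block type, the function names, the direction labels and all coordinates are assumptions made for illustration; the disclosure does not prescribe a particular geometry test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def direction(selected, other):
    """Direction of `other` relative to `selected`."""
    if other.x1 <= selected.x0:
        horiz = "left"
    elif other.x0 >= selected.x1:
        horiz = "right"
    else:
        horiz = ""  # horizontal extents overlap
    if other.y1 <= selected.y0:
        vert = "up"
    elif other.y0 >= selected.y1:
        vert = "down"
    else:
        vert = ""  # vertical extents overlap
    if horiz and vert:
        return f"{vert}-{horiz}"  # diagonal
    return horiz or vert or "overlap"


def neighbours_by_direction(selected, blocks):
    """Map each direction to the blocks disposed in that direction."""
    result = {}
    for block in blocks:
        if block is selected:
            continue
        result.setdefault(direction(selected, block), []).append(block)
    return result
```

Because the result is a list per direction, several neighbouring blocks may be retained in the same direction rather than only the closest one.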
In an embodiment, the system 100 may qualify the neighbouring logical text blocks that may be disposed within a threshold distance from the selected logical text block 304d. The threshold distance for at least one direction may be different from the threshold distance for at least one of the remaining directions. Further, the threshold distance may be a function of the size of the selected logical text block 304d.
In another embodiment, the system 100 may qualify the neighbouring logical text blocks, depending on the size of each of the neighbouring logical text blocks. Further, the size may be a function of the size of the selected logical text block 304d.
In another embodiment, the system 100 may qualify the neighbouring logical text blocks, depending on the number of tokens within the neighbouring logical text blocks. Further, the number of tokens may be a function of the number of tokens of the selected logical text block 304d.
In yet another embodiment, one or more of the criteria discussed above may be applied to qualify the neighbouring logical text blocks.
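The qualification criteria discussed above may be sketched, purely as an example, as a single predicate. This sketch applies all three criteria together, although any subset may be used. The names qualifies, horiz_factor, vert_factor, max_area_ratio and max_token_ratio, and their default values, are hypothetical and chosen only to show thresholds expressed as functions of the selected block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    tokens: tuple
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def area(self):
        return self.width * self.height


def gaps(selected, other):
    """Horizontal and vertical gaps between two blocks (0.0 on overlap)."""
    dx = max(0.0, max(selected.x0, other.x0) - min(selected.x1, other.x1))
    dy = max(0.0, max(selected.y0, other.y0) - min(selected.y1, other.y1))
    return dx, dy


def qualifies(selected, other, horiz_factor=3.0, vert_factor=2.0,
              max_area_ratio=20.0, max_token_ratio=10.0):
    """Apply distance, size and token-count criteria relative to `selected`."""
    dx, dy = gaps(selected, other)
    # Distance thresholds differ per axis and scale with the selected block.
    within_distance = (dx <= horiz_factor * selected.width
                       and dy <= vert_factor * selected.height)
    # Size limit expressed as a multiple of the selected block's area.
    small_enough = other.area <= max_area_ratio * selected.area
    # Token limit expressed as a multiple of the selected block's token count.
    few_tokens = len(other.tokens) <= max_token_ratio * len(selected.tokens)
    return within_distance and small_enough and few_tokens
```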
At step 210, the system 100 may generate features for one or more of the tokens in the selected logical block 304d using one or more of the one or more qualified logical text blocks 204. The system 100 may generate features for tokens in the selected logical block 304d using the tokens in the qualified neighbouring text block, such as qualified logical text block 304h.
In an embodiment, the system 100 may include in the feature the direction in which the qualified logical text block is disposed relative to the selected logical text block. As a generalized example, if “T” is a token in the selected logical text block, “J” is a token in the qualified neighbouring logical text block, and “D” is the direction in which the qualified neighbouring logical text block is disposed relative to the selected logical text block, the feature for the token “T” may be represented as:
Feature=“D|T|J”
The features may be generated using n-grams of tokens, wherein “n” is at least equal to 1.
As an example, consider the token “period” in the selected logical text block 304d and the qualified neighbouring logical text block 304h. The system may generate features “right|period|end”, “right|period|dates”, “right|period|:” and so on.
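The “D|T|J” representation may be sketched as follows. Lower-casing of tokens is inferred from the example above, and joining the tokens of an n-gram with a single space for n greater than 1 is an assumption, since the format for longer n-grams is not specified. The function names are hypothetical, and only the three tokens of block 304h named above are used.

```python
def ngrams(tokens, n):
    """Contiguous n-grams of `tokens`, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def generate_features(selected_tokens, neighbour_tokens, direction, n=1):
    """Build one "D|T|J" feature per (selected token, neighbour n-gram) pair."""
    selected = [t.lower() for t in selected_tokens]
    grams = ngrams([j.lower() for j in neighbour_tokens], n)
    return [f"{direction}|{t}|{g}" for t in selected for g in grams]


features = generate_features(["Period"], ["End", "Dates", ":"], "right")
print(features)
# ['right|period|end', 'right|period|dates', 'right|period|:']
```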
In an embodiment, in addition to the direction, the distance may also be included.
In an embodiment, a preconfigured number of tokens may be used in the qualified logical text block for generating the features. Further, some of the tokens in the qualified logical text block may be ignored for the purposes of generating the features.
In an embodiment, the number of tokens used in the qualified logical text block for generating the features may be a function of the number of tokens in the selected logical text block.
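The embodiments that add distance and limit the tokens used may be sketched together as shown below. How distance is encoded in a feature is not specified; a coarse “near”/“far” bucket is assumed here. The ignored-token set, the tokens_per_selected multiplier, the near_limit value and the token “Quarterly” are all hypothetical.

```python
def capped_features(selected_tokens, neighbour_tokens, direction, distance,
                    tokens_per_selected=2, near_limit=50.0,
                    ignored=frozenset({":", ",", "."})):
    """Build direction-and-distance features from a capped set of tokens."""
    # Ignore some neighbour tokens (here, punctuation) before capping.
    kept = [j.lower() for j in neighbour_tokens if j not in ignored]
    # The cap is a function of the number of tokens in the selected block.
    limit = tokens_per_selected * len(selected_tokens)
    kept = kept[:limit]
    # Encode distance as a coarse bucket (an assumed representation).
    bucket = "near" if distance <= near_limit else "far"
    return [f"{direction}|{bucket}|{t.lower()}|{j}"
            for t in selected_tokens for j in kept]


print(capped_features(["Period"], ["End", "Dates", ":", "Quarterly"],
                      "right", 10.0))
# ['right|near|period|end', 'right|near|period|dates']
```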
The system 100 may provide the features to a classifier for classification. In an embodiment, the text segments in each of the logical text blocks 304 may be classified using one of the classifiers provided below.
a. Termination Date-Confirmations
b. Fixed Rate Day Count Fraction
c. Floating Rate Day Count Fraction
d. Description of Premises
e. Address of Premises
f. Square Footage of Premises
g. Guarantor
Table 1 provided below illustrates the experimental results (average lifetime F1, Recall and Precision) obtained when the features generated as discussed above are fed to the classifiers, as compared to conventional feature generation. From Table 1, it can be observed that all seven classifiers improve with the inclusion of the neighbouring logical blocks. Recall and F1 improve in all cases, though Precision suffered substantially for classifier (b). This is likely due to Fixed Rates being rarer in the training documents, appearing in only 47 of the 70 documents. Precision improved by only 0.02 on average, while Recall improved by 0.09 on average, indicating that inclusion of the neighbouring logical blocks may help the classifiers distinguish between true positives and false positives, likely because the false text sequences are very similar to the true sequences and are distinguishable only by their larger surrounding context. Overall, the F1 scores of the seven classifiers increase by 0.06 on average.
The processes described above are described as a sequence of steps; this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, or some steps may be performed simultaneously.
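For reference, Precision, Recall and F1 are conventionally computed from counts of true positives, false positives and false negatives, as in the sketch below. The function name and the counts used in the example call are hypothetical and do not correspond to the reported experimental results.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```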
Referring to
The memory module 104 may store additional data and program instructions that are loadable and executable on the processor 102, as well as data generated during the execution of these programs. Further, the memory module 104 may be volatile memory, such as random-access memory, or non-volatile memory, such as a disk drive. The memory module 104 may be removable memory such as a Compact Flash card, Memory Stick, Smart Media, Multimedia Card, Secure Digital memory, or any other memory storage that exists currently or will exist in the future.
The input/output module 106 may provide an interface for input devices such as a keypad, touch screen, mouse, and stylus, among other input devices, and output devices such as speakers, a printer, and additional displays, among others.
The display module 110 may be configured to display content. The display module 110 may also be used to receive an input from a user. The display module 110 may be of any display type known in the art, for example, Liquid Crystal Displays (LCD), Light emitting diode displays (LED), Orthogonal Liquid Crystal Displays (OLCD) or any other type of display currently existing or that may exist in the future.
The communication interface 112 may provide an interface between the system 100 and external networks. The communication interface 112 may include a modem, a network interface card (such as Ethernet card), a communication port, or a Personal Computer Memory Card International Association (PCMCIA) slot, among others. The communication interface 112 may include devices supporting both wired and wireless protocols.
The example embodiments described herein may be implemented in an operating environment comprising software installed on a computer, in hardware, or in a combination of software and hardware.
Although embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the system and method described herein. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
Many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. It is to be understood that although the description above contains many specifications, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the presently preferred embodiments of this invention.