This application claims the benefit of Taiwan application Serial No. 109142019, filed Nov. 30, 2020, the disclosure of which is incorporated by reference herein in its entirety.
The disclosure relates in general to a document sentence concept labeling system, a training method and a labeling method thereof.
Document structure analysis is an important technology for deep document understanding and information extraction. The document structure analysis expands the scope of document analysis from the level of words and named entities to the level of larger contexts such as multiple sentences. This analysis includes dividing the full text of a document into smaller blocks, and giving different blocks corresponding category labels. For example, the sentences in the abstracts of biomedical scientific papers may be automatically divided and labeled with different sentence concepts, such as background, purpose, method, conclusion, and contribution.
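As a toy illustration of this kind of labeling (the concept names follow the example above; the sentences are hypothetical), a labeled abstract can be represented as a mapping from sentence concepts to sentence sets:

```python
# Hypothetical example: an abstract divided into sentence sets,
# each set labeled with a sentence concept.
labeled_abstract = {
    "background": ["Disease X remains hard to treat."],
    "purpose": ["We study whether drug Y inhibits gene Z."],
    "method": ["We ran assays on 100 samples.",
               "Expression levels were measured by qPCR."],
    "conclusion": ["Drug Y significantly reduces expression of gene Z."],
}

def sentences_for(concept):
    """Return the sentence set labeled with the given sentence concept."""
    return labeled_abstract.get(concept, [])
```

Note that a sentence set may contain more than one sentence, as in the "method" set above.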
After a document is labeled, the sentence sets corresponding to different sentence concepts can be obtained, and the result can be used as feature information for higher-level applications. In practical use, however, the labeling accuracy is difficult to improve when the structures of the documents vary greatly.
The disclosure is directed to a document sentence concept labeling system, a training method and a labeling method thereof.
According to one embodiment, a training method of a document sentence concept labeling system is provided. The training method of the document sentence concept labeling system includes the following steps. A plurality of labeled documents are received, each of which is labeled with one or more sentence sets corresponding to one or more sentence concepts. A start position and an end position of each of the sentence sets are generated in each of the labeled documents. Orders of the sentence sets in each of the labeled documents are changed, and the start positions and the end positions in each of the labeled documents are updated, to obtain a plurality of generated documents, each of which is labeled with the sentence sets. Each of the generated documents is inputted into a pre-trained language model to obtain a set of word embeddings of each of the generated documents. The sets of word embeddings, the start positions and the end positions of the generated documents are inputted into a document analysis model for performing a training procedure of the document analysis model. The document analysis model is used to label the sentence concepts in an unlabeled document.
According to another embodiment, a labeling method of a document sentence concept labeling system is provided. The labeling method of the document sentence concept labeling system includes the following steps. An unlabeled document and one or more sentence concepts are inputted into a pre-trained language model to obtain a set of word embeddings of the unlabeled document. The set of word embeddings of the unlabeled document is inputted into a document analysis model to obtain a start position and an end position of a sentence set corresponding to each of the sentence concepts in the unlabeled document. Each of the sentence sets is obtained according to each of the start positions and each of the end positions.
According to an alternative embodiment, a document sentence concept labeling system is provided. The document sentence concept labeling system includes a position indexing unit, a data generation unit, a pre-trained language model and a document analysis model. The position indexing unit is configured to receive a plurality of labeled documents, each of which is labeled with one or more sentence sets corresponding to one or more sentence concepts. The position indexing unit generates a start position and an end position of each of the sentence sets in the labeled documents. The data generation unit is configured to change orders of the sentence sets in each of the labeled documents, and update the start positions and the end positions in each of the labeled documents to obtain a plurality of generated documents, each of which is labeled with the sentence sets. The pre-trained language model is configured to obtain a set of word embeddings of each of the generated documents. The document analysis model is configured to receive the sets of word embeddings, the start positions and the end positions of the generated documents, for performing a training procedure. The document analysis model is used to label the sentence concepts in an unlabeled document.
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawing.
Please refer to
As shown in
Then, as shown in
Next, as shown in
Then, as shown in
Afterwards, as shown in
In this embodiment, any of the sentence sets SC12, SC13, SC14 analyzed by the document sentence concept labeling system 100 may contain more than one sentence. Moreover, the content inputted into the document sentence concept labeling system 100 is not a single sentence, but the full text of the unlabeled document DC1. In addition, the document sentence concept labeling system 100 does not classify the sentence concept for individual sentences, but identifies the start positions S12, S13, S14 and the end positions E12, E13, E14 of the sentence sets SC12, SC13, SC14 from the entire unlabeled document DC1.
Please refer to
The document sentence concept labeling system 100 can use the pre-trained language model 110, the document analysis model 120 and the sentence set selection unit 130 to label the sentence set SC12 corresponding to the sentence concept CL2 in the unlabeled document DC1 (the same is true for the sentence sets SC13 and SC14; during labeling, only one sentence concept is labeled at a time, so to clearly illustrate this point, only the sentence set SC12 corresponding to the sentence concept CL2 is taken as an example for illustration in
Please refer to
Take the BERT model as an example. Generally, the input to the BERT model is a sentence with [CLS] and [SEP] as special marks at the beginning and end, such as [CLS]sentence1[SEP] or [CLS]sentence1[SEP]sentence2[SEP] . . . [SEP], where sentence1 and sentence2 each represent a single sentence.
In this embodiment, the input to the BERT model is the full text of the unlabeled document DC1 and the sentence concept CL2 (the same is true for the sentence concepts CL3 and CL4; during labeling, only one sentence concept is labeled at a time, so only the sentence concept CL2 is taken as an example for illustration). The input for the BERT model is, for example, [CLS]label[SEP]text[SEP], where the label is one of the sentence concepts CL1, CL2, . . . , CL5, and the text is the full text of the unlabeled document DC1.
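The [CLS]label[SEP]text[SEP] input format described above can be sketched as a simple string-building step (the function name is hypothetical; in practice a BERT tokenizer would insert these special tokens during encoding):

```python
def build_model_input(label, text):
    """Format one sentence concept label together with the full
    document text as a single [CLS]label[SEP]text[SEP] input string."""
    return f"[CLS]{label}[SEP]{text}[SEP]"
```

One such input is built per sentence concept, so labeling a document against five concepts means five separate passes through the model.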
Then, in step S420, the set of word embeddings V1 of the unlabeled document DC1 is inputted into the document analysis model 120 to obtain a start position (e.g. the start position S12) and an end position (e.g. the end position E12) of a sentence set (for example, the sentence set SC12) corresponding to each of the sentence concepts (for example, the sentence concept CL2) in the unlabeled document DC1. The document analysis model 120 includes a start token prediction unit 121 and an end token prediction unit 122. The start token prediction unit 121 is configured to predict the start position S12; the end token prediction unit 122 is configured to predict the end position E12. In this step, the document analysis model 120 contains a dense layer and a Softmax layer. The set of word embeddings V1 is inputted into the dense layer and the Softmax layer to generate the start position distribution probability and the end position distribution probability, and then the start position S12 and the end position E12 are obtained.
In this embodiment, the document analysis model 120 receives the set of word embeddings V1 generated from the full text of the unlabeled document DC1.
Next, in step S430, the sentence set selection unit 130 obtains each of the sentence sets (e.g. the sentence set SC12) corresponding to each of the sentence concepts (for example, the sentence concept CL2) according to each of the start positions (for example, the start position S12) and each of the end positions (for example, the end position E12).
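The selection step itself then reduces to slicing the document by the predicted positions. A sketch, assuming the positions index tokens and are inclusive on both ends (an assumption for illustration; the disclosure does not fix the indexing convention):

```python
def extract_sentence_set(tokens, start, end):
    """Recover the sentence set from predicted start/end token
    positions, treating both endpoints as inclusive."""
    return tokens[start:end + 1]
```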
In this embodiment, the labeling method of the document sentence concept labeling system 100 analyzes the full text of the unlabeled document DC1 and then identifies the range of the sentence set corresponding to the sentence concept. Because it does not classify single sentences in isolation, the sentence set SC12 that best fits the sentence concept CL2 can be found among all the sentences. Therefore, the labeling accuracy of the document sentence concept labeling system 100 can be greatly improved.
For example, please refer to Table 1 below. Compared with HSLN-CNN, HSLN-RNN, AI2 and other labeling methods, the labeling method disclosed in this disclosure achieves the highest accuracy (F1) on all three unlabeled documents.
The above embodiment takes the sentence concept CL2 as an example for illustration. The rest of the sentence concepts CL1, CL3, CL4, CL5 can also be labeled according to the above steps.
The above is the labeling method of the document sentence concept labeling system 100. Before implementing the labeling method, the document sentence concept labeling system 100 must be properly trained. Please refer to
Please refer to
Next, in step S520, the position indexing unit 140 generates start positions S01, . . . , S05 and end positions E01, . . . , E05 of the sentence sets SC01, . . . , SC05 in the labeled documents DC0.
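The position indexing step can be sketched as a running cursor over the document's sentence sets (a minimal sketch assuming token-level, inclusive positions; the function name is hypothetical):

```python
def index_positions(sentence_sets):
    """Given the ordered sentence sets of a labeled document (each a
    list of tokens), return the (start, end) token position of each
    set, with both endpoints inclusive."""
    positions = []
    cursor = 0
    for tokens in sentence_sets:
        start = cursor
        end = cursor + len(tokens) - 1
        positions.append((start, end))
        cursor = end + 1  # the next set starts right after this one
    return positions
```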
Next, in step S530, the data generation unit 150 changes the orders of the sentence sets SC01, . . . , SC05 in the labeled documents DC0, and updates the start positions S01, . . . , S05 and the end positions E01, . . . , E05 to the start positions S01′, . . . , S05′ and the end positions E01′, . . . , E05′, to obtain a plurality of generated documents DC0′. Each of the generated documents DC0′ is labeled with the sentence sets SC01, . . . , SC05. The generated documents DC0′ are no longer the original labeled documents DC0, but retain the sentence sets SC01, . . . , SC05.
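The order-changing step can be sketched as shuffling the sentence sets and recomputing each set's position in the shuffled document (an illustrative sketch; `seed` is a hypothetical parameter added only to make the example reproducible):

```python
import random

def generate_document(sentence_sets, seed=None):
    """Shuffle the order of a labeled document's sentence sets and
    recompute the (start, end) token position of each set in the
    shuffled document. Returns (shuffled_sets, positions), where
    positions[i] indexes shuffled_sets[i]; endpoints are inclusive."""
    rng = random.Random(seed)
    shuffled = sentence_sets[:]      # copy so the original is untouched
    rng.shuffle(shuffled)
    positions = []
    cursor = 0
    for tokens in shuffled:
        positions.append((cursor, cursor + len(tokens) - 1))
        cursor += len(tokens)
    return shuffled, positions
```

Each shuffle yields one generated document, so a single labeled document can produce many training examples with different orderings.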
Afterwards, in step S540, the generated documents DC0′ are inputted into the pre-trained language model 110 to obtain a plurality of sets of word embeddings V0′ of the generated documents DC0′.
Then, in step S550, the sets of word embeddings V0′ of the generated documents DC0′, the start positions S01′, . . . , S05′ and the end positions E01′, . . . , E05′ are inputted into the document analysis model 120 for performing a training procedure of the document analysis model 120. That is to say, when performing the training procedure, the labeled documents DC0 are not inputted into the document analysis model 120; instead, the generated documents DC0′ are inputted. In the generated documents DC0′, the orders of the sentence sets SC01, . . . , SC05 have been changed, so the document analysis model 120 cannot rely on any fixed order feature. Therefore, the document sentence concept labeling system 100 has a fairly high tolerance and robustness for various document structure variations.
For example, please refer to Table 2 below. On unlabeled documents whose order has been changed, the labeling method of the present disclosure still maintains an accuracy (F1) quite close to that on the original documents. In contrast, the accuracy of the AI2 labeling method drops greatly on the order-changed unlabeled documents, so it does not have high tolerance and robustness to variation of the document structure.
According to the above embodiment, the document sentence concept labeling system 100 has a fairly high accuracy and robustness in document structure analysis, and its application in relation extraction and document retrieval can achieve quite good results.
Traditionally, when performing relation extraction, words such as A disease, B gene and C drug can be searched out from the full text of certain documents, and it is then determined that the A disease, the B gene and the C drug are highly related. However, when the C drug is in the sentence set corresponding to the sentence concept of "background" while the A disease and the B gene are in the sentence set corresponding to the sentence concept of "contribution", the C drug is actually not highly related to the A disease and the B gene, resulting in false recognition in relation extraction.
According to the present disclosure, it is possible to limit the search to the sentence set corresponding to the sentence concept of "contribution." If the A disease, the B gene and the C drug often exist together in the sentence set corresponding to the sentence concept of "contribution" in several documents, it can be truly confirmed that the A disease, the B gene and the C drug are highly related.
Please refer to
After the unlabeled document DC2 and the sentence concept CLi are inputted into the document sentence concept labeling system 100, the sentence set SCi can be obtained. The named entity recognition unit 220 generates several entities NEi of the unlabeled document DC2. The sentence segmentation unit 210 generates all the sentences Si in the unlabeled document DC2. The entity relation extraction unit 230 generates entity relation pairs according to whether the entities NEi exist in the sentence set SCi corresponding to the sentence concept CLi. For example, suppose medical researchers want to know the entity relation among the A disease, the B gene and the C drug. The entity relation extraction unit 230 observes whether the entities NEi, including the A disease, the B gene and the C drug, often appear together in the sentence set SCi corresponding to the specific sentence concept CLi, and generates the correct entity relation pairs accordingly. That is to say, the relation extraction system 200 can identify whether an entity relation pair holds in the sentence sets SCi corresponding to the sentence concepts CLi in the unlabeled document DC2.
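The co-occurrence check described above can be sketched as follows (a minimal sketch with substring matching standing in for named entity recognition; the function name and matching strategy are illustrative assumptions):

```python
def relation_holds(entities, sentence_set):
    """Check whether all queried entities co-occur in the sentence set
    labeled with a specific sentence concept (e.g. "contribution").
    Entities found only in other sentence sets, such as "background",
    do not count toward the relation."""
    text = " ".join(sentence_set)
    return all(entity in text for entity in entities)
```

For example, a query over the "contribution" sentence set would reject a relation whose entity appears only in the "background" sentence set, avoiding the false recognition described above.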
In addition, when searching for documents, if the D virus is found in certain documents, it may be determined that these documents are research papers on the D virus. However, when the D virus appears only in the sentence set corresponding to the sentence concept of "background", those documents are probably not research on the D virus, and search errors occur.
According to the method of this embodiment, the search can be restricted in the sentence set corresponding to the sentence concept of “contribution”, or the sentence set corresponding to the sentence concept of “contribution” can be given a higher search priority, so that the correct document for the D virus can be found.
Please refer to
In the indexing phase, unlabeled documents DC3 are inputted into the indexing unit 310. The indexing unit 310 creates a document index for the unlabeled documents DC3.
The document sentence concept labeling system 100 extracts several sentence sets SCj corresponding to the sentence concept CLj for each of the unlabeled documents DC3. The indexing unit 310 creates sub-documents and a sub-document index for the sentence sets SCj.
In the search phase, the query processing unit 320 receives a query condition q and a sentence concept CLj. After the query processing unit 320 generates a search condition, the indexing unit 310 searches in the sub-document index. The sub-documents that meet the search condition are sorted by the ranking unit 330, and the result representation unit 340 gives weighted scores to the sub-documents according to the search condition and returns the search result. That is to say, the document retrieval system 300 can identify whether the query condition q is met based on the sentence sets SCj corresponding to the sentence concepts CLj in the unlabeled document.
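The sub-document indexing and concept-restricted search described above can be sketched as follows (an illustrative sketch with substring matching in place of a real search condition; all names are hypothetical):

```python
def build_subdocument_index(documents):
    """documents: doc_id -> {sentence_concept: sentence_set}.
    Create one sub-document per (document, sentence concept) pair."""
    index = []
    for doc_id, concepts in documents.items():
        for concept, sentences in concepts.items():
            index.append({"doc_id": doc_id,
                          "concept": concept,
                          "text": " ".join(sentences)})
    return index

def search(index, query, concept):
    """Return the doc_ids whose sub-document for the given sentence
    concept matches the query, restricting the search to that concept."""
    return [sub["doc_id"] for sub in index
            if sub["concept"] == concept and query in sub["text"]]
```

Restricting the search to the "contribution" sub-documents in this way filters out documents that only mention the query term in their "background" sentence set.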
According to the above-mentioned embodiment, the document sentence concept labeling system 100 has a fairly high accuracy and robustness in document structure analysis, and its application in relation extraction and document retrieval can achieve quite good results. It is especially helpful in the fields of technical document analysis, bidding document analysis, academic paper analysis and social opinion analysis.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments. It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims and their equivalents.
Number | Date | Country | Kind
---|---|---|---
109142019 | Nov 2020 | TW | national