This application claims priority pursuant to 35 U.S.C. § 119 from Japanese Patent Application No. 2019-184595, filed on Oct. 7, 2019, the entire disclosure of which is incorporated herein by reference.
The present disclosure relates to a data extraction method and a data extraction device.
A future shortage of skilled workers has become a problem in fields involving highly specialized work. To address this situation, attempts have been made to build databases containing the knowledge and views of skilled workers and to use them effectively. For example, text data in which the knowledge and views of skilled workers are recorded is generated, and this text data is referred to by unskilled workers. In addition, the use of such text data in machine learning to create a pre-trained model on the work is also being studied.
However, since the amount of such text data is usually extremely large, it is desirable, also from the viewpoint of improving work efficiency, to generate text data that includes only the extracted necessary information.
Here, as data extraction methods, methods have been proposed that extract unique expressions from existing text data and create databases in which the relationship (binary relationship) between the extracted unique expressions has been determined. For example, in a technique disclosed in Japanese Patent Application Publication No. 2007-4458 (hereinafter referred to as patent document 1), a binary-relationship extraction device extracts characteristics of a case from teacher data of binary relationships that occur in text data and makes combinations of sets of characteristics and solutions. The device then performs machine learning on what kind of set of characteristics leads to what kind of solution in those combinations. The device also extracts, from text data, binary-relationship candidates and sets of characteristics of the binary-relationship candidates and presumes, based on the learning result information, a solution for a case of a set of characteristics of a binary-relationship candidate and the degree of its likeliness. The device then extracts the binary-relationship candidate having the better degree of presumption of a correct solution.
Also, in a technique disclosed in Japanese Patent Application Publication No. 2011-227688 (hereinafter referred to as patent document 2), a device for extracting the relationship between two entities in a text corpus generates a first co-occurrence matrix whose elements are the frequencies at which each entity pair and each vocabulary pattern are associated, and the device sorts the entity pairs and the vocabulary patterns in the first co-occurrence matrix in descending order of frequency to generate a second co-occurrence matrix. The device performs clustering on the entity pairs and the vocabulary patterns in the second co-occurrence matrix to obtain clusters of entity pairs and clusters of vocabulary patterns, and it generates a third co-occurrence matrix whose rows are one of the obtained clusters of entity pairs and clusters of vocabulary patterns, whose columns are the other, and whose elements are the frequencies summed by the clustering.
Unfortunately, the method in patent document 1 has the disadvantage that the text of the teacher data needs to be given appropriate labels in advance, which makes the step of generating text data troublesome.
The method of patent document 2 does not require manual labeling, but in order to perform machine learning with high accuracy, a large amount of teacher data (entity pairs and vocabulary patterns), which is necessary for clustering based on the co-occurrence probability distribution, has to be prepared. For this reason, in cases where the amount of accumulated text data is small to begin with (for example, in fields of highly specialized work), there has been a problem that machine learning cannot be performed with accuracy high enough for practical use.
The present disclosure has been made in light of the above situation, and an objective thereof is to provide a data extraction method and a data extraction device that are capable of extracting characteristic data from text data appropriately and efficiently according to the work field.
An aspect of the disclosure to achieve the above objective is a data extraction device comprising: a label input part that receives, from a user, an input of the type of each component of at least one set of sentences and a designation of a topic portion in the component; a model creation part that creates a pre-trained model that has learned the type of each component of the set of sentences and a feature of the topic portion in the component of the set of sentences; a sentence-feature presuming part that inputs a specified set of sentences inputted by a user into the pre-trained model to presume each component of the specified set of sentences and a topic portion in each component of the specified set of sentences; a word-vector generation part that determines a relationship among each word in the specified set of sentences, the type of each presumed component, and the presumed topic portion to calculate a feature amount of each word; and a relationship extraction part that determines a relationship among each of the words based on the calculated feature amount.
Another aspect of the disclosure is a data extraction method comprising: a label input process of receiving, from a user, an input of the type of each component of at least one set of sentences and a designation of a topic portion in the component; a model creation process of creating a pre-trained model that has learned the type of each component of the set of sentences and a feature of the topic portion in the component of the set of sentences; a sentence-feature presuming process of inputting a specified set of sentences inputted by a user into the pre-trained model to presume each component of the specified set of sentences and a topic portion in each component of the specified set of sentences; a word-vector generation process of determining a relationship among each word in the specified set of sentences, the type of each presumed component, and the presumed topic portion to calculate a feature amount of each word; and a relationship extraction process of outputting information indicating a relationship among each of the words based on the calculated feature amount, wherein the label input process, the model creation process, the sentence-feature presuming process, the word-vector generation process, and the relationship extraction process are performed by an information processing device.
The present disclosure makes it possible to extract characteristic data from text data appropriately and efficiently according to the work field.
Issues, configurations, and effects other than those described above will be made clear by the description of the following embodiments.
Specifically, the work analysis system 1 includes the document server 10 that stores the sets of sentences 5, a data extraction device 20 that creates specified databases by using the sets of sentences 5, and an analysis device 30 that performs work analysis based on these databases.
The data extraction device 20 creates specified pre-trained models from the sentences in the sets of sentences 5 to generate information indicating the relationship among a plurality of unique expressions (various words or the like used in the work) in the sets of sentences 5 (hereinafter referred to as relationship information). In the present embodiment, the data extraction device 20 creates a database indicating relationship information.
The sets of sentences 5 include teacher text data 101 for creating pre-trained models and text data for learning 102 to be inputted into the pre-trained models. The teacher text data 101 is, for example, data accumulated up to that point. The text data for learning 102 is, for example, data inputted by specified users every time a new task occurs.
Each set of sentences 5 is a body of sentences in which information on work is recorded. The sets of sentences 5 are, for example, work logs, reports, papers, experiment reports, and the like.
The analysis device 30 performs specified work analysis using the databases created by the data extraction device 20. The analysis device 30 is, for example, a search device that searches for coping methods against problems that occurred on work or a simulator device that performs virtual experiments.
The document server 10, the data extraction device 20, and the analysis device 30 are communicably coupled, for example, by a wired or wireless communication network 7 such as the Internet, a local area network (LAN), and a wide area network (WAN).
Next, the following describes functions included in the data extraction device 20.
The paragraph division part 111 divides each set of sentences 5 (the teacher text data 101 and the text data for learning 102) into a plurality of components based on paragraph division rules 103 described later. In the present embodiment, it is assumed that a document in a set of sentences 5 is divided into a plurality of paragraphs. Here, the paragraph may include a description portion of a subitem (title) attached to the paragraph.
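As an illustrative sketch only, the operation of the paragraph division part 111 might look like the following. The paragraph division rules 103 are field-specific and are not spelled out here, so this sketch makes the simplifying assumption that paragraphs are separated by blank lines and that a short first line without sentence-ending punctuation is the subitem (title); the function name `divide_into_paragraphs` is hypothetical.

```python
import re

def divide_into_paragraphs(document: str) -> list[dict]:
    """Split a document into paragraphs on blank lines (a stand-in for
    the field-specific paragraph division rules 103)."""
    paragraphs = []
    for i, block in enumerate(re.split(r"\n\s*\n", document.strip())):
        lines = block.strip().splitlines()
        # Treat a short first line that lacks sentence-ending punctuation
        # as a subitem title attached to the paragraph.
        title = lines[0] if len(lines) > 1 and not lines[0].endswith((".", "。")) else ""
        body = "\n".join(lines[1:]) if title else "\n".join(lines)
        paragraphs.append({"paragraph_id": f"P{i+1}", "title": title, "body": body})
    return paragraphs
```

In an actual system, the splitting criteria (headings, numbering, indentation, and so on) would be supplied by the paragraph division rules 103 rather than hard-coded.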
Here, the following describes an example of a set of sentences 5.
First, the set of sentences 5 have types or structures of paragraphs unique to the work field. In the example of
In addition, in the set of sentences 5, some paragraphs have a topic portion (for example, a sentence segment indicating the gist of the paragraph, which is hereinafter also referred to as a topic sentence-segment), and the topic portions are included in paragraphs of a specific type. In the example of
Based on such features unique to the set of sentences 5, the data extraction device 20 can extract relationship information.
Next, the sentence-element division part 112 illustrated in
The label input part 113 illustrated in
Note that the label input part 113 generates information combining the information received from the user and the set of sentences 5 as labeled text data 104.
The paragraph type data 901 is a database in which the type of each paragraph is recorded and which includes one or more records having the items of “paragraph ID” 911 that stores the identifiers (paragraph IDs) of paragraphs and “paragraph type” 912 that stores the paragraph types according to the paragraph IDs 911 (here, Type A, Type B, . . . ).
The sentence-element type data 902 is a database in which topic sentence-segments are recorded and which includes one or more records having the following items: “paragraph ID” 911 that stores paragraph IDs, “sentence ID” 914 that stores the identifier (sentence ID) of each sentence included in the paragraph according to the paragraph ID 911, “sentence element ID” 915 that stores the identifier (sentence element ID) of each sentence element in the sentence according to the sentence ID 914, “sentence-segment string” 916 that stores the content (character string) of the sentence element according to the sentence element ID 915, and “topic-sentence element” 917 in which information on the topic sentence-segment of the sentence element according to the sentence element ID 915 is set.
Here, in the topic-sentence element 917, if the sentence element has a topic sentence-segment, “1” is set. If the sentence element does not have a topic sentence-segment, “−1” is set. If whether the sentence element has a topic sentence-segment is unknown or whether the sentence element has a topic sentence-segment is going to be presumed with a topic discrimination model 106, “0” is set.
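The three-valued encoding of the topic-sentence element 917 can be sketched as follows. The record class and field names below are hypothetical stand-ins for the items of the sentence-element type data 902; only the 1 / −1 / 0 convention comes from the description above.

```python
from dataclasses import dataclass

# Encoding of the "topic-sentence element" 917 field:
#   1  -> the sentence element is a topic sentence-segment
#  -1  -> it is not
#   0  -> unknown, or to be presumed by the topic discrimination model 106
TOPIC, NOT_TOPIC, UNKNOWN = 1, -1, 0

@dataclass
class SentenceElementRecord:
    paragraph_id: str
    sentence_id: str
    element_id: str
    segment_string: str
    topic_flag: int = UNKNOWN

records = [
    SentenceElementRecord("P1", "S1", "E1", "The brake pad showed abnormal wear", TOPIC),
    SentenceElementRecord("P1", "S1", "E2", "which was noticed during inspection", NOT_TOPIC),
    SentenceElementRecord("P2", "S1", "E1", "Replacement was scheduled", UNKNOWN),
]

# Records whose flag is 0 are the ones routed to the presuming step later.
to_presume = [r for r in records if r.topic_flag == UNKNOWN]
```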
Next, the model creation part 120 illustrated in
Specifically, the model creation part 120 includes a paragraph-type-discrimination-model creation part 114 and a topic-discrimination-model creation part 115.
The paragraph-type-discrimination-model creation part 114 learns the relationship among the words in the set of sentences 5 and the types (paragraph types) of the components based on the labeled text data 104 generated by the label input part 113 and thereby creates, as a first pre-trained model 140, a paragraph-type discrimination model 105 that has memorized the relationship among the words and the types (paragraph type) of the components.
The topic-discrimination-model creation part 115 creates, as a second pre-trained model 140, a topic discrimination model 106 that has memorized the relationship among the types (paragraph types) of the components, the words in each component (paragraph), and the topic portion (topic sentence-segment) in the component (paragraph), based on the labeled text data 104 created by the label input part 113.
Note that the topic-discrimination-model creation part 115 creates, as the topic discrimination model 106, a model that has at least each word in a component (paragraph) and a word having a modification relationship with the word as a feature amount.
Next, as illustrated in
Specifically, the sentence-feature presuming part 130 includes a paragraph-type presuming part 116 and a topic presuming part 117.
The paragraph-type presuming part 116 inputs the specified set of sentences 5 (the text data for learning 102) inputted by the user into the first pre-trained model 140 (paragraph-type discrimination model 105) to presume the type (paragraph type) of each component of the specified set of sentences 5.
The topic presuming part 117 inputs the specified set of sentences 5 (text data for learning 102) into the second pre-trained model 140 (topic discrimination model 106) to presume the topic portion (topic sentence-segment) of each component (paragraph) of the specified set of sentences 5.
Note that the topic presuming part 117 inputs each word in a component (paragraph) of the specified set of sentences 5 (text data for learning 102) and a word having a modification relationship with the word into the topic discrimination model 106 to presume the topic portion.
The word-vector calculation part 118 determines the relationship among each word in the specified set of sentences (text data for learning 102), the type of each component presumed by the paragraph-type presuming part 116, and the topic portion presumed by the topic presuming part 117 and thereby calculates the feature amount of each word.
Specifically, the word-vector calculation part 118 learns the relationship among each word in the specified set of sentences 5 (text data for learning 102), the type (paragraph type) of each component presumed by the paragraph-type presuming part 116, and the topic portion (topic sentence-segment) presumed by the topic presuming part 117. It thereby creates a co-occurrence-word presuming model 108 that has memorized the relationship among the occurrence of words in the specified set of sentences 5, the types (paragraph types) of the components in the specified set of sentences 5, and the topic portions (topic sentence-segments) of the components (paragraphs), and it calculates the feature amount of each word based on the created co-occurrence-word presuming model 108.
The relationship extraction part 119 extracts a plurality of words having a relationship with one another based on the feature amounts calculated by the word-vector calculation part 118.
The output part 150 outputs the plurality of words having a relationship with one another extracted by the relationship extraction part 119. The information on the words is recorded in a unique-expression relationship DB 107 described later. Note that the analysis device 30 is capable of outputting specified analysis results 109 on the work based on this unique-expression relationship DB 107.
Here,
Next, the following describes the process performed by the work analysis system 1.
The following describes each of the discrimination-model creation process and the relationship-information generation process.
First, the paragraph division part 111 of the data extraction device 20 receives teacher text data 101 from the document server 10 and divides each set of sentences 5 in the received teacher text data 101 into paragraphs (a first paragraph division process s11).
Then, the sentence-element division part 112 restructures the paragraphs resulting from the division by the paragraph division part 111 into sentence elements (a first sentence element division process s13).
Meanwhile, the label input part 113 receives an input of the paragraph type of each paragraph in each set of sentences 5 in the teacher text data 101 and an input of the topic sentence-segment of each paragraph in each set of sentences 5 (a label input process s15). Note that the content of each input is recorded in the paragraph type data 901 and sentence-element type data 902 of the labeled text data 104.
Then, the paragraph-type-discrimination-model creation part 114 creates a paragraph-type discrimination model 105 based on the paragraph type data 901 generated at s15 (a paragraph-type-discrimination-model creation process s17). The topic-discrimination-model creation part 115 creates a topic discrimination model 106 based on the sentence-element type data 902 generated at s15 (a topic-discrimination-model creation process s19).
The following describes details of each process in the discrimination-model creation process.
Here, the paragraph division rules 103 and the text data with paragraph information 301 are described below.
Text Data with Paragraph Information
The following describes a first sentence-element division process.
The first sentence-element division part 112 refers to the syntax-analysis result data 501 to determine all the modification relationships (paths) of all the sentence segments in each paragraph of the teacher text data 101 and thereby determines the sentence elements of each sentence (s133). Then, the first sentence-element division part 112 stores information on the determined sentence elements in sentence-element division-result data 601 described later (s135).
Note that
Next, the following describes the label input process.
Then, the label input part 113 generates the labeled text data 104 based on the inputted data (s153).
Next, the following describes the paragraph-type-discrimination-model creation process.
Specifically, for example, the paragraph-type-discrimination-model creation part 114 obtains the contents of the sentences of all the paragraphs of which the paragraph types 912 of the paragraph type data 901 are not “0” from the text data with paragraph information 301 and generates word vectors on the obtained sentences of each paragraph, each vector having elements of the constituent words of each sentence. Note that this process can be achieved, for example, by doc2vec.
Then, the paragraph-type-discrimination-model creation part 114 learns the relationship among the word vectors of each paragraph generated at s171 and the paragraph type (paragraph type data 901) of each paragraph and thereby creates the paragraph-type discrimination model 105 (s173).
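Steps s171 and s173 can be sketched as follows. The description above names doc2vec for the vectorization; to keep this sketch self-contained, it substitutes a plain bag-of-words vector for doc2vec and a nearest-centroid classifier for the learned paragraph-type discrimination model 105 — both are illustrative stand-ins, and the function names are hypothetical.

```python
from collections import Counter
import math

def bow_vector(text: str, vocab: list[str]) -> list[float]:
    """Bag-of-words vector whose elements are the constituent-word counts
    (a stand-in for the doc2vec vectors of s171)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def train_paragraph_type_model(labeled):
    """labeled: list of (paragraph_text, paragraph_type) pairs from the
    labeled text data 104. Returns per-type centroid vectors, i.e. a toy
    stand-in for the paragraph-type discrimination model 105 (s173)."""
    vocab = sorted({w for text, _ in labeled for w in text.lower().split()})
    centroids = {}
    for text, ptype in labeled:
        vec = bow_vector(text, vocab)
        acc = centroids.setdefault(ptype, [0.0] * len(vocab))
        for i, x in enumerate(vec):
            acc[i] += x
    return vocab, centroids

def presume_paragraph_type(model, paragraph_text):
    """Classify a new paragraph by its nearest type centroid (this is the
    operation the paragraph-type presuming part 116 performs later)."""
    vocab, centroids = model
    vec = bow_vector(paragraph_text, vocab)
    return max(centroids, key=lambda t: cosine(vec, centroids[t]))
```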
Next, the following describes the topic-discrimination-model creation process.
Then, the topic-discrimination-model creation part 115 combines information on the subject extracted at s191, the paragraph type data 901, and the sentence-element type data 902 and thereby generates topic-discrimination-model teacher data 1201 described below (s193) which is teacher data used as a base for creating the topic discrimination model 106 according to the teacher text data 101.
Then, the topic-discrimination-model creation part 115 learns the content of each record of the topic-discrimination-model teacher data 1201 generated at s193 to create the topic discrimination model (s195).
The explanatory variable 1211 has the following subitems: “paragraph type” 1213 that stores paragraph types, “constituent word” 1214 that stores a list of the constituent words (excluding the subject) of a sentence element in the paragraph according to the paragraph type 1213, and “indirect modification word” 1215 that stores a list of the words (in other sentence elements) that have modification relationships with the sentence element.
The response variable 1212 has the item of “topic-sentence element” 1216. In the topic-sentence element 1216, if the sentence element has a topic sentence-segment, “1” is set by the user. If the sentence element does not have a topic sentence-segment, “−1” is set by the user. If whether the sentence element has a topic sentence-segment is unknown or whether the sentence element has a topic sentence-segment is going to be presumed by the topic discrimination model 106, “0” is set by the user.
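A record of the topic-discrimination-model teacher data 1201 can thus be viewed as one (explanatory variables, response variable) training pair. The dictionary keys and helper below are hypothetical names for the items 1213-1216 described above; only the field layout and the 1 / −1 / 0 convention come from the text.

```python
def to_training_pair(record: dict):
    """Split one teacher-data record into its explanatory side
    (paragraph type 1213, constituent words 1214, indirect modification
    words 1215) and its response side (topic-sentence element 1216)."""
    features = {
        "paragraph_type": record["paragraph_type"],
        "constituent_words": frozenset(record["constituent_words"]),
        "indirect_modification_words": frozenset(record["indirect_modification_words"]),
    }
    return features, record["topic_flag"]

record = {
    "paragraph_type": "Cause",
    "constituent_words": ["worn", "brake", "pad"],   # excludes the subject
    "indirect_modification_words": ["replaced"],     # words in other sentence elements
    "topic_flag": 1,  # 1: topic sentence-segment, -1: not, 0: to be presumed
}
features, label = to_training_pair(record)
```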
Note that
Next, the following describes the relationship-information generation process.
Then, the sentence-element division part 112 executes a process similar to the first sentence-element division process (a second sentence-element division process s33). Specifically, the sentence-element division part 112 restructures each paragraph in the set of sentences 5 in the text data for learning 102 into sentence elements based on the results of the second paragraph division process.
Note that the results of the second paragraph division process and the second sentence-element division process are recorded in the labeled text data 104 (the paragraph type data 901 and the sentence-element type data 902), as in the discrimination-model creation process.
Next, the paragraph-type presuming part 116 inputs information on each paragraph resulting from the division at s31 (paragraph type data 901) into the paragraph-type discrimination model 105 created in the discrimination-model creation process and thereby presumes the paragraph type of each paragraph in the text data for learning 102 (a paragraph-type presuming process s35).
The topic presuming part 117 inputs information on each paragraph (the paragraph type data 901) resulting from the division at s31 and also information on the sentence elements of each paragraph (the sentence-element type data 902) into the topic discrimination model 106 created in the discrimination-model creation process and thereby presumes the topic sentence-segment of the text data for learning 102 (a topic presuming process s37).
Then, for each word in the text data for learning 102, the word-vector calculation part 118 calculates word vectors whose feature amounts are based on the word, the paragraph type presumed at s35, and the topic sentence-segment presumed at s37 (a word-vector calculation process s39).
Then, the relationship extraction part 119 analyzes each word vector calculated at s39 to output relationship information (a relationship extraction process s41).
The following describes details of each process in the relationship-information generation process. Note that as described above, the processes in the second paragraph division process and the second sentence-element division process are the same as or similar to those in the first paragraph division process and the first sentence-element division process, respectively, and hence, description thereof is omitted.
Specifically, for example, the paragraph-type presuming part 116 obtains the contents of the sentences in each paragraph of which the paragraph type 912 of the paragraph type data 901 is “0” from the text data with paragraph information 301 and generates word vectors on the obtained sentences in each paragraph (the word vector having elements of the constituent words of the paragraph). Note that this process can be achieved, for example, by doc2vec.
Then, the paragraph-type presuming part 116 inputs the word vectors of each paragraph generated at s351 into the paragraph-type discrimination model 105 to presume the paragraph type of each paragraph (s353).
Next,
The topic presuming part 117 inputs the contents of the records of the topic-discrimination-model teacher data 1201 generated at s371 into the topic discrimination model 106 and thereby presumes a topic sentence-segment of each paragraph in each set of sentences 5 in the text data for learning 102 (s373). Specifically, for example, the topic presuming part 117 inputs the contents of the records whose topic-sentence elements 1216 are "0" among the topic-discrimination-model teacher data 1201 into the topic discrimination model 106 to presume the topic sentence-segment.
Next,
The word-vector calculation part 118 of the data extraction device 20 extracts subjects from the sentence segments in each sentence of the set of sentences 5 in the text data for learning 102 (s391). Specifically, for example, the word-vector calculation part 118 determines subjects by analyzing the contents of the sentence-segment strings 916 in the records of the text data for learning 102, among the sentence-element type data 902 of the labeled text data 104.
The word-vector calculation part 118 combines information on the subjects extracted at s391, the paragraph type data 901, and the sentence-element type data 902 and thereby generates co-occurrence-word presuming model teacher-data 1601 described later which is teacher data used as a base for creating the co-occurrence-word presuming model 108 (s393).
The word-vector calculation part 118 learns the content of each record in the co-occurrence-word presuming model teacher-data 1601 generated at s393 to create the co-occurrence-word presuming model 108 described next (s395).
Then, the word-vector calculation part 118 extracts the word vector of each word from the co-occurrence-word presuming model 108 created at s395 (s397).
Here, the following describes the co-occurrence-word presuming model teacher-data 1601 and the co-occurrence-word presuming model 108.
The explanatory variable 1611 has the following subitems: "paragraph type" 1613 that stores paragraph types, "topic-sentence element" 1614 that stores information on whether the paragraph according to the paragraph type 1613 has a topic sentence-segment, and "word" 1615 that stores a list of the words that occur in the topic sentence-segment in the case where the paragraph according to the paragraph type 1613 has one (excluding the words according to the response variable 1612 described later).
In the topic-sentence element 1614, if the paragraph according to the paragraph type 1613 has a topic sentence-segment, “1” is set, and if it does not have a topic sentence-segment, “0” is set.
The response variable 1612 has the item of “word” 1616. The word 1616 stores one word other than the words according to the words 1615 among the constituent words of the paragraph according to the paragraph type 1613.
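The expansion of one paragraph into co-occurrence-word teacher records can be sketched as follows: each constituent word takes a turn as the response variable ("word" 1616) while the remaining words, the paragraph type, and the topic flag form the explanatory side. The function name and dictionary keys are hypothetical.

```python
def make_cooccurrence_pairs(paragraph_type: str, has_topic: bool, words: list):
    """Expand one paragraph into teacher records for the co-occurrence-word
    presuming model 108: each word in turn becomes the response variable
    while the other words stay on the explanatory side."""
    pairs = []
    for i, target in enumerate(words):
        context = words[:i] + words[i + 1:]
        explanatory = {
            "paragraph_type": paragraph_type,     # item 1613
            "topic_flag": 1 if has_topic else 0,  # item 1614: 1 has / 0 has not
            "words": context,                     # item 1615
        }
        pairs.append((explanatory, target))       # target: item 1616
    return pairs
```

This mirrors the CBOW-style setup in which a word is predicted from its context, here augmented with the paragraph type and topic information.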
Next, the following describes the relationship extraction process.
Then, on the basis of the results of the clustering process at s411, the relationship extraction part 119 determines a combination of words determined to belong to the same cluster (here, assume two words) as a set of words having co-occurrence relationships and stores the determined results in the unique-expression relationship DB 107 (s413).
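As an illustrative sketch of s411 and s413 only: the text does not fix a clustering algorithm, so the stand-in below simply treats two words whose vectors are sufficiently close (by cosine similarity) as belonging to the same cluster and records the pair as having a co-occurrence relationship. The function names and the threshold are assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def extract_related_pairs(word_vectors: dict, threshold: float = 0.8):
    """Toy stand-in for the clustering at s411: word pairs whose feature
    vectors are close are taken to belong to the same cluster, and the
    pair is recorded as a set of words having a co-occurrence relationship
    (the content stored in the unique-expression relationship DB 107)."""
    pairs = []
    for (w1, v1), (w2, v2) in combinations(sorted(word_vectors.items()), 2):
        if cosine(v1, v2) >= threshold:
            pairs.append((w1, w2))
    return pairs
```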
Here,
Note that after that, the data extraction device 20 can perform a specified work analysis by inputting the created unique-expression relationship DB 107 into the analysis device 30. For example, the analysis device 30 can receive an input of information on a problem case on work from a user unskilled in the work and output analysis results 109, which are information on coping methods for the problem, based on the received information and the unique-expression relationship DB 107 (or a specified pre-trained model created based on the unique-expression relationship DB 107).
Next, the following describes a work analysis system 1 according to a second embodiment. In the work analysis system 1 according to the present embodiment, it is assumed that the set of sentences 5 are English sentences. In this case, details of the sentence-element division process and the topic-discrimination-model creation process in the work analysis system 1 are significantly different from those in the first embodiment. Hence, these processes are described in detail below.
Then, the first sentence-element division part 112 performs a syntax analysis on each sentence segment determined at s531 to determine the sentence elements (the constituent words in simple clauses except those in sub simple clauses) and replaces relative pronouns with their antecedents (s533).
The first sentence-element division part 112 determines the content of each simple clause (sentence element) determined at s533 and records the determined content in sentence-element division-result data 2201 (s535) described later.
Here, the syntax-analysis result data 2101 and the sentence-element division-result data 2201 are described below.
As above, the sentence-element division process in the second embodiment additionally includes a correction process for differences in syntax between Japanese and English, such as complex sentences including relative pronouns.
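The relative-pronoun correction at s533 can be sketched very roughly as below. A real implementation would rely on a syntax parser to identify the antecedent; this regex toy, with its hypothetical function name, only handles the simple "&lt;noun&gt;, which ..." pattern and assumes the word immediately before the comma is the antecedent.

```python
import re

def replace_relative_pronoun(sentence: str) -> str:
    """Toy correction for English complex sentences: replace a relative
    pronoun ("which") with its presumed antecedent, taken here to be the
    word directly before the comma. Real systems would use a parser."""
    return re.sub(r"(\w+), which", lambda m: f"{m.group(1)}, {m.group(1)}", sentence)
```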
The other processes are the same as or similar to those in the first embodiment.
Next, the following describes a topic-discrimination-model creation process according to the second embodiment.
Then, the topic-discrimination-model creation part 115 generates topic-discrimination-model teacher data 1201 based on information on the subjects extracted at s391 as in the first embodiment (s393).
The topic-discrimination-model creation part 115, as in the first embodiment, learns the content of each record of the topic-discrimination-model teacher data 1201 generated at s393 to create the topic discrimination model 106 (s395).
Here, the following describes the topic-discrimination-model teacher data 2401.
As above, in the topic-discrimination-model creation process of the second embodiment, the method of extracting subjects is different from the one in the first embodiment.
As has been described above, the data extraction device 20 of the present embodiment receives, from the user, an input of the type (for example, the paragraph type) of each component of each set of sentences 5 according to the teacher text data 101 and a designation of the topic portion (for example, the topic sentence-segment) in the component. The data extraction device 20 also creates a pre-trained model 140 according to the words in the set of sentences 5, the types of the components, and the topic portions. Then, the data extraction device 20 inputs the set of sentences 5 (the text data for learning) inputted by the user to the pre-trained model 140, thereby presumes the types of the components in the set of sentences 5 and the topic portions, calculates the feature amount of each word in the set of sentences 5, and extracts a plurality of words having a relationship with one another based on the calculated feature amounts.
Specifically, in sets of sentences 5 on a work field, the types of components and the ways topic portions occur generally follow limited patterns. Hence, since this input and designation are received from the user for the set of sentences 5 serving as teacher data, pre-trained models 140 that can presume the contents of a set of sentences 5 with high accuracy can be created reliably even if the amount of teacher data is small. The use of such pre-trained models 140 makes it possible to appropriately and efficiently presume the features of the words in a new set of sentences 5 (text data for learning 102) specified by the user. As a result, it is possible to extract and output the words in the set of sentences 5 that have a relationship with each other (for example, unique expressions having co-occurrence relationships). Thus, the data extraction device 20 of the present embodiment is capable of extracting characteristic data from text data appropriately and efficiently according to the work field.
The description of the above embodiments is for making it easy to understand the present disclosure, and it is not for limiting the present disclosure. The present disclosure can be changed or improved without departing from its spirit, and the present disclosure also includes equivalents thereof.
For example, although it is assumed in the present embodiments that the work field of the set of sentences 5 is a railway business, the present disclosure is applicable to other businesses involving specific work as long as the set of sentences 5 has features in terms of the component unit (paragraph type) and the topic portion (topic sentence-segment).
In addition, although it is assumed in the present embodiments that the components of the set of sentences 5 are paragraphs, other units may be employed. For example, the components may be pages, sections, chapters, clauses, or the like. The contents of the paragraph division rules 103 are changed according to what the components are.
In addition, although the types of the component units (paragraphs) of the set of sentences 5 in the present embodiments are titles 50(1), introductions 50(2), and the like, other types of paragraphs may be used as long as the paragraphs can be classified into specific types.
In addition, although it is assumed in the present embodiments that topic portions are sentence segments, topic portions may be sentences or other portions.
In addition, although the description in the present embodiments is based on the cases where the language of the text data is Japanese or English, other languages may be used.
In addition, although the unique-expression relationship DB 107 in the present embodiments is a database in which co-occurrence relationships between two words are recorded, it may be one in which relationships among three or more words are recorded.
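As a sketch of how such a database could record relationships among three or more words, co-occurring unique expressions within one topic sentence-segment can simply be counted per combination of any arity. The helper below and its sample expressions are hypothetical illustrations, not part of the embodiments.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(segments, arity=2):
    """Count co-occurring unique expressions per topic segment.

    segments: list of lists of unique expressions (one list per
    topic sentence-segment).  arity=2 reproduces a binary
    relationship DB; arity=3 records ternary relationships.
    """
    counts = Counter()
    for expressions in segments:
        for combo in combinations(sorted(set(expressions)), arity):
            counts[combo] += 1
    return counts

# Hypothetical unique expressions extracted per topic sentence-segment
segments = [
    ["rail", "fastening device", "torque wrench"],
    ["rail", "fastening device"],
    ["torque wrench", "bolt"],
]
pairs = count_cooccurrences(segments, arity=2)
triples = count_cooccurrences(segments, arity=3)
print(pairs[("fastening device", "rail")])                     # 2
print(triples[("fastening device", "rail", "torque wrench")])  # 1
```

The same counting scheme covers the binary case of the unique-expression relationship DB 107 and its n-ary generalization with one parameter change.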
In addition, although in the present embodiments the two models of the paragraph-type discrimination model 105 and the topic discrimination model 106 are built as pre-trained models that have learned the features of the set of sentences 5, the configuration of the pre-trained models to be created is not limited to this example. For example, one pre-trained model that simultaneously outputs paragraph types and the features of topic sentence-segments may be built, or three or more pre-trained models may be combined.
According to the above description of the present specification, at least the following is clear. Specifically, in the data extraction device of the present embodiments, the model creation part may include a paragraph-type-discrimination-model creation part and a topic-discrimination-model creation part, the paragraph-type-discrimination-model creation part may learn a relationship between each word in the set of sentences and the type of the component to create a paragraph-type discrimination model that has memorized the relationship between the word and the type of the component, as a first pre-trained model, the topic-discrimination-model creation part may create a topic discrimination model that has memorized a relationship among the type of the component, the words in the component, and the topic portion in the component, as a second pre-trained model, the sentence-feature presuming part may include a paragraph-type presuming part and a topic presuming part, the paragraph-type presuming part may input a specified set of sentences inputted by a user into the first pre-trained model to presume the type of each component of the specified set of sentences, and the topic presuming part may input the specified set of sentences into the second pre-trained model to presume a topic portion in each component of the specified set of sentences.
As above, since the paragraph-type discrimination model 105 that has memorized the relationship between the words and the types (paragraph types) of the components and the topic discrimination model 106 that has memorized the relationship among the type (paragraph type) of a component, the words in the component (paragraph), and the topic portion (topic sentence-segment) in the component (paragraph) are used as the pre-trained models 140, it is possible to extract the features of the teacher text data 101 accurately according to the work field.
In addition, in the data extraction device of the present embodiments, the topic-discrimination-model creation part may create, as the topic discrimination model, a model having at least each word in the component and a word having a modification relationship with the word as a feature amount, and the topic presuming part may input each word in the component of the specified set of sentences and a word having a modification relationship with the word into the topic discrimination model to presume the topic portion.
As above, since the topic portion is presumed with the topic discrimination model 106 that has at least each word in the component (paragraph) and a word having a modification relationship with the word as a feature amount, it is possible to determine the topic portion of the paragraph accurately according to the context.
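A minimal sketch of this feature design follows, assuming the modification (dependency) relations are supplied by some parser; here the relations are written by hand, and all function names and the sample sentence are illustrative assumptions.

```python
def topic_features(words, heads):
    """Build features for topic discrimination: each word plus the
    word it modifies (its head in a dependency parse).

    words: list of tokens in a sentence segment.
    heads: head index per token (-1 for the root).
    In practice the modification relations would come from a
    dependency parser; here they are given by hand.
    """
    feats = []
    for i, w in enumerate(words):
        feats.append(("word", w))
        if heads[i] >= 0:
            feats.append(("word+head", w, words[heads[i]]))
    return feats

# "tighten the loose bolt" -- "tighten" is the root,
# "the" and "loose" modify "bolt", "bolt" modifies "tighten"
words = ["tighten", "the", "loose", "bolt"]
heads = [-1, 3, 3, 0]
print(topic_features(words, heads))
```

Pairing each word with the word it modifies lets the discrimination model see local context, which is why the topic portion can be determined according to the context rather than from isolated words alone.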
In addition, in the data extraction device of the present embodiments, the word-vector generation part may learn the relationship among each word in the specified set of sentences, the type of each presumed component, and the presumed topic portion to create a co-occurrence-word presuming model that has memorized the relationship among the occurrence of the words in the specified set of sentences, the type of a component in the specified set of sentences, and a topic portion in the component, and the word-vector generation part may calculate a feature amount of each word based on the created co-occurrence-word presuming model.
As above, since the feature amount of each word is calculated based on the co-occurrence-word presuming model 108 that has memorized the relationship among the occurrence of the words in the specified set of sentences 5 (the teacher text data 101 in the document server 10), the types (paragraph types) of the components in the specified set of sentences 5, and the topic portion (topic sentence-segment) of the components, it is possible to calculate the feature amount that corresponds to the paragraph type and the topic sentence-segment which are the features of the set of sentences 5 and reflects the characteristic of each word sufficiently in the work field.
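As a simplified stand-in for the co-occurrence-word presuming model 108 (the embodiments use a pre-trained model; the counting scheme, names, and sample data below are assumptions for illustration), a feature vector per word can be built from co-occurrence counts within segments and compared by cosine similarity to find related words.

```python
import math
from collections import Counter, defaultdict

def build_word_vectors(segments):
    """Give each word a vector of co-occurrence counts with every
    other word appearing in the same segment (a simple stand-in
    for a learned co-occurrence model)."""
    vectors = defaultdict(Counter)
    for seg in segments:
        for w in seg:
            for other in seg:
                if other != w:
                    vectors[w][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical words per topic sentence-segment
segments = [
    ["rail", "bolt", "wrench"],
    ["rail", "bolt"],
    ["signal", "cable"],
]
vecs = build_word_vectors(segments)
print(round(cosine(vecs["rail"], vecs["bolt"]), 3))    # 0.2
print(round(cosine(vecs["rail"], vecs["signal"]), 3))  # 0.0
```

Words that occur in the same segments end up with similar vectors, so ranking word pairs by similarity surfaces the related words that the device extracts and outputs.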
In addition, the data extraction device of the present embodiments may further comprise an output part that outputs a plurality of the extracted words having a relationship with one another.
As above, since information on a plurality of words having a relationship with one another is outputted, the user can use the content of the output to perform work analysis or the like efficiently. For example, the user can perform data search based on the outputted information or can create a new pre-trained model for performing work analysis.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2019-184595 | Oct 2019 | JP | national |