The present invention relates to a method and a device for generating semantic representation data necessary for obtaining knowledge from text data such as a document described in a natural language.
Methods by which a computer obtains knowledge from various types of text data such as documents described in a natural language (referred to as “natural language data” hereinafter) have recently been studied and developed. Also studied and developed are methods in which the knowledge obtained in such a manner is structured and stored to generate a knowledge base and, when the computer receives a question in the natural language, the question is answered based on the knowledge base.
To appropriately obtain knowledge and answer questions using a natural language as described above, the meaning of a word included in a sentence needs to be captured hierarchically and ambiguously in a semantic analysis of natural language data by a computer. To this end, the concept of a specific representation included in natural language data has conventionally been defined hierarchically (for example, refer to Koichi Takeuchi, Alastair Butler, Iku Nagasaki, Takuya Okamura, Prashant Pardeshi, “Constructing Web-Accessible Semantic Role Labels and Frames for Japanese as Additions to the NPCMJ Parsed Corpus”, Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 3153-3161, Marseille, 11-16 May 2020).
In a semantic interpretation of natural language data, the structure of a sentence is analyzed from the viewpoint of the relationship between a predicative and its arguments, an argument being a complement necessary for the predicative to make sense (this relationship is referred to as a “predicate-argument structure”), with the predicative such as a verb or an adjective treated as the central meaning for capturing the meaning of the sentence (for example, refer to Koichi Takeuchi, Masayuki Ueno, and Nao Takeuchi, “Annotating Semantic Role Information to Japanese Balanced Corpus”, Proceedings of MAPLEX 2015, 2015).
In addition, Japanese Patent Application Laid-Open No. 2021-111303 and Japanese Patent Application Laid-Open No. 6-195383 relate to the present invention.
A meaning of a word or a meaning of a sentence in the natural language data cannot necessarily be represented appropriately in the semantic representation data obtained by the conventional semantic analysis by the computer as described above. As a result, a degree of accuracy for obtaining the knowledge from the natural language data is not sufficient, and reusability of the obtained knowledge is not sufficiently high.
Accordingly, it is desired to provide a method and the like for generating semantic representation data capable of representing the meaning of a word and the meaning of a sentence in natural language data more appropriately and sufficiently than before.
A first aspect according to the present invention is a semantic representation generation method of generating semantic representation data from a natural language including a content word and a function word, comprising:
A second aspect of the present invention is the semantic representation generation method according to the first aspect of the present invention, wherein
A third aspect according to the present invention is a semantic representation generation device generating semantic representation data from a natural language including a content word and a function word, comprising:
A fourth aspect according to the present invention is a computer-readable recording medium recording a semantic representation generation program for generating semantic representation data from a natural language including a content word and a function word, wherein
Another aspect of the present invention is obvious from the description of the above aspect of the present invention and an embodiment and a modification example thereof described hereinafter, thus the description thereof is omitted.
According to the above first aspect of the present invention, semantic representation data is obtained that represents the meaning of a word included in text data described in a natural language more appropriately than before.
According to the above second aspect of the present invention, semantic representation data is obtained that represents not only the meaning of a word but also the meaning of a sentence in text data described in a natural language more appropriately and sufficiently than before.
Both the above third aspect and the above fourth aspect of the present invention have an effect similar to the above first aspect of the present invention.
The effect of another aspect of the present invention is obvious from the description of the effect of the above aspect of the present invention and the effect of an embodiment described hereinafter, thus the description thereof is omitted.
These and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.
It is important to generate semantic representation data capable of sufficiently representing a meaning of a word and a meaning of a sentence in natural language data to increase a degree of accuracy of obtaining knowledge in establishing a knowledge base from the natural language data and achieving a question answering system in natural language. An embodiment of a device and a method for generating such semantic representation data is described hereinafter with reference to the drawings. A semantic representation generation device according to the present embodiment is typically achieved using a computer, and a semantic representation generation method according to the present embodiment is typically executed using a computer. A semantic representation generation program according to the present embodiment is used to make the computer function as the semantic representation generation device.
<1. Functional Configuration of Semantic Representation Generation Device>
As illustrated in
In such a semantic representation generation device 10, the natural language analysis part 110 reads, from the text data storage part 100, the text data that is the analysis target and is described in the natural language. In the natural language analysis part 110, the morphological analysis part 112 first performs a morphological analysis on the text data Din which has been read (referred to as “input text data” hereinafter), thereby generating data D1 in which the input text data is separated into morphemes (referred to as “spaced-writing data” hereinafter). In this morphological analysis, the part of speech and the inflected form of each morpheme included in the spaced-writing data D1 are also determined. In addition, a concept tag (also referred to as “CT” hereinafter) is provided to each morpheme in the spaced-writing data D1 with reference to the CT system table 33.
The syntax analysis part 114 performs the syntax analysis on the spaced-writing data D1 as a result of the morphological analysis, thereby generating syntax data D2 representing a structure (a dependency structure and a phrase structure) for each sentence included in the input text data Din.
The context analysis part 116 performs the context analysis on the input text data Din based on the syntax data D2 described above. Specifically, it specifies an antecedent referenced by an anaphor included in the input text data Din and specifies a pair of sentences having a discourse relation in the input text data Din, thereby generating context data representing the anaphoric relation and the discourse relation in the input text data Din. It then outputs context-syntax data D3 made up of the context data and the syntax data D2 described above. The morphological analysis part 112, the syntax analysis part 114, and the context analysis part 116 are collectively referred to as a text analysis part 246 in some cases hereinafter.
Based on the context-syntax data D3 described above and with reference to the ST system table 34, the semantic analysis part 118 provides a semantic tag (also referred to as “ST” hereinafter) to each pair of a phrase or a sequence of phrases (“the phrase or the sequence of phrases” is also referred to as “the phrase/sequence of phrases” hereinafter) and another phrase/sequence of phrases having a modification relation in the input text data Din. The semantic tag indicates semantic information representing the semantic relation between the members of the pair. The semantic analysis part 118 generates the semantic representation data 140 corresponding to the input text data Din based on the concept tag provided to each morpheme included in the syntax data D2 described above and the semantic tag provided to each pair. A semantic tag indicating the semantic information representing the semantic relation between phrases/sequences of phrases having the modification relation is also referred to as a first semantic tag hereinafter. Another type of semantic tag is provided between sentences having a discourse relation, and is described hereinafter.
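The data flow among the analysis parts described above can be sketched as follows. This is a minimal illustration in Python and not the implementation of the embodiment: the function names, the dataclass fields, the whitespace-based splitting, and the trivial dependency structure are all assumptions introduced here. Only the order of the stages and the data names (Din, D1, D2, D3, and the semantic representation data) come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Morpheme:
    surface: str
    pos: str = "unknown"                          # part of speech (placeholder value)
    concept_tags: list = field(default_factory=list)


def morphological_analysis(din, ct_table):
    """Din -> D1: separate the text into morphemes and attach concept tags (CT)."""
    # Assumption: whitespace splitting stands in for a real morphological analyzer.
    return [Morpheme(tok, concept_tags=list(ct_table.get(tok, []))) for tok in din.split()]


def syntax_analysis(d1):
    """D1 -> D2: a toy structure in which every element depends on the last one."""
    head = len(d1) - 1
    return {"morphemes": d1, "dependencies": [(i, head) for i in range(head)]}


def context_analysis(d2):
    """D2 -> D3: pair the syntax data with (here empty) context data."""
    return {"syntax": d2, "context": {"anaphora": [], "discourse": []}}


def semantic_analysis(d3, st_rule):
    """D3 -> semantic representation data: label each dependency with a semantic tag (ST)."""
    morphemes = d3["syntax"]["morphemes"]
    edges = []
    for dep, head in d3["syntax"]["dependencies"]:
        tag = st_rule(morphemes[dep], morphemes[head])
        edges.append((morphemes[dep].surface, tag, morphemes[head].surface))
    return {"nodes": [m.surface for m in morphemes], "edges": edges}


def generate(din, ct_table, st_rule):
    """Run the four stages in the order given in the description."""
    d1 = morphological_analysis(din, ct_table)
    d2 = syntax_analysis(d1)
    d3 = context_analysis(d2)
    return semantic_analysis(d3, st_rule)
```

In this sketch the CT system table and the ST provision rule are passed in as a plain dictionary and a callable, so that each stage can be replaced independently by a real analyzer.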
<2. Hardware Configuration of Semantic Representation Generation Device>
In the computer 20 having the above configuration, the auxiliary storage device 23 stores text data 32 as an analysis target, the CT system table 33, and the ST system table 34 in addition to a semantic representation generation program 31 according to the present embodiment. The auxiliary storage device 23 stores the text data 32, thus the text data storage part 100 in the semantic representation generation device 10 in
When the semantic representation generation program 31 is executed in the computer 20, the main memory 22 loads the semantic representation generation program 31, and the main memory 22 partially or wholly loads the text data 32 as the input text data Din. The CPU 21 uses the main memory 22 as an operation memory, and executes the semantic representation generation program 31 stored in the main memory 22, thereby performing a semantic representation generation process on the input text data Din stored in the main memory 22. The semantic representation data 140 corresponding to the input text data Din is generated by this semantic representation generation process. When the CPU 21 performs the semantic representation generation process, the computer 20 functions as the semantic representation generation device 10. The configuration of the computer 20 described above is only one example, thus the semantic representation generation device 10 can be achieved using various computers.
<3. CT System Table and ST System Table>
In the present embodiment, a CT system table and an ST system table described hereinafter are previously prepared, and are stored in the auxiliary storage device 23 as described above (
” (park) and a noun “
” (school) is “name of public facility”, and a higher-level concept thereof is “space”, for example. It is also recorded that a high-level concept of a noun “
” and a noun “
” (company) is “name of organization”, a higher-level concept thereof is “stand-alone organizational object”, and a still higher-level concept thereof is “stand-alone object”. That is to say, as for “
”, concept information hierarchically and ambiguously representing a meaning thereof is recorded. It is recorded that a concept representing a meaning of a postposition “
” is “state”, “operation source”, or “causal reason”, and a high-level concept of the concept thereof is “other party”, for example. That is to say, as for the postposition “
”, concept information hierarchically and ambiguously representing a meaning thereof is recorded.
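One possible in-memory shape for such a CT system table is sketched below in Python. The concept names are taken from the description above, but the layout (one mapping from a concept to its higher-level concept and one mapping from a morpheme to its candidate concepts) is an assumption, as are the function name and the keys. The keys are English glosses or placeholders rather than the original morphemes: `AMBIGUOUS_NOUN` stands for the noun recorded under both “name of public facility” and “name of organization”, and `POSTPOSITION_X` stands for the postposition discussed above.

```python
# Each concept maps to its higher-level concept (None marks the top of a chain).
HIGHER_CONCEPT = {
    "name of public facility": "space",
    "space": None,
    "name of organization": "stand-alone organizational object",
    "stand-alone organizational object": "stand-alone object",
    "stand-alone object": None,
    "state": "other party",
    "operation source": "other party",
    "causal reason": "other party",
    "other party": None,
}

# Morpheme -> candidate concepts. More than one entry expresses ambiguity.
# Keys are glosses or placeholders, not the original Japanese morphemes.
CT_TABLE = {
    "park": ["name of public facility"],
    "company": ["name of organization"],
    "AMBIGUOUS_NOUN": ["name of public facility", "name of organization"],
    "POSTPOSITION_X": ["state", "operation source", "causal reason"],
}


def concept_chains(morpheme):
    """Return one chain per candidate concept, from the concept up to its top-level concept."""
    chains = []
    for concept in CT_TABLE.get(morpheme, []):
        chain = [concept]
        while HIGHER_CONCEPT.get(chain[-1]) is not None:
            chain.append(HIGHER_CONCEPT[chain[-1]])
        chains.append(chain)
    return chains
```

A morpheme with several chains is ambiguous, and the length of each chain reflects how many levels of the hierarchy are recorded for that meaning.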
The ST system table is a table that associates, with each of a plurality of semantic tags (ST), a rule for determining the pairs of phrases/sequences of phrases to which that semantic tag should be provided. The semantic tags indicate respective pieces of semantic information, each representing a semantic relation between one phrase/sequence of phrases and another in Japanese as the natural language. As illustrated in
<4. Semantic Representation Generation Process>
As described above, the CPU 21 executes the semantic representation generation program 31 in the computer 20, thus a semantic representation generation process is performed on text data of a natural language as an analysis target document.
As illustrated in
Next, the morphological analysis is performed on the input text data Din (Step S12). As illustrated in
Subsequently, the concept tag (CT) is provided to each morpheme in the input text data Din with reference to the CT system table 33 (Step S124). As described above, recorded in the CT system table 33 is the concept information hierarchically and ambiguously representing the meaning of the morpheme used in the natural language (refer to
” (Taro went to the park.) This text is divided into seven morphemes as illustrated in
Next, the spaced-writing data D1 corresponding to the input text data Din is generated based on the delimitation of the morphemes and the provision of the part of speech and the concept tag to each morpheme in the input text data Din in Steps S122 and S124 described above (Step S126). When the spaced-writing data D1 is generated, the morphological analysis (Step S12) is finished, and the process proceeds to Step S14 in
As illustrated in ” (Taro went to the park.) does not include the sequence of phrases, but is made up of three phrases (“
” (Taro), “
” (to the park), and “
” (went)) as illustrated in
Subsequently in this syntax analysis, the syntax data D2 representing the structure (the dependency structure and the phrase structure) of each sentence included in the input text data Din is generated based on the dependency structure and the phrase structure obtained as described above (Step S146). When the syntax data D2 is generated, the syntax analysis (Step S14) is finished, and the process proceeds to Step S16 in
As illustrated in
As illustrated in
When the semantic tag is provided to the pair of text constituent elements included in the input text data Din in Step S182, the semantic representation data 140 corresponding to the input text data Din is generated next based on the concept tag provided to each morpheme in the input text data Din and the semantic tag provided to the pair of text constituent elements having the semantic relation in the input text data Din (Step S184).
For example, data of a semantic representation illustrated in ” (Taro went to the park.) illustrated in
”, “
”, and “
” as the text constituent elements are nodes, an edge is provided between the nodes semantically related to each other (between the phrases having the modification relation), the semantic tag (ST) (herein, the first semantic tag) indicating semantic information representing a semantic relation between the nodes is provided to the edge, and the concept tag (CT) indicating concept information representing each meaning is provided to each of the “
”, “
”, “
”, “
”, “
”, “
”. In the computer 20 as the semantic representation generation device 10 according to the present embodiment, the semantic representation data 140 of an appropriate data structure (a data structure appropriate for a process in a computer) corresponding to a semantic representation illustrated in
A semantic tag “lfp” indicating semantic information of “spatial terminal” representing a semantic relation between two phrases “” (to the park) and “
” (went) is provided between the two phrases (refer to
” (Taro) and “
” (went) is provided between the two phrases. However, a semantic relation represented by semantic information of “experiencer, . . . ” indicated by a semantic tag “exp” is also determined to fall under a modification relation of these phrases depending on a determination method (ST provision rule) in the ST system table 34 (refer to
” (Taro) and “
” (went).
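The semantic representation described above for the example sentence can be sketched as a small labeled graph. The following Python fragment is only an illustration: the class and function names are assumptions, the phrases are written as English glosses, and the edge direction (from the modifying phrase to the modified phrase) is an assumed convention. The concept tags of the individual morphemes are omitted because their values are not reproduced in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str   # modifying phrase (assumed direction: modifier -> modified)
    target: str   # modified phrase
    tags: tuple   # one or more semantic tags (ST) provided to the pair


# Nodes are the three phrases of the example sentence, written as English glosses.
nodes = ["Taro", "to the park", "went"]

edges = [
    Edge("to the park", "went", ("lfp",)),   # "spatial terminal"
    Edge("Taro", "went", ("agt", "exp")),    # both tags may apply under the ST provision rule
]


def tags_between(edges, source, target):
    """Return the semantic tags on the edge from source to target, or an empty tuple."""
    for edge in edges:
        if edge.source == source and edge.target == target:
            return edge.tags
    return ()
```

Storing the tags as a tuple allows a pair to carry either a single semantic tag or several candidates, as in the case of “agt” and “exp” above.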
When the semantic representation data 140 described above is generated in Step S184, the semantic analysis (Step S18) is finished. As illustrated in
A process by a well-known method or a publicly known method may be adopted to the specific process of the morphological analysis (
<5. Generation Example of Semantic Representation Data>
<5.1 First Generation Example>
” (Taro went to cheer with Hanako.) illustrated in
In the present example, the text in ”, “
”, “
”, “
”, “
”, “
”, “
”, “
”, and “○” as illustrated in
” and “
” of a postposition in this text, concept information provided with regard to “
” in the CT system table 33 does not hierarchically express a meaning thereof, however, concept information provided with regard to “
” in the CT system table 33 hierarchically and ambiguously represents a meaning thereof. That is to say, as illustrated in
”, a concept representing a meaning thereof is recorded as “result”, “comparison standard”, “cooperative party”, “citation”, or “limitation”, and an upper concept of these concepts is recorded as “other party”. As illustrated in
” in the present example. In a stage of the morphological analysis, with regard to a morpheme having ambiguous meanings such as the postposition “
”, the plurality of concept tags may be ambiguously provided to the morpheme. However, in a case where the plurality of concept tags have meanings opposite to each other, the concept tag is provided again to the morpheme in accordance with the concept tags of morphemes before and after the morpheme in the text to be analyzed when the semantic tag is provided to the pair of text constituent elements (pair of phrases etc.) in a stage of the semantic analysis (S182 in
Next, a dependency structure and a phrase structure of the text of the present example (
Subsequently, the semantic analysis (Step S18) is performed through the context analysis (Step S16). In accordance with the semantic analysis, semantic tags (herein, the first semantic tags) “agt” (or “agt, exp”), “jnt”, and “pur” are provided to the pairs of the text constituent elements semantically related to each other in the text of the present example (herein, three pairs of phrases each having a modification relation: “” (Taro) and “
” (went); “
” (with Hanako) and “
” (went); and “
” (to cheer) and “
” (went)), and the semantic representation data as illustrated in
”, “
”, “
”, “
”, “
”, “
”, “
”, and “
” and the semantic tags (ST) provided to the pairs of the text constituent elements semantically related to each other. In the semantic representation data 140, the semantic tag (“agt” or “agt, exp”) provided between the two phrases “
” (Taro) and “
” (went) is the same as the example illustrated in
” (with Hanako) and “
” (went) is a semantic tag “jnt” indicating semantic information of “cooperative participant” representing a semantic relation between the two phrases based on a determination method (an ST provision rule for determining the semantic tag ST which should be provided between a phrase and a phrase) in the ST system table 34 in
” (to cheer) and “
” (went) is a semantic tag “pur” indicating semantic information of “purpose” representing a semantic relation between the two phrases.
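How an ST provision rule might select a semantic tag from the concept tags of a postposition can be sketched as follows. The rule list is hypothetical: the actual rules of the ST system table 34 are not reproduced in the text, and the pairing of the concept “cooperative party” with the tag “jnt” is an assumption suggested by this example. The concept name “purpose” used for “pur” is likewise invented for illustration.

```python
# Hypothetical rule list: (concept tag on the postposition of the modifying phrase, semantic tag).
ST_RULES = [
    ("cooperative party", "jnt"),   # "cooperative participant"
    ("purpose", "pur"),             # concept name assumed for illustration
]


def provide_semantic_tags(postposition_concepts):
    """Return every semantic tag whose rule matches one of the given concept tags."""
    return [st for concept, st in ST_RULES if concept in postposition_concepts]


# The ambiguous postposition of this example carries several concept tags at once.
ambiguous = ["result", "comparison standard", "cooperative party", "citation", "limitation"]
candidates = provide_semantic_tags(ambiguous)
```

Under these assumed rules only “jnt” survives for the ambiguous postposition, which illustrates how concept tags kept ambiguous during the morphological analysis can be narrowed down at the stage of the semantic analysis.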
<5.2 Second Generation Example>
” (A heating wire was hot.) and “
” (A heating wire was softened.) illustrated in
In the present example, in the text in ” (A heating wire was hot.) is divided into six morphemes “
”, “
”, “
”, “
”, “
”, and “
” and a second sentence “
” (A heating wire was softened.) is divided into six morphemes “
”, “
”, “
”, “
”, “
”, and “
” as illustrated in
Subsequently, the semantic analysis (Step S18) is performed through the syntax analysis (Step S14) and the context analysis (Step S16). In accordance with the semantic analysis, a semantic tag “gnr” is provided to the pair of the text constituent elements semantically related to each other in the text (” (A heating wire) and “
” (was hot))). A semantic tag “cap” is provided to one pair of phrases (the pair of “
” (A heating wire) and “
” (was softened)) having a modification relation in the second sentence. The semantic representation data as illustrated in
”, “
”, “
”, “
”, “
”, “
”, “
”, “
”, “
”, and “
” and the semantic tag (ST) provided to the pair of the text constituent elements semantically related to each other.
In the semantic representation data 140, provided between the two phrases “” (A heating wire) and “
” (was hot) in the first sentence is a semantic tag “gnr” indicating semantic information of “general relation” representing a semantic relation between the two phrases based on a determination method in the ST system table 34 (an ST provision rule for determining the semantic tag ST which should be provided between a phrase and a phrase) illustrated in
” (A heating wire) and “
” (was softened) in the second sentence is a semantic tag “cap” indicating semantic information of “object causing event without intention” representing a semantic relation between the two phrases. Both the semantic tags “gnr” and “cap” provided herein are the first semantic tag. A semantic tag “eq” indicating semantic information of “equivalent” representing a semantic relation between the phrase “
” (A heating wire) in the first sentence and the phrase “
” in the second sentence is provided between those phrases based on the context-syntax data D3 in the present example. The semantic tag “eq” provided herein does not correspond to a modification relation, and is not the first semantic tag. A semantic tag (ST) provided to a phrase/sequence of phrases in a pair having a semantic relation regardless of presence or absence of the modification relation is referred to as a second semantic tag for convenience. Because the second semantic tag does not depend on the presence or absence of the modification relation, it can also be considered a concept that includes the first semantic tag corresponding to the modification relation.
In the semantic representation data 140, the second sentence “” (A heating wire was softened.) is determined to fall under “result” based on the context-syntax data D3 in the present example. A semantic tag “cau” indicating semantic information representing a semantic relation (cause) between the phrase “
” (was hot) corresponding to a predicative of the first sentence and the phrase “
” (was softened) corresponding to a predicative of the second sentence (an edge from “
” toward “
”) is provided between those two phrases based on the determination result and the above semantic tags “gnr”, “cap”, and “eq” provided between the phrases in the present example. The semantic tag “cau” provided herein falls under the third semantic tag described above.
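The four edges of this example and the kinds of semantic tag they fall under can be sketched as follows. The class, its field names, and the two boolean flags are assumptions made for illustration, and the phrases are English glosses. The function returns the narrowest label for each edge. According to the description, the second semantic tag can also be regarded as a broader concept that includes the first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedEdge:
    source: tuple                # (sentence number, phrase gloss)
    target: tuple
    tag: str
    modification: bool = False   # the pair has a modification relation
    discourse: bool = False      # the pair links predicatives of sentences in a discourse relation


def tag_kind(edge):
    """Return the narrowest kind of semantic tag that applies to the edge."""
    if edge.modification:
        return "first"
    if edge.discourse:
        return "third"
    return "second"


EDGES = [
    TaggedEdge((1, "A heating wire"), (1, "was hot"), "gnr", modification=True),
    TaggedEdge((2, "A heating wire"), (2, "was softened"), "cap", modification=True),
    TaggedEdge((1, "A heating wire"), (2, "A heating wire"), "eq"),
    TaggedEdge((1, "was hot"), (2, "was softened"), "cau", discourse=True),
]

kinds = {edge.tag: tag_kind(edge) for edge in EDGES}
```

Keeping the sentence number in each node distinguishes the two occurrences of the same phrase, which is what allows the “eq” edge to connect them.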
<5.3 Third Generation Example>
In the present example, the text in ” (I found a helpful book in a book store.), a second sentence “
” (The book was red and cheap.), and a third sentence “
” ((I) bought it immediately.). The semantic analysis (Step S18) is performed on the text through the morphological analysis (Step S12), the syntax analysis (Step S14), and the context analysis (Step S16) in the semantic representation generation process, thus the first sentence “
” (I found a helpful book in a book store.), the second sentence “
” (The book was red and cheap.), and the third sentence “
” ((I) bought it immediately.) are divided into morphemes (not shown in the drawings). Then, the concept tag (CT) is provided to each morpheme (refer to
In the semantic representation data 140, a semantic tag “sit” indicating semantic information of “state, condition, or case” representing a semantic relation between two phrases “” (helpful) and “
” (book) is provided between the two phrases in the first sentence based on the determination method of the ST system table 34 (the ST provision rule for determining the semantic tag ST which should be provided between the phrases) illustrated in
” (book) and “
” (found) is provided between the two phrases (corresponding to the first semantic tag). A semantic tag “loc” indicating semantic information of “spatial position” representing a semantic relation between two phrases “
” (in a book store) and “
” (found) is provided between the two phrases (corresponding to the first semantic tag). A semantic tag “agt” indicating semantic information of “behavior, acting subject having intention” representing a semantic relation between two phrases “
” (I) and “
” (found) is provided between the two phrases (corresponding to the first semantic tag). In the second sentence, a semantic tag “sit” indicating semantic information of “state, condition, or case” representing a semantic relation between two phrases “
” (book) and “
” (red) is provided between the two phrases (corresponding to the first semantic tag). A semantic tag “sit” indicating semantic information of “state, condition, or case” representing a semantic relation between two phrases “
” (book) and “
” (cheap) is also provided between the two phrases (corresponding to the first semantic tag). A semantic tag “par” indicating semantic information of “parallel relation” representing a semantic relation between two phrases “
” (red) and “
” (cheap) is provided between the two phrases (corresponding to the second semantic tag). In the third sentence, a semantic tag “obj” indicating semantic information of “object of transitive” representing a semantic relation between two phrases “
” (it) and “
” (bought) is provided between the two phrases, and a semantic tag “tim” indicating semantic information of “temporal position” representing a semantic relation between two phrases “
” (immediately) and “
” (bought) is also provided between the two phrases (both corresponding to the first semantic tag). A semantic tag “eq” indicating semantic information of “equivalent” representing a semantic relation between the phrase “
” (book) in the first sentence and the phrase “
” (book) in the second sentence is provided between the two phrases based on the context-syntax data D3 in the present example (corresponding to the second semantic tag but not corresponding to the first semantic tag). A semantic tag “corr” indicating semantic information of “anaphoric relation” representing a semantic relation between the phrase “
” (it) in the third sentence and the phrase “
” (book) in the second sentence is provided between the two phrases based on the context-syntax data D3 in the present example (corresponding to the second semantic tag but not corresponding to the first semantic tag). A semantic tag “agt” indicating semantic information of “behavior, acting subject having intention” representing a semantic relation between the phrase “
” (I) in the first sentence and the phrase “
” (bought) in the third sentence is provided between the two phrases based on the context-syntax data D3 in the present example (corresponding to the first semantic tag).
In the semantic representation data 140, the third sentence “” ((I) bought it immediately.) is determined to fall under “result” based on the context-syntax data D3 in the present example. A semantic tag “rea” indicating semantic information representing a semantic relation (cause) between the phrase “
” (found) corresponding to a predicative of the first sentence and the phrase “
” (bought) corresponding to a predicative of the third sentence is provided between those two phrases as illustrated in
” (helpful) in the first sentence and the phrase “
” (bought) in the third sentence, between the phrase “
” (red) in the second sentence and the phrase “
” (bought) in the third sentence, and between the phrase “
” (cheap) in the second sentence and the phrase “
” (bought) in the third sentence is also provided between those phrases. These semantic tags “rea” do not correspond to the first semantic tag, but correspond to the second semantic tag.
<6. Effect>
According to the present embodiment described above, the concept tag (CT) is provided to not only the morpheme of the content word such as a noun or a verb but also the morpheme of the function word such as a postposition (refer to
According to the present embodiment, the semantic analysis (
When the semantic representation data 140 generated according to the present embodiment is used for obtaining knowledge from natural language data and for a question answering system using a natural language, the accuracy of obtaining the knowledge and the reusability of the obtained knowledge can be increased.
<7. Modification Example>
The present invention is not limited to the embodiment described above, and various modifications can be made within the scope of the present invention.
For example, in the embodiment described above, the input text data Din for generating the semantic representation data 140 is text data described in Japanese. However, the semantic representation data 140 can also be generated from input text data Din described in another natural language, such as English, by a semantic representation generation device or a semantic representation generation method configured in a manner similar to that of the embodiment described above.
In the CT system table 33 used in the embodiment described above, as illustrated in
While the invention has been shown and described in detail, the foregoing description is in all aspects illustrative and not restrictive. It is therefore understood that numerous modifications and variations can be devised without departing from the scope of the invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2022-076454 | May 2022 | JP | national |