The technology disclosed in this DESCRIPTION relates to a synonym calculation technology and a semantic representation generating technology.
Technologies that enable computers to obtain knowledge from documents written in natural languages have been developed. To obtain such knowledge, it is necessary to calculate a similarity between words in a document with high accuracy and to use the similarity for creating a dictionary (see, for example, Japanese Patent Application Laid-Open No. 2019-121044).
Conventional methods of calculating a similarity between words include a method that calculates a difference between the directions of vectors defined for the words as a cosine similarity.
The method of calculating the cosine similarity, however, has a problem in that the calculated similarity between words increases as the words appear more often in the document. Thus, the accuracy of calculating the similarity is sometimes insufficient.
The present invention is directed to a similarity calculation apparatus, a semantic representation generating apparatus, a recording medium, and a similarity calculation method.
One aspect of the present invention is a similarity calculation apparatus that calculates a similarity between words, the apparatus including: a distance calculator calculating a distance between vectors defined for the words; and a determination part determining a pair of the words corresponding to the vectors to be synonyms when the distance between the vectors is smaller than or equal to a predefined distance, wherein the distance calculator calculates, as the distance between the vectors, a curvilinear surface distance between the vectors in a hyperbolic space in which the vectors are defined.
Defining each of the vectors in the hyperbolic space and calculating the distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
One aspect of the present invention is a semantic representation generating apparatus that generates semantic representation data from natural language information, the apparatus including: a concept tag system table recording concept information hierarchically and ambiguously representing meanings of morphemes of a group of word classes in a natural language; a text analysis part receiving text data described in the natural language and performing a superficial analysis including a syntax analysis on the text data to generate syntax data representing a structure of a sentence included in the text data; and a semantic analysis part generating the semantic representation data corresponding to the text data, based on the syntax data, wherein the concept tag system table is updated based on the pair of words associated with each other in the synonym dictionary, the text analysis part provides a concept tag to each of the morphemes included in the text data, based on the syntax data with reference to the concept tag system table, the concept tags indicating the concept information hierarchically representing the meanings of the morphemes, the semantic analysis part provides a semantic tag to a pair of a phrase or a sequence of phrases which correspond to a predicate and an other phrase or an other sequence of phrases which have a modification relation with the predicate, based on the syntax data, the semantic tag indicating semantic information representing a semantic relation between the pair, the phrase or the sequence of phrases and the other phrase or the other sequence of phrases being included in the text data, and the semantic analysis part generates the semantic representation data, based on the concept tag provided to each of the morphemes included in the text data and the semantic tag provided to the pair of the phrase or the sequence of phrases and the other phrase or the other sequence of phrases that are included in the text data.
Defining each of the vectors in the hyperbolic space and calculating the distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
One aspect of the present invention is a non-transitory recording medium storing a similarity calculation program that includes a plurality of commands to be executed by one or more processors and is executable by a computer. When installed in and executed by the computer, the similarity calculation program causes the computer to perform the following steps: calculating a distance between vectors defined for words; and determining a pair of the words corresponding to the vectors to be synonyms when the distance between the vectors is smaller than or equal to a predefined distance, wherein the distance between the vectors is a curvilinear surface distance between the vectors in a hyperbolic space in which the vectors are defined.
Defining each of the vectors in the hyperbolic space and calculating the distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
One aspect of the present invention is a similarity calculation method of calculating a similarity between words, the method including: calculating a distance between vectors defined for the words; and determining a pair of the words corresponding to the vectors to be synonyms when the distance between the vectors is smaller than or equal to a predefined distance, wherein the calculating includes calculating, as the distance between the vectors, a curvilinear surface distance between the vectors in a hyperbolic space in which the vectors are defined.
Defining each of the vectors in the hyperbolic space and calculating the distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
Thus, the object of the present invention is to increase the accuracy of calculating a similarity between words.
These and other objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.
Embodiments will be hereinafter described with reference to the attached drawings. Although detailed features are described in Embodiments below to explain the technology, they are mere examples and are not necessarily essential features for implementing Embodiments.
Note that the drawings are drawn in schematic form, and configurations in the drawings are appropriately omitted or simplified for convenience of the description. The mutual relationships in size and position between the configurations in the different drawings are not necessarily accurate and may be appropriately changed. Drawings other than cross-sectional views, such as plan views, are sometimes hatched to facilitate understanding of the details of Embodiments.
In the following description, the same reference numerals are assigned to the same constituent elements, and their names and functions are the same. Therefore, detailed description of such constituent elements may be omitted to avoid redundant description.
Unless otherwise specified, an expression “comprising”, “including”, or “having” a certain constituent element is not an exclusive expression for excluding the presence of the other constituent elements in the following description.
Even when the ordinal numbers such as “first” and “second” are used in the following description, these terms are used for convenience to facilitate the understanding of the details of Embodiments. The order indicated by these ordinal numbers does not restrict the details of Embodiments.
A similarity calculation apparatus, a similarity calculation program, and a similarity calculation method according to Embodiment 1 will be hereinafter described.
As illustrated in the example of
The distance calculator 12 calculates a distance between vectors, based on pieces of multidimensional vector data 2 on a plurality of input words (e.g., nouns), and outputs distance data 4 indicating a distance between the vectors. A method of calculating a distance between vectors will be described later.
The determination part 14 determines whether a pair of words corresponding to the vectors between which the distance is indicated by the distance data 4 are synonyms, based on a threshold in a storage 6 (this operation and the aforementioned operation of the distance calculator 12 are collectively referred to as similarity calculation processes). Then, the determination part 14 registers the pair of words determined to be synonyms, in a synonym dictionary 8.
The computer 40 illustrated in
The communication interface device 46 is an interface circuit for wired communication or wireless communication. The communication interface device 46 can communicate with, for example, the synonym dictionary 8 illustrated in
In the computer 40 with such a configuration, the auxiliary storage 43 stores the similarity calculation program 131. The similarity calculation program 131 is a program for implementing operations of the distance calculator 12 and the determination part 14 in
The similarity calculation program 131 may be received from a server or another computer through the communication interface device 46, or may be read from the recording medium 30 through the recording medium reading device 47.
The CPU 41 executes the similarity calculation program 131 stored in the auxiliary storage 43, using the main memory 42 as an operation memory to perform the similarity calculation processes on input text data Din stored in the main memory 42. When the CPU 41 performs the similarity calculation processes, the computer 40 functions as the similarity calculation apparatus 1.
The aforementioned configuration of the computer 40 is only one example. Thus, the similarity calculation apparatus 1 can be implemented by various computers.
Next, operations of the similarity calculation apparatus 1 according to Embodiment 1 will be described with reference to
First, the distance calculator 12 receives the pieces of multidimensional vector data 2 on a plurality of words (Step S01). Here, the piece of multidimensional vector data 2 corresponding to each of the words is defined by multidimensional Euclidean coordinates in advance.
Next, the distance calculator 12 calculates a distance between a plurality of vectors based on the pieces of multidimensional vector data 2 (Step S02).
For example, assume that a vector Wi of a word “山岳” (sangaku/mountains) and a vector Wj of a word “市街” (shigai/an urban area) to be used for calculating a similarity are expressed as below.
Here, in the conventional method of calculating a cosine similarity, the similarity between these vectors is calculated as below. The similarity is higher as the value is closer to 1.
In the calculation method indicated by Equation (3), however, each vector is defined in a flat space, and an inverse document frequency (i.e., IDF) of a word is disregarded. Thus, this calculation method has a problem of yielding an inaccurate similarity between words in a curved space, and a problem in that the similarity between words increases as the words appear more often in a document.
In Embodiment 1, each of vectors is defined in a hyperbolic space, and a distance between the vectors is calculated in consideration of a curvature of a curved surface of the hyperbolic space. This reduces the problems (measurement ambiguity caused by disregarding the curved surface and the IDF) in the conventional method, and increases the accuracy of calculating the similarity.
Specifically, a distance between Euclidean coordinates of the vectors is multiplied by a curvature of a curved surface between the vectors to calculate a curvilinear surface distance (geodetic line), and this curvilinear surface distance is used as the distance between the vectors.
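As a minimal sketch of this calculation (with hypothetical three-dimensional word vectors, since the vectors of Equations (1) and (2) are not reproduced here, and interpreting the distance between Euclidean coordinates as the Euclidean norm of the vector difference), the curvilinear surface distance may be computed as follows, with the conventional cosine similarity shown only for comparison.

```python
import numpy as np

def curvilinear_distance(wi: np.ndarray, wj: np.ndarray) -> float:
    """Curvilinear surface distance (geodetic line) between two word vectors:
    the Euclidean distance between their coordinates multiplied by the
    curvature of the curved surface between them, the curvature being taken
    as the norm of their vector (cross) product, as described above."""
    euclidean = float(np.linalg.norm(wi - wj))            # distance between Euclidean coordinates
    curvature = float(np.linalg.norm(np.cross(wi, wj)))   # norm of the vector product
    return euclidean * curvature

def cosine_similarity(wi: np.ndarray, wj: np.ndarray) -> float:
    """Conventional cosine similarity, shown only for comparison."""
    return float(np.dot(wi, wj) / (np.linalg.norm(wi) * np.linalg.norm(wj)))

# Hypothetical vectors for illustration only; they are not the vectors of Equations (1) and (2).
w_mountains = np.array([7.0, 5.0, 2.0])
w_urban_area = np.array([1.0, 6.0, 9.0])
print(cosine_similarity(w_mountains, w_urban_area))
print(curvilinear_distance(w_mountains, w_urban_area))
```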
As illustrated in the example of
The calculation of the distance X2 as the curvilinear surface distance (geodetic line) in Embodiment 1 calculates a distance between the vectors with high accuracy, and consequently increases the accuracy of calculating a similarity between words.
A distance between Euclidean coordinates of the vector Wi and the vector Wj is 62, based on Equations (1) and (2).
Since a curvature of a curved surface between the vectors Wi and Wj corresponds to a norm of a vector product of the vectors, the curvature is calculated as below based on Equations (1) and (2).
Thus, the curvilinear surface distance (geodetic line) between the vectors in consideration of the curvature of the curved surface of the hyperbolic space is calculated to be 62 × 46.2 = 2864.4. For this curvilinear surface distance, a larger value means that the vectors are more distant, which indicates that the similarity between the corresponding words is lower.
The following will describe another example of calculating a distance with another vector, both in the conventional method of calculating a cosine similarity and in the method of calculating a curvilinear surface distance (geodetic line) between the vectors in consideration of a curvature of a curved surface of a hyperbolic space according to Embodiment 1.
Here, a vector Wk of a word “河川” (kasen/rivers) is expressed as below.
Here, a cosine similarity between the vector Wi and the vector Wk is calculated to be 43/(9.899×4.583)=0.948.
In contrast, a distance between Euclidean coordinates of the vector Wi and the vector Wk is 43 based on Equations (1) and (6).
Since a curvature of a curved surface between the vectors Wi and Wk corresponds to a norm of a vector product of the vectors, the curvature is 14.457 based on Equations (1) and (6).
Thus, the distance between the vectors in consideration of the curvature of the curved surface of the hyperbolic space is calculated to be 43×14.457=621.651.
The aforementioned results show that, in the conventional method, the difference between the cosine similarity calculated between the vector Wi and the vector Wj and the cosine similarity calculated between the vector Wi and the vector Wk is not large.
In contrast, a difference between values of the distance (similarity) between the vector Wi and the vector Wj and the distance (similarity) between the vector Wi and the vector Wk is large when the values are calculated in consideration of the curvature of the curved surface of the hyperbolic space.
This is because the range of values that can be taken by distances between vectors calculated in consideration of a curvature of a curved surface of a hyperbolic space is wider than the range of values that can be taken by cosine similarities, and this wider range contributes to increasing the accuracy of calculating the similarity. In other words, the method of calculating a distance between vectors in consideration of a curvature of a curved surface of a hyperbolic space according to Embodiment 1 can increase the accuracy of calculating a similarity between words.
Next, the distance calculator 12 outputs the calculated distance between the plurality of vectors as the distance data 4 (Step S03).
Then, the determination part 14 compares a distance between the vectors indicated by the distance data 4 and a reference distance (a threshold of the distance) stored in the storage 6 in advance. When the distance between the vectors indicated by the distance data 4 is smaller than or equal to the reference distance, the determination part 14 determines a pair of words corresponding to the vectors between which the distance is indicated by the distance data 4 to be synonyms (Step S04).
Then, the determination part 14 registers the pair of words determined to be synonyms, in the synonym dictionary 8 (Step S05). The determination part 14 does not register a pair of words determined not to be synonyms, in the synonym dictionary 8.
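A minimal sketch of Steps S02 to S05, assuming hypothetical word vectors and a hypothetical reference distance (the actual threshold is stored in the storage 6 in advance), might look like the following; it is not the actual implementation of the similarity calculation program 131.

```python
import numpy as np
from itertools import combinations

def find_synonyms(word_vectors: dict, reference_distance: float) -> list:
    """For every pair of words, calculate the curvilinear surface distance
    (Step S02) and determine the pair to be synonyms when the distance is
    smaller than or equal to the reference distance (Step S04)."""
    synonym_pairs = []
    for (word_a, vec_a), (word_b, vec_b) in combinations(word_vectors.items(), 2):
        euclidean = np.linalg.norm(np.asarray(vec_a) - np.asarray(vec_b))
        curvature = np.linalg.norm(np.cross(vec_a, vec_b))
        distance = euclidean * curvature                 # corresponds to the distance data 4
        if distance <= reference_distance:
            synonym_pairs.append((word_a, word_b))       # to be registered in Step S05
    return synonym_pairs

# Hypothetical vectors and threshold, for illustration only.
pairs_for_synonym_dictionary = find_synonyms(
    {"mountains": [7.0, 5.0, 2.0], "hills": [6.5, 5.5, 2.5], "urban area": [1.0, 6.0, 9.0]},
    reference_distance=100.0,
)
```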
In Embodiment 1, defining each of vectors in a hyperbolic space and calculating a distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can increase the accuracy of calculating a similarity between words corresponding to the vectors.
A semantic representation generating apparatus according to Embodiment 2 will be described. In the following description, the same reference numerals are assigned to the same constituent elements as those in Embodiment 1, and the detailed description will be appropriately omitted.
In establishing a knowledge base from the natural language data and implementing a question answering system in natural languages, generating semantic representation data capable of sufficiently representing meanings of words and a meaning of a sentence in natural language data is important to increase the accuracy of obtaining knowledge.
The technology for generating such semantic representation data will be described.
A semantic representation generating apparatus 10 generates semantic representation data from natural language data (text data such as a document described in a natural language), and is implemented by causing a computer to execute a semantic representation generating program.
As illustrated in the example of
With reference to the synonym dictionary 8 generated by the similarity calculation apparatus 1 according to Embodiment 1, the updating part 52 updates, for example, definitions of words in the CT system table 33, the ST system table 34, and the morpheme dictionary 35 on a word extracted from a new document, or newly adds the word to the CT system table 33, the ST system table 34, and the morpheme dictionary 35.
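As a rough illustration of this update, assuming a much simplified table in which each word is mapped directly to its concept information (the actual structures of the CT system table 33, the ST system table 34, and the morpheme dictionary 35 differ), the reuse of an existing definition for a newly registered synonym could be sketched as follows.

```python
def update_concept_table(ct_system_table: dict, synonym_pairs: list) -> None:
    """For each pair of words registered as synonyms, define the word that is
    not yet in the table by copying the concept information already recorded
    for its synonym.  The table layout {word: concept information} is a
    hypothetical simplification, not the actual CT system table 33."""
    for word_a, word_b in synonym_pairs:
        if word_a in ct_system_table and word_b not in ct_system_table:
            ct_system_table[word_b] = ct_system_table[word_a]
        elif word_b in ct_system_table and word_a not in ct_system_table:
            ct_system_table[word_a] = ct_system_table[word_b]

# Illustrative usage with hypothetical entries.
table = {"mountains": ["concrete object", "terrain"]}
update_concept_table(table, [("mountains", "hills")])
```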
The natural language analysis part 110 includes a morpheme analysis part 112, a syntax analysis part 114, a context analysis part 116, and a semantic analysis part 118. In Embodiment 2, text data to be analyzed by the semantic representation generating apparatus 10 is text data described in Japanese, and is stored in a text data storage 100 provided outside of the semantic representation generating apparatus 10.
The natural language analysis part 110 reads the text data that is natural language data to be analyzed, from the text data storage 100. In the natural language analysis part 110, the morpheme analysis part 112 first performs a morpheme analysis on the read text data (referred to as “input text data” hereinafter) Din to generate data in which the input text data is separated for each morpheme (referred to as “spaced-writing data”) D1.
In this morpheme analysis, a word class or an inflected form of each of the morphemes included in the spaced-writing data D1 is also determined. In the morpheme analysis, a concept tag (also referred to as “CT” hereinafter) is provided to each of the morphemes in the spaced-writing data D1 with reference to the CT system table 33.
The syntax analysis part 114 performs a syntax analysis on the spaced-writing data D1 obtained as a result of the morpheme analysis to generate syntax data D2 representing a structure (a dependency structure and a phrase structure) of each sentence included in the input text data Din.
The context analysis part 116 performs a context analysis on the input text data Din based on the syntax data D2 described above to identify an antecedent referenced by an anaphor included in the input text data Din and to identify a pair of sentences having a discourse relation in the input text data Din. The context analysis part 116 thus generates context data representing an anaphoric relation and the discourse relation in the input text data Din, and outputs context-syntax data D3 made up of the context data and the syntax data D2 described above. The morpheme analysis part 112, the syntax analysis part 114, and the context analysis part 116 may be collectively referred to as a “text analysis part”.
The semantic analysis part 118 provides, based on the context-syntax data D3, a semantic tag (also referred to as “ST” hereinafter) indicating semantic information representing a semantic relation between a phrase or a sequence of phrases and another phrase or another sequence of phrases (also referred to as a “phrase-sequence-of-phrases pair” hereinafter) having a modification relation in the input text data Din, to the pair with reference to the ST system table 34. The semantic analysis part 118 then generates semantic representation data 140 corresponding to the input text data Din, based on the concept tag provided to each morpheme included in the spaced-writing data D1 and the semantic tag provided to the phrase-sequence-of-phrases pair in the context-syntax data D3. A semantic tag is provided also between a sentence and a sentence having a discourse relation, which will be described later.
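The data flow through the morpheme analysis part 112, the syntax analysis part 114, the context analysis part 116, and the semantic analysis part 118 can be pictured with the following skeleton; the class names and placeholder implementations are illustrative assumptions that merely thread the intermediate data D1 to D3 through the pipeline and do not perform the actual analyses.

```python
from dataclasses import dataclass, field

@dataclass
class SpacedWritingData:       # D1: morphemes, each with a word class and a concept tag (CT)
    morphemes: list = field(default_factory=list)

@dataclass
class SyntaxData:              # D2: dependency structure and phrase structure of each sentence
    dependencies: list = field(default_factory=list)

@dataclass
class ContextSyntaxData:       # D3: syntax data plus anaphoric and discourse relations
    syntax: SyntaxData = field(default_factory=SyntaxData)
    anaphora: list = field(default_factory=list)
    discourse: list = field(default_factory=list)

def morpheme_analysis(text: str) -> SpacedWritingData:        # morpheme analysis part 112
    return SpacedWritingData(morphemes=text.split())           # placeholder separation only

def syntax_analysis(d1: SpacedWritingData) -> SyntaxData:      # syntax analysis part 114
    return SyntaxData()

def context_analysis(d2: SyntaxData) -> ContextSyntaxData:     # context analysis part 116
    return ContextSyntaxData(syntax=d2)

def semantic_analysis(d1: SpacedWritingData, d3: ContextSyntaxData) -> dict:  # semantic analysis part 118
    return {"concept_tags": d1.morphemes, "semantic_tags": d3.discourse}       # semantic representation data 140

d1 = morpheme_analysis("input text data Din")
d3 = context_analysis(syntax_analysis(d1))
semantic_representation_140 = semantic_analysis(d1, d3)
```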
The computer 20 illustrated in
The communication interface device 26 is an interface circuit for wired communication or wireless communication. The recording medium reading device 27 is an interface circuit for the recording medium 30 storing, for example, a program. The recording medium 30 is, for example, a non-transitory recording medium such as a CD-ROM, a DVD-ROM, or a USB memory.
In the computer 20 having the above configuration, the auxiliary storage 23 stores text data 32 to be analyzed, the CT system table 33, the ST system table 34, and the morpheme dictionary 35 as well as the semantic representation generating program 31 according to Embodiment 2. Storing the text data 32 in the auxiliary storage 23 implements the text data storage 100 for the semantic representation generating apparatus 10 of
The semantic representation generating program 31, the text data 32, the CT system table 33, the ST system table 34, and the morpheme dictionary 35 may be received from a server or another computer through the communication interface device 26, or may be read from the recording medium 30 through the recording medium reading device 27.
When the semantic representation generating program 31 is executed by the computer 20, the semantic representation generating program 31 is loaded into the main memory 22, and the text data 32 is partially or wholly loaded into the main memory 22 as the input text data Din.
The CPU 21 executes the semantic representation generating program 31 stored in the main memory 22, using the main memory 22 as an operation memory to perform semantic representation generating processes on the input text data Din stored in the main memory 22.
Through the semantic representation generating processes, the semantic representation data 140 corresponding to the input text data Din is generated. When the CPU 21 performs the semantic representation generating processes, the computer 20 functions as the semantic representation generating apparatus 10.
The aforementioned configuration of the computer 20 is only one example. Thus, the semantic representation generating apparatus 10 can be implemented by various computers.
In Embodiment 2, the CT system table 33, the ST system table 34, and the morpheme dictionary 35 are prepared in advance, and are stored in the auxiliary storage 23 as described above (
As illustrated in
Updating the CT system table 33, the ST system table 34, and the morpheme dictionary 35 based on a similarity between words calculated in consideration of a curvature of a curved surface of a hyperbolic space (a synonym, a near-synonym, or a co-occurrence relationship between morphemes) can increase the accuracy of the definitions of each of the words, and also the accuracy of the semantic representation generating processes to be described later.
The concept information is recorded in the CT system table 33. The concept information hierarchically and ambiguously expresses meanings of morphemes of all word classes in Japanese as a natural language, that is, morphemes of content words such as nouns, verbs, and adjectives as well as morphemes of function words such as postpositions and auxiliary verbs.
As illustrated in the example of
For example, a postposition “に” (ni) is recorded to indicate “state”, “operation source”, or “causal reason” as a concept representing the meaning, and with “other party” as a high-level concept. In other words, the concept information hierarchically and ambiguously representing meanings of the postposition “に” (ni) is recorded.
The ST system table 34 is a table associating a rule for determining a pair of phrases to which one of a plurality of semantic tags should be provided with each of the semantic tags. The semantic tags indicate pieces of semantic information each representing a semantic relation between a phrase and a phrase in Japanese as the natural language.
As illustrated in the example of
In other words, according to the determination method (ST provision rule) corresponding to the semantic tag “agt” in the ST system table 34 in
The predetermined concept tag (predetermined CT) herein is specifically selected based on the CT system table 33 in accordance with semantic information representing a semantic relation of a pair of phrases to which this semantic tag “agt” should be provided. The same applies to a “predetermined CT” used for defining a determination method (ST provision rule) corresponding to another semantic tag. An appropriate CT is selected based on the CT system table 33 in accordance with semantic information representing a semantic relation of a pair of phrases to which the semantic tag should be provided.
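A simplified sketch of how an ST provision rule might be applied is shown below; the rule contents, field names, and concept tags are hypothetical assumptions and do not reproduce the actual ST system table 34.

```python
from typing import Optional

# Hypothetical ST provision rules: each semantic tag is associated with a rule
# on the concept tag (CT) and the case-marking postposition of the phrase that
# modifies the predicate.  The entries are illustrative only.
st_system_table = {
    "agt": {"required_ct": "person", "required_postposition": "ga"},    # doer, acting subject having intention
    "obj": {"required_ct": "artifact", "required_postposition": "wo"},  # object of transitive
}

def provide_semantic_tag(modifier_ct: str, modifier_postposition: str) -> Optional[str]:
    """Return the semantic tag whose ST provision rule matches the phrase
    modifying the predicate, or None when no rule matches."""
    for semantic_tag, rule in st_system_table.items():
        if (rule["required_ct"] == modifier_ct
                and rule["required_postposition"] == modifier_postposition):
            return semantic_tag
    return None

print(provide_semantic_tag("person", "ga"))  # -> "agt"
```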
The morpheme dictionary 35 describes a word class of each morpheme, a detailed word class of the morpheme, and a relationship with hierarchical classification in the CT system table 33 (described in numbers in
As described above, the CPU 21 in the computer 20 executes the semantic representation generating program 31 to perform the semantic representation generating processes on text data of a natural language as an analysis target document.
In the description hereinafter, the morpheme analysis, the syntax analysis, and the context analysis will be collectively referred to as “text analysis”. In Embodiment 2, the CPU 21 executes the semantic representation generating program 31, so that the computer 20 operates as illustrated in the examples of
As illustrated in the example of
Next, the morpheme analysis is performed on the input text data Din (Step S12). As illustrated in the example of
Subsequently, the concept tag (CT) is provided to each of the morphemes in the input text data Din with reference to the CT system table 33 (Step S124). As described above, the CT system table 33 records the concept information hierarchically and ambiguously representing the meanings of the morphemes used in the natural language (see
Provision of the concept tag to each morpheme in the input text data Din will be described with reference to
According to Steps S122 and S124, this text is divided into seven morphemes, and the concept tag (CT) is provided to each of the morphemes as illustrated in
Next, based on the delimiters between the morphemes and the word class and concept tag provided to each of the morphemes in the input text data Din in Steps S122 and S124, the spaced-writing data D1 corresponding to the input text data Din is generated (Step S126). When the spaced-writing data D1 is generated, the morpheme analysis (Step S12) is finished, and the processes proceed to Step S14 in
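The concept tag provision of Step S124 can be pictured as the following lookup; the table entries, romanized keys, and data layout are hypothetical simplifications of the CT system table 33.

```python
from typing import Dict, List, Tuple

# Hypothetical CT system table: each morpheme is mapped to concept information
# given as one or more hierarchical concept paths; listing several paths is how
# ambiguity is represented.  The entries below are illustrative only.
ct_system_table: Dict[str, List[List[str]]] = {
    "ni": [["other party", "state"],
           ["other party", "operation source"],
           ["other party", "causal reason"]],
    "Taro": [["concrete object", "person"]],
}

def provide_concept_tags(morphemes: List[str]) -> List[Tuple[str, List[List[str]]]]:
    """Step S124: attach to every morpheme the concept information recorded in
    the CT system table; morphemes without an entry receive an empty list."""
    return [(m, ct_system_table.get(m, [])) for m in morphemes]

spaced_writing_d1 = provide_concept_tags(["Taro", "ni"])
```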
In the syntax analysis (Step S14) in the example of
According to these Steps S142 and S144, for example, the dependency structure and the phrase structure are obtained as illustrated in
Subsequently, this syntax analysis generates the syntax data D2 representing a structure of each sentence included in the input text data Din (the dependency structure and the phrase structure), based on the dependency structure and the phrase structure obtained as described above (Step S146). When the syntax data D2 is generated, the syntax analysis (Step S14) is finished, and the processes proceed to Step S16 in
As illustrated in the example of
Usage of the context-syntax data D3 obtained by such anaphoric analysis and discourse structure analysis will be described in relation to second and third generation examples of the semantic representation data to be described later (see
As illustrated in
Hereinafter, phrases, sequences of phrases, and sentences will be collectively referred to as “text constituent elements”. This description will proceed, assuming that a pair of text constituent elements having a modification relation or a discourse relation are semantically related to each other.
When the semantic tag is provided to the pair of text constituent elements included in the input text data Din in Step S182, the semantic representation data 140 corresponding to the input text data Din is generated based on the concept tag provided to each morpheme in the input text data Din and the semantic tag provided to the pair of text constituent elements having the semantic relation in the input text data Din (Step S184). For example, the semantic representation data illustrated in the example of
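A much simplified, hypothetical picture of Step S184 is given below, in which the concept tags become node attributes and the semantic tags become edge attributes of a small graph; the actual data structure of the semantic representation data 140 is not specified here.

```python
from typing import Dict, List, Tuple

def build_semantic_representation(
        concept_tagged_morphemes: List[Tuple[str, str]],
        semantic_tagged_pairs: List[Tuple[str, str, str]]) -> Dict[str, list]:
    """Combine the concept tags provided to the morphemes (nodes) with the
    semantic tags provided to pairs of text constituent elements (edges).
    This graph-like layout is a hypothetical simplification of the semantic
    representation data 140."""
    return {
        "nodes": [{"morpheme": m, "concept_tag": ct} for m, ct in concept_tagged_morphemes],
        "edges": [{"from": a, "semantic_tag": st, "to": b} for a, st, b in semantic_tagged_pairs],
    }

# Illustrative usage with hypothetical tags.
semantic_representation_140 = build_semantic_representation(
    [("Taro", "person"), ("itta", "movement")],
    [("Taro wa", "agt", "itta")],
)
```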
In the computer 20 as the semantic representation generating apparatus 10 according to Embodiment 2, the semantic representation data 140 of an appropriate data structure (a data structure appropriate for processes in the computer 20) corresponding to the semantic representation illustrated in
In the semantic representation illustrated in
After the semantic representation data 140 is generated in Step S184, the semantic analysis (Step S18) is finished. As illustrated in
Processes of a well-known method or a publicly known method may be adopted for the specific processes of the morpheme analysis (
In this example, the text in
With a focus on the morphemes of the postpositions “は” (wa) and “と” (to) in this text, the concept information provided for “は” (wa) in the CT system table 33 does not hierarchically express a meaning thereof, whereas the concept information provided for “と” (to) in the CT system table 33 hierarchically and ambiguously represents a meaning thereof. In other words, as illustrated in
As illustrated in
Next, a dependency structure and a phrase structure of the text in this example (
After the context analysis (Step S16), the semantic analysis (Step S18) is performed. This semantic analysis provides semantic tags (e.g., “agt”, “jnt”, and “pur”) to the pairs of the text constituent elements semantically related to each other in the text of this example (herein, three pairs of phrases each having a modification relation). The semantic representation data as illustrated in the example of
In this semantic representation data, the semantic tag (“agt” or “agt, exp”) provided between the two phrases “太郎は” (Taro wa/Taro) and “行った” (itta/went) is the same as that in the example illustrated in
In this example, each of a first sentence “” (Dennetsusen ga kouon datta/A heating wire was hot.) and a second sentence “” (Dennetsusen ga nanka shita/A heating wire was softened.) in the text of
After the syntax analysis (Step S14) and the context analysis (Step S16), the semantic analysis (Step S18) is performed. This semantic analysis provides semantic tags (“gnr” and “cap”) to the pairs of the text constituent elements (a pair of phrases having a modification relation in the first sentence and a pair of phrases having a modification relation in the second sentence) semantically related to each other in the text of this example (
In this semantic representation data, the semantic tag “gnr” indicating semantic information “general relation” representing a semantic relation between the two phrases “” (Dennetsusen ga/A heating wire) and “” (kouon datta/was hot) in the first sentence is provided between the two phrases based on a determination method (an ST provision rule for determining the semantic tag which should be provided between a phrase and a phrase) in the ST system table 34 illustrated in
In the semantic representation data, the second sentence “” (Dennetsusen ga nanka shita/A heating wire was softened.) is determined to fall under “result” based on the context-syntax data D3 in this example. As illustrated in the example of
In this example, the semantic analysis (Step S18) is performed on the text of a first sentence “” (Watashi wa sankou ni naru hon wo honya de mitsuketa/I found a helpful book in a book store.), a second sentence “” (Hon wa akairo de yasukatta/The book was red and cheap.), and a third sentence “” (Sore wo sassoku katta/(I) bought it immediately.) in
In the semantic representation data, a semantic tag “sit” indicating semantic information “state, condition, or case” representing a semantic relation between two phrases “” (sankou ni naru/helpful) and “” (hon wo/book) is provided between the two phrases in the first sentence, based on the determination method (the ST provision rule for determining the semantic tag which should be provided between the phrases) of the ST system table 34 illustrated in
In the second sentence, the semantic tag “sit” indicating semantic information “state, condition, or case” representing a semantic relation between two phrases “” (Hon wa/book) and “” (akairo de/is red) is provided between the two phrases. The semantic tag “sit” indicating semantic information “state, condition, or case” representing a semantic relation between two phrases “” (Hon wa/book) and “” (yasukatta/is cheap) is also provided between the two phrases. A semantic tag “par” indicating semantic information “parallel relation” representing a semantic relation between two phrases “” (akairo de/is red) and “” (yasukatta/is cheap) is provided between the two phrases.
In the third sentence, the semantic tag “obj” indicating semantic information “object of transitive” representing a semantic relation between two phrases “” (Sore wo/it) and “” (katta/bought) is provided between the two phrases. Furthermore, a semantic tag “tim” indicating semantic information “temporal position” representing a semantic relation between two phrases “” (sassoku/immediately) and “” (katta/bought) is also provided between the two phrases.
A semantic tag “eq” indicating semantic information “equivalent” representing a semantic relation between the phrase “” (hon wo/book) in the first sentence and the phrase “” (hon wa/book) in the second sentence is provided between the two phrases based on the context-syntax data D3 in this example. A semantic tag “corr” indicating semantic information “anaphoric relation” representing a semantic relation between the phrase “” (Sore wo/it) in the third sentence and the phrase “” (hon wa/book) in the second sentence is provided between the two phrases based on the context-syntax data D3 in this example. The semantic tag “agt” indicating semantic information “doer, acting subject having intention” representing a semantic relation between the phrase “” (Watashi wa/I) in the first sentence and the phrase “” (katta/bought) in the third sentence is provided between the two phrases based on the context-syntax data D3 in this example.
In this semantic representation data, the third sentence “” (Sore wo sassoku katta/(I) bought it immediately.) is determined to fall under “result” based on the context-syntax data D3 in this example. A semantic tag “rea” (reason) indicating semantic information representing a semantic relation between the phrase “” (mitsuketa/found) corresponding to a predicate of the first sentence and the phrase “” (katta/bought) corresponding to a predicate of the third sentence is provided between those two phrases as illustrated in
In a similar manner, the semantic tag “rea” (reason) indicating semantic information representing a semantic relation is also provided between the phrase “” (sankou ni naru/helpful) in the first sentence and the phrase “” (katta/bought) in the third sentence, between the phrase “” (akairo de/is red) in the second sentence and the phrase “” (katta/bought) in the third sentence, and between the phrase “” (yasukatta/is cheap) in the second sentence and the phrase “” (katta/bought) in the third sentence.
Next, example advantages produced by Embodiments above will be described. Although the advantages will be described based on the specific structures whose examples are described in Embodiments above, the structures may be replaced with another specific structure whose example is described in this DESCRIPTION as long as it produces the same advantages. Specifically, although only one of the specific structures is sometimes described as a representative for convenience, the structure may be replaced with another specific structure associated with the structure described as the representative.
Such replacement may be performed across a plurality of Embodiments. Specifically, such replacement may be performed when combined structures whose examples are described in different Embodiments produce the same advantages.
According to Embodiments above, the similarity calculation apparatus 1 includes the distance calculator 12 and the determination part 14. The distance calculator 12 calculates a distance between vectors defined for words. The determination part 14 determines a pair of the words corresponding to the vectors to be synonyms when the distance between the vectors is smaller than or equal to a predefined distance. The distance calculator 12 calculates, as the distance between the vectors, a curvilinear surface distance between the vectors in a hyperbolic space in which the vectors are defined.
Under this configuration, defining each of the vectors in the hyperbolic space and calculating the distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can avoid the problem of disregarding an IDF and increase the accuracy of calculating the similarity. Since the range of values that can be taken by distances between vectors calculated in consideration of a curvature of a curved surface of a hyperbolic space is wider than the range of values that can be taken by cosine similarities under this configuration, the accuracy of calculating the similarity can be increased.
When the other structures whose examples are described in the DESCRIPTION are appropriately added to the structure above, that is, when the other structures in the DESCRIPTION which are not mentioned as the structure above are appropriately added, the same advantages can be produced.
According to Embodiments above, each of the vectors is described by Euclidean coordinates. The curvilinear surface distance is a value obtained by multiplying a distance between the Euclidean coordinates of the vectors by a curvature of a curved surface between the vectors. Under the structure, calculating the distance between the vectors in consideration of the curvature of the curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
According to Embodiments above, the curvature is represented by a norm of a vector product of the vectors. Under the structure, calculating the distance between the vectors in consideration of the curvature of the curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
According to Embodiments above, the determination part 14 associates the pair of words determined to be the synonyms with each other, and registers the pair of words in the synonym dictionary 8. Under this configuration, registering a pair of words whose similarity has been calculated with high accuracy in the synonym dictionary 8 can define the words with high accuracy when the morpheme dictionary 35, the CT system table 33, and the ST system table 34 are updated using the synonym dictionary 8.
According to Embodiments above, the semantic representation generating apparatus 10 includes the CT system table 33, the text analysis part, and the semantic analysis part 118. The CT system table 33 records concept information hierarchically and ambiguously representing meanings of morphemes of a group of word classes in a natural language. The text analysis part receives text data described in the natural language, and performs a superficial analysis including a syntax analysis on the text data to generate syntax data representing a structure of a sentence included in the text data. The semantic analysis part 118 generates the semantic representation data corresponding to the text data, based on the syntax data. The CT system table 33 is updated based on the pair of words associated with each other in the synonym dictionary 8. The text analysis part provides a concept tag to each of the morphemes included in the text data, based on the syntax data with reference to the CT system table 33, the concept tags indicating the concept information hierarchically representing the meanings of the morphemes. The semantic analysis part 118 provides a semantic tag to a pair of a phrase or a sequence of phrases which correspond to a predicate and an other phrase or an other sequence of phrases which have a modification relation with the predicate, based on the syntax data, the semantic tag indicating semantic information representing a semantic relation between the pair, the phrase or the sequence of phrases and the other phrase or the other sequence of phrases being included in the text data. The semantic analysis part 118 generates the semantic representation data, based on the concept tag provided to each of the morphemes included in the text data and the semantic tag provided to the pair of the phrase or the sequence of phrases and the other phrase or the other sequence of phrases that are included in the text data.
This configuration can provide, in the morpheme analysis (
Furthermore, the semantic analysis (
When the semantic representation data generated by this configuration is used for obtaining knowledge from natural language data or for a question answering system by a natural language, the accuracy of obtaining the knowledge and the reusability of the obtained knowledge can be increased.
According to Embodiments above, the similarity calculation program causes a computer to calculate a distance between vectors defined for words. The similarity calculation program causes the computer to determine a pair of the words corresponding to the vectors to be synonyms when the distance between the vectors is smaller than or equal to a predefined distance. The distance between the vectors is a curvilinear surface distance between the vectors in a hyperbolic space in which the vectors are defined.
Under this configuration, calculating the distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
The program may be recorded in a computer-readable removable recording medium (a non-transitory recording medium) such as a magnetic disc, a flexible disk, an optical disk, a compact disk, a Blu-Ray disc (trademark), or a DVD. A removable recording medium in which a program that performs the aforementioned functions has been recorded may be commercially distributed.
According to Embodiments above, a similarity calculation method includes calculating a distance between vectors defined for words. The method includes determining a pair of the words corresponding to the vectors to be synonyms when the distance between the vectors is smaller than or equal to a predefined distance. The calculating includes calculating, as the distance between the vectors, a curvilinear surface distance between the vectors in a hyperbolic space in which the vectors are defined.
Under this configuration, calculating the distance between the vectors in consideration of a curvature of a curved surface of the hyperbolic space can increase the accuracy of calculating the similarity.
Although Embodiments described above specify dimensions, shapes, relative arrangement relationships, and conditions for implementation of each of the constituent elements, these are in all aspects illustrative and are not restrictive of Embodiments.
Therefore, numerous modifications and equivalents that have not yet been exemplified will be devised within the scope of the technologies disclosed in the DESCRIPTION. Examples of the numerous modifications and equivalents include a case where at least one constituent element is modified, added, or omitted, and further a case where at least one constituent element in at least one of Embodiments is extracted and combined with a constituent element in another Embodiment.
In Embodiments, the input text data Din for generating the semantic representation data is text data described in Japanese. However, a semantic representation generating apparatus or a semantic representation generating method constructed similarly to those according to Embodiments can generate the semantic representation data from text data in another natural language, for example, from input text data Din that is text data in English.
As illustrated in
The ST provision rule corresponding to each semantic tag in the ST system table 34 according to Embodiments may be defined in a form different from those illustrated in the examples of
Furthermore, the ST system table 34 according to Embodiments provides the ST provision rule to a pair of text constituent elements (phrases, sequences of phrases, or sentences) semantically related in the natural language, regardless of whether the text constituent elements are text constituent elements of phrases corresponding to a predicate. Instead, the ST provision rule may be provided to a pair of text constituent elements only when one of the pair of text constituent elements semantically related in the natural language is a phrase corresponding to a predicate as with conventional provision of a deep case etc.
While the invention has been shown and described in detail, the foregoing description is in all aspects illustrative and not restrictive. It is therefore understood that numerous modifications and variations can be devised without departing from the scope of the invention.
Priority application: Japanese Patent Application No. 2023-041957, Mar 2023, JP, national.