1. Field of the Invention
The present invention relates to data processing techniques and particularly to a data processing apparatus, method, and program for evaluating a path of a weighted directed graph.
2. Description of the Related Art
Programs that convert a reading input by a user into Kanji characters are widely used for inputting Japanese character strings (for example, see patent document 1).
[Patent document 1] Japanese Laid-Open Publication No. 2004-139402
In order to improve the accuracy of converting the reading of text, which is input by a user, into text including Kanji characters, the inventor and others have developed a technique for selecting the best conversion candidate by: referring to a Kanji conversion dictionary; generating a directed graph constituted with a word including a Kanji character based on the reading of the input text; assigning scores to nodes of the directed graph, i.e., words, and to an edge between the nodes, i.e., the way the words are connected; and solving the optimal path problem of a weighted directed graph.
There is a strong need for a technique for more efficiently calculating the optimal path of a weighted directed graph in order to select a more accurate conversion candidate.
In this background, a purpose of the present invention is to provide a technique for improving the user friendliness for data entry.
An embodiment of the present invention relates to a data processing apparatus. The data processing apparatus comprises: a first weight memory unit operative to store a weight assigned to a node or an edge between two nodes in a directed graph; a second weight memory unit operative, when a weight different from the weight stored in the first weight memory unit is assigned to at least one of nodes or edges included in a combination of specific two or more edges, a combination of three or more nodes, or a combination of two nodes not in a series, to store the weight assigned to a node or edge included in the combination; a directed graph modification unit operative to duplicate, among all nodes included in a target path that includes all nodes or edges included in the combination, a node for which there is a path, other than the target path, that leads to the node and to modify the directed graph so that a node for which the path leading to the node is included in the target path is distinguished from a node for which the path leading to the node is not included in the target path, when the combination is included in the directed graph; and an evaluation unit operative to evaluate a path leading from a first node to a second node in a directed graph modified by the directed graph modification unit based on the weights read out from the first weight memory unit and the second weight memory unit.
The directed graph modification unit may delete, for one of duplicated nodes, an edge not included in the target path among edges leading to the node and delete, for the other one of the duplicated nodes, an edge included in the target path among edges leading to the node.
A weight different from the weight stored in the first weight memory unit may be assigned to the last node of the target path.
The directed graph modification unit may duplicate a node included in a target path that includes all nodes or edges included in the combination and modify the directed graph so that a node for which the path leading to the node is included in the target path, a node for which the path leading to the node is included in the target path but the path leading from the node is not included in the target path, and a node for which the path leading to the node is not included in the target path are distinguished from one another.
The directed graph modification unit may duplicate a node included in a target path that includes all nodes or edges included in the combination and modify the directed graph so that a node for which the path leading from the node is included in the target path, a node for which the path leading from the node is included in the target path but the path leading to the node is not included in the target path, and a node for which the path leading from the node is not included in the target path are distinguished from one another.
Another embodiment of the present invention relates to a data processing method. The data processing method comprises: acquiring a directed graph; assigning a weight to a node or an edge between two nodes in the directed graph; determining, when a weight different from the weight assigned in assigning the weight to the node or the edge between the two nodes in the directed graph is assigned to at least one of nodes or edges included in a combination of specific two or more edges, a combination of three or more nodes, or a combination of two nodes not in a series, whether or not the combination is included in the directed graph; duplicating, among all nodes included in a target path that includes all nodes or edges included in the combination, a node for which there is a path, other than the target path, that leads to the node and modifying the directed graph so that a node for which the path leading to the node is included in the target path is distinguished from a node for which the path leading to the node is not included in the target path, when the combination is included in the directed graph; and evaluating based on the weight a path leading from a first node to a second node in a directed graph to which a path is added.
Optional combinations of the aforementioned constituting elements, and implementations of the invention in the form of methods, apparatuses, and systems may also be practiced as additional modes of the present invention.
Embodiments will now be described, by way of example only, with reference to the accompanying drawings which are meant to be exemplary, not limiting, and wherein like elements are numbered alike in several Figures, in which:
The invention will now be described by reference to the preferred embodiments. This is not intended to limit the scope of the present invention, but to exemplify the invention.
The optimal path problem of a weighted directed graph where weights are assigned to the nodes or the edges between two nodes has an important technical meaning in many fields such as transfer guidance for vehicles, workflow management, and natural language processing. Algorithms for solving the optimal path problem of a weighted directed graph include the Viterbi algorithm.
There is only one path from X station, which is a departing station, to A station. Thus, the shortest path from X station to A station is determined. The travel time from X station to A station through the shortest path is “4” minutes. Similarly, there is only one path from X station to B station. Thus, the shortest path from X station to B station is determined, and the travel time thereof is “3” minutes.
The shortest path from X station to C station is now computed. There are two edges A-C and B-C to C station as a final destination. The shortest travel time from X station to C station via A station is obtained to be “13” minutes by adding “4” minutes for the shortest travel time from X station to A station, “1” minute for a connecting time at A station, and “8” minutes for the travel time from A station to C station. Similarly, the shortest travel time from X station to C station via B station is obtained to be “11” minutes by adding “3” minutes for the shortest travel time from X station to B station, “2” minutes for a connecting time at B station, and “6” minutes for the travel time from B station to C station. Therefore, the shortest path from X station to C station goes through B station, and the travel time thereof is “11” minutes.
There is only one edge to D station as a final destination. Therefore, the shortest path from X station to D station goes through A station, and the travel time thereof is “11” minutes. Similarly, the shortest path from X station to E station goes through B station, and the travel time thereof is “12” minutes.
As described above, the shortest path to a given station is obtained by selecting, among the one or more edges that have the given station as their destination, the edge with the shortest travel time. The travel time for each edge is obtained by adding the shortest travel time to the station at the origin of the edge, the connecting time at that station, and the travel time from that station to the given station.
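The selection rule above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment. The times for X, A, B, and C stations are those quoted in the text. The travel times for the edges A-D and B-E are back-computed from the stated totals of 11 and 12 minutes, and the absence of a connecting time at X station, the function name `shortest_times`, and the dictionary layout are assumptions made for the example.

```python
# Travel time in minutes for each edge (origin, destination).
# A-D and B-E are derived from the totals quoted in the text (11 and 12 minutes).
travel = {
    ("X", "A"): 4, ("X", "B"): 3,
    ("A", "C"): 8, ("B", "C"): 6,
    ("A", "D"): 6, ("B", "E"): 7,
}
# Connecting time in minutes at the origin station of an edge.
# Assumption: no connecting time applies at the departing station X.
connect = {"X": 0, "A": 1, "B": 2}

def shortest_times(order, start):
    """Visit stations in topological order, keeping the best time and the last station."""
    best = {start: 0}
    last = {start: None}
    for station in order:
        if station == start:
            continue
        # One candidate per edge that ends at this station.
        candidates = [
            (best[origin] + connect[origin] + minutes, origin)
            for (origin, dest), minutes in travel.items()
            if dest == station
        ]
        best[station], last[station] = min(candidates)
    return best, last

best, last = shortest_times(["X", "A", "B", "C", "D", "E"], "X")
print(best["C"], last["C"])  # 11 B
print(best["D"], last["D"])  # 11 A
print(best["E"], last["E"])  # 12 B
```

Only the best time and the last station are stored per station, which is what later allows the shortest path to be recovered by tracking back.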
In a similar manner, the shortest paths and the shortest travel times can be obtained for all the stations to the arriving station Y. The last station before Y station is found to be J station when the edge with the shortest path is obtained among edges having the arriving station Y as a final destination. The shortest path is found by tracking back the last stations in such an order of J station to H station to G station to E station to B station to X station.
A method such as the one described above dramatically reduces the amount of calculation compared to a method in which travel times are computed for all the paths from X station to Y station in order to determine the shortest path. Particularly in the field of natural language processing, such as Kanji conversion and morphological analysis, the number of word candidates that constitute text, i.e., the number of nodes, becomes huge as the text becomes longer. Thus, it is necessary to reduce the amount of calculation by employing such a method.
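The scale of the reduction can be estimated with a short calculation. The sketch below assumes a hypothetical layered graph in which every word position has the same number of candidates and each candidate is connected to every candidate at the next position. Neither the graph shape nor the function name `work_estimates` comes from the embodiment.

```python
def work_estimates(layers, width):
    """Return (number of complete paths, number of edge evaluations in the stepwise method)."""
    # Enumerating every path: one choice out of `width` at each of `layers` positions.
    paths = width ** layers
    # Stepwise method: each edge is evaluated once
    # (start -> first layer, layer -> next layer, last layer -> end).
    edge_evaluations = width + (layers - 1) * width * width + width
    return paths, edge_evaluations

for layers in (5, 10, 20):
    print(layers, work_estimates(layers, 3))
```

With three candidates per position, the number of complete paths grows exponentially with the number of positions, while the number of edge evaluations grows only linearly.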
The Viterbi algorithm is based on the premise that the optimal path to a given node can be determined using only the optimal results for the nodes immediately preceding the given node, independently of which paths led to those preceding nodes. In other words, the algorithm is applicable when the optimal path can be determined from the immediately preceding optimal results without going further back into the past.
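For the station example, this premise can be written as a recurrence. The symbols are introduced here for illustration only: d(v) denotes the shortest travel time from the departing station X to station v, c(u) the connecting time at station u, t(u, v) the travel time of the edge from u to v, and E the set of edges.

```latex
d(X) = 0, \qquad
d(v) = \min_{u \,:\, (u,v) \in E} \bigl[\, d(u) + c(u) + t(u,v) \,\bigr]
```

For C station this gives min(4 + 1 + 8, 3 + 2 + 6) = 11, in agreement with the travel time obtained above. The right-hand side refers only to d(u) for the immediately preceding stations, never to the route by which u was reached.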
However, in reality, an exceptional condition is sometimes set for a certain combination of edges or nodes. For example, it is assumed that there is a condition where an express train runs for a route from D station to H station via G station and where the travel time from D station to H station via G station is shorter than just the total of the travel time from D station to G station and the travel time from G station to H station. Such an exceptional condition cannot be expressed in the weighted directed graph shown in
As described above, when not only a weight of a node or edge is assigned to a combination of two nodes in a series but also an exceptional weight is assigned to a combination of three or more nodes, a path that passes through the combination is required to be provided separately in order to set exceptional weights to the combination.
In the directed graphs shown in
When the size of an original directed graph is large, for example, in an application to natural language processing, there is a possibility that the number of paths to be added becomes so large that the amount of calculation dramatically increases, reducing the speed of calculation.
The embodiment provides a technique for efficiently obtaining the optimal path when an exceptional weight is assigned to a combination of two or more edges, a combination of three or more nodes, or a combination of two nodes not in a series in a weighted directed graph.
In this manner, by adding a path assigned an exceptional weight, the increase of the amount of calculation is suppressed to be minimal, and the optimal path can be obtained also in consideration of an exceptional condition where not the last node but previous nodes need to be tracked back and referred to.
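As a rough illustration of adding a path that carries an exceptional weight, the sketch below uses the express train from D station to H station via G station mentioned earlier. All weights, the reduced graph, and the node name `G_express` are hypothetical, and connecting times are omitted for brevity.

```python
def shortest(edges, order, start):
    """Stepwise shortest-path computation over nodes listed in topological order."""
    best, last = {start: 0}, {start: None}
    for node in order:
        incoming = [(best[u] + w, u) for (u, v), w in edges.items() if v == node]
        if incoming:
            best[node], last[node] = min(incoming)
    return best, last

def trace(last, node):
    """Follow the stored last nodes back to the start."""
    path = []
    while node is not None:
        path.append(node)
        node = last[node]
    return path[::-1]

# Ordinary weights (hypothetical): D-G-H costs 6 + 7 = 13 minutes.
base = {("X", "D"): 5, ("X", "E"): 4, ("D", "G"): 6,
        ("E", "G"): 5, ("G", "H"): 7, ("H", "Y"): 2}
best, last = shortest(base, ["X", "D", "E", "G", "H", "Y"], "X")
print(best["Y"], trace(last, "Y"))  # 18 ['X', 'E', 'G', 'H', 'Y']

# Express condition (hypothetical): D-G-H takes 9 minutes in total.
# It is expressed by a separate path through a copy of G.
with_express = dict(base)
with_express[("D", "G_express")] = 6
with_express[("G_express", "H")] = 3
best2, last2 = shortest(with_express,
                        ["X", "D", "E", "G", "G_express", "H", "Y"], "X")
print(best2["Y"], trace(last2, "Y"))  # 16 ['X', 'D', 'G_express', 'H', 'Y']
```

In the first run the computation commits to E station as the last station before G station, so the express condition is never considered. In the second run the added path lets the same stepwise computation find the faster route.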
In the above example, a path assigned an exceptional weight is added to a directed graph. In this case, substantially the same paths exist redundantly in the directed graph. This is not a problem when searching for the optimal path. However, since substantially the same path can be reported as a different path when computing the second-best or later paths, there is a possibility that an N-best search algorithm fails. Therefore, in performing an N-best search, it is necessary to modify paths without changing the total number of paths even when a node or edge is added. A further explanation of such an algorithm is given below.
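The redundancy can be made visible by enumerating paths. The following sketch uses a small hypothetical graph in which a copy of node G, here named `G_express`, has been added for an exceptional path. The graph and the naming convention are assumptions made for the example.

```python
def all_paths(edges, start, goal):
    """Enumerate every path from start to goal in a directed acyclic graph."""
    if start == goal:
        return [[goal]]
    return [[start] + rest
            for (u, v) in edges if u == start
            for rest in all_paths(edges, v, goal)]

base = [("X", "D"), ("X", "E"), ("D", "G"), ("E", "G"), ("G", "H"), ("H", "Y")]
added = base + [("D", "G_express"), ("G_express", "H")]

before = all_paths(base, "X", "Y")
after = all_paths(added, "X", "Y")

# Map the copied node back to the original station to compare actual routes.
routes = {tuple(n.replace("_express", "") for n in p) for p in after}

print(len(before), len(after), len(routes))  # 2 3 2
```

The graph with the added path contains three paths but only two distinct routes, so an N-best search over it could report the route X-D-G-H-Y twice.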
In modifying a directed graph, a programming language, etc., based on predicate logic may be used. In this case, an exceptional weight is assigned as a condition request for a node. The condition request is provided in a procedural (predicate) manner where a truth value or another predicate is returned by using a node as an argument.
For example, it is assumed that an exceptional weight is assigned to a combination of three nodes “A-C-E” in a series in the weighted directed graph shown in
The node C that satisfies the condition request returned by the node E is duplicated, and the node C that satisfies the condition request of the predicate returned by the node C and the node C that does not satisfy the condition request are distinguished from each other. In other words, the node C (C1) preceded by the node “A” and the node C (C2) not preceded by the node “A” are distinguished from each other. Among edges leading to the node C1, an edge “B-C1” not preceded by the node “A” is deleted for the node C1. An edge “A-C2” preceded by the node “A” is deleted for the node C2. Furthermore, the node E (E1) preceded by an edge “A-C” and the node E (E2) not preceded by the edge “A-C” are distinguished from each other in a similar manner. An edge “D-E1” not preceded by the edge “A-C” is deleted for the node E1 in this case. With regard to the node E2, no edge is deleted since neither an edge “C2-E” nor an edge “D-E” is preceded by the edge “A-C.” A directed graph modified in this manner is shown in
Such an algorithm allows the node E preceded by the edge “A-C” and the node E not preceded by the edge “A-C” to be distinguished from each other so that different weights are assigned, without changing the total number of paths. In this case, an exceptional weight that is assigned to the combination of three nodes “A-C-E” in a series is assigned as the weight of the node “E1,” which is the last node of the combination.
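The duplication of the nodes C and E can be sketched as follows. The original figure is not reproduced here, so the graph is a hypothetical one: a start node S, an end node T, and the edges A-C, B-C, C-E, and D-E named in the text. The function names `split_node` and `count_paths` are assumptions. Instead of duplicating a node and then deleting edges, the sketch assigns each incoming edge to exactly one copy, which yields the same result.

```python
def split_node(edges, node, target_preds):
    """Duplicate `node` into two copies: node+'1', reached only from target_preds,
    and node+'2', reached only from the remaining predecessors."""
    n1, n2 = node + "1", node + "2"
    result = []
    for u, v in edges:
        if v == node:
            # Incoming edge: kept on exactly one of the two copies.
            result.append((u, n1 if u in target_preds else n2))
        elif u == node:
            # Outgoing edge: kept on both copies.
            result.append((n1, v))
            result.append((n2, v))
        else:
            result.append((u, v))
    return result

def count_paths(edges, start, goal):
    """Count the paths from start to goal in a directed acyclic graph."""
    if start == goal:
        return 1
    return sum(count_paths(edges, v, goal) for u, v in edges if u == start)

edges = [("S", "A"), ("S", "B"), ("S", "D"), ("A", "C"),
         ("B", "C"), ("C", "E"), ("D", "E"), ("E", "T")]

step1 = split_node(edges, "C", {"A"})      # C1 is preceded by A, C2 is not
modified = split_node(step1, "E", {"C1"})  # E1 is preceded by A-C, E2 is not

print(count_paths(edges, "S", "T"), count_paths(modified, "S", "T"))  # 3 3
print(sorted(u for u, v in modified if v == "E1"))  # ['C1']
print(sorted(u for u, v in modified if v == "E2"))  # ['C2', 'D']
```

The total number of paths is unchanged, and the copy of E that is reached only through A-C1 is available to carry a weight different from the ordinary weight of E.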
Such an algorithm allows for appropriate modification of a directed graph even when an exceptional weight is assigned to a combination of nodes “A-?-E” (note that ? is one arbitrary node). In this case, the node E needs to return a predicate indicating ‘whether the preceding node is arranged in the order “A-?”’ in response to the condition request indicating ‘whether nodes are arranged in the order “A-?-E.”’ In response to this condition request, an arbitrary node preceding the node E needs to return a predicate indicating ‘whether the preceding node is arranged in the order “A.”’
When an exceptional weight is assigned to a combination of nodes “A-*-E” (note that * is an arbitrary number of arbitrary nodes), the node E needs to return a predicate indicating ‘whether the preceding node is arranged in the order “A-*”’ in response to the condition request indicating ‘whether nodes are arranged in the order “A-*-E.”’ In response to this condition request, an arbitrary node, other than the node “A,” needs to return again the predicate indicating ‘whether the preceding node is arranged in the order “A-*.”’ Such a condition request where a wild card is used produces an unlimited combination length as a result. Thus, such a condition request is not used in a conventional method where a fixed length limit is set at the beginning. However, it can be properly used according to the embodiment.
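The chain of condition requests can be imitated with closures, where each predicate returns a truth value or another predicate to be put to the preceding node. This is only a sketch of the idea, not the predicate-logic implementation of the embodiment. The function names `expect` and `matches` are hypothetical, and the semantics of the wild card are assumed: "*" is taken to skip zero or more nodes until the label written before it appears, and it is assumed to be preceded by a literal label.

```python
def expect(pattern):
    """Return a predicate for the node at the last position of `pattern`.

    The predicate returns True (request satisfied), False (request violated),
    or another predicate to be put to the preceding node.
    """
    *head, last = pattern

    def predicate(node):
        if last == "*":
            if not head:
                return True
            if node == head[-1]:
                # The label before "*" has appeared: continue with the rest.
                return expect(head)(node)
            return predicate  # the same request is returned again
        if last != "?" and node != last:
            return False
        return expect(head) if head else True

    return predicate

def matches(path, pattern):
    """Put the request to the last node of `path` and pass it backward."""
    result = expect(pattern)
    for node in reversed(path):
        result = result(node)
        if isinstance(result, bool):
            return result
    return False  # the path ended while a request was still pending

print(matches(["A", "C", "E"], ["A", "?", "E"]))       # True
print(matches(["B", "C", "E"], ["A", "?", "E"]))       # False
print(matches(["A", "C", "D", "E"], ["A", "*", "E"]))  # True
print(matches(["B", "C", "D", "E"], ["A", "*", "E"]))  # False
```

No length limit has to be fixed in advance, because the request for "A-*" simply keeps returning itself until the node A is found or the path runs out.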
In the above example, nodes are distinguished from one another according to whether or not their preceding paths satisfy a condition. Furthermore, nodes may be distinguished from one another according to whether or not their subsequent paths satisfy a condition. In other words, in the case of analyzing a path in a forward direction, nodes are separated into three types of nodes: (1) a node whose preceding path and subsequent path satisfy a condition; (2) a node whose preceding path satisfies a condition but whose subsequent path does not satisfy a condition; (3) a node whose preceding path does not satisfy a condition. In the case of analyzing a path in a reverse direction, nodes are separated into three types of nodes: (1) a node whose subsequent path and preceding path satisfy a condition; (2) a node whose subsequent path satisfies a condition but whose preceding path does not satisfy a condition; (3) a node whose subsequent path does not satisfy a condition.
For example, in the same manner as the above example, a case is taken into consideration where an exceptional weight is assigned to a combination of three nodes “A-C-E” in a series in the directed graph shown in
Such an algorithm allows a path “A-C-E” to be separated from other paths so that different weights are assigned, without changing the total number of paths. In this case, an exceptional weight that is assigned to the combination of three nodes “A-C-E” in a series may be assigned to an arbitrary node or edge included in the combination.
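One possible reading of the three-type separation in the forward direction is sketched below for the combination “A-C-E.” The graph is hypothetical: it adds a start node S, an end node T, and a node F as a second successor of C so that the distinction by subsequent path becomes visible. The function names `split_three` and `all_paths` are assumptions.

```python
def split_three(edges, node, pred, succ):
    """Split `node` into three copies:
    node+'1': preceded by `pred` and followed by `succ`,
    node+'2': preceded by `pred` but not followed by `succ`,
    node+'3': not preceded by `pred`."""
    n1, n2, n3 = node + "1", node + "2", node + "3"
    out = []
    for u, v in edges:
        if v == node:
            out += [(u, n1), (u, n2)] if u == pred else [(u, n3)]
        elif u == node:
            out += [(n1, v), (n3, v)] if v == succ else [(n2, v), (n3, v)]
        else:
            out.append((u, v))
    return out

def all_paths(edges, start, goal):
    """Enumerate every path from start to goal in a directed acyclic graph."""
    if start == goal:
        return [[goal]]
    return [[start] + rest
            for (u, v) in edges if u == start
            for rest in all_paths(edges, v, goal)]

edges = [("S", "A"), ("S", "B"), ("A", "C"), ("B", "C"),
         ("C", "E"), ("C", "F"), ("E", "T"), ("F", "T")]
modified = split_three(edges, "C", "A", "E")

original_paths = all_paths(edges, "S", "T")
modified_paths = all_paths(modified, "S", "T")
through_c1 = [p for p in modified_paths if "C1" in p]

print(len(original_paths), len(modified_paths))  # 4 4
print(through_c1)  # [['S', 'A', 'C1', 'E', 'T']]
```

In this sketch the node C1 and the edge C1-E lie only on the path that contains “A-C-E,” so an exceptional weight placed on either of them affects that path alone, while the total number of paths stays at four.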
The input data reception unit 32 receives text data input by a user via the user interface 20. The input data reception unit 32 may acquire text to be analyzed from, for example, other apparatuses or storage media. The directed graph generation unit 34 generates, from the text data received by the input data reception unit 32 and with reference to dictionary data stored in the dictionary memory unit 36, a directed graph having a part of speech of each word as a node. The dictionary memory unit 36 stores a dictionary in which spellings of words, parts of speech, the probability of occurrence of each word in each part of speech, and the like are associated with one another.
The directed graph generation unit 34 searches a dictionary starting from the first word of the input text, acquires the parts of speech of the words registered in the dictionary, and generates nodes for respective parts of speech. In this example, a “noun” is registered for the word “time” as the part of speech. Thus, a node corresponding to the part of speech is generated. Two parts of speech “noun” and “verb” are registered for the subsequent word “flies.” Thus, two nodes corresponding to the respective parts of speech are generated. In this manner, the words are extracted starting from the beginning, and nodes are generated.
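The node generation described above can be sketched as follows. The dictionary holds only the two entries mentioned in the text, and the tuple layout of a node, the function name `build_graph`, and the rule that every node of one word is connected to every node of the next word are assumptions made for the example.

```python
# Hypothetical dictionary containing only the entries mentioned in the text.
dictionary = {
    "time": ["noun"],
    "flies": ["noun", "verb"],
}

def build_graph(words):
    """Create one node per (position, word, part of speech) and link adjacent words."""
    layers = [[(i, word, pos) for pos in dictionary[word]]
              for i, word in enumerate(words)]
    nodes = [node for layer in layers for node in layer]
    # Assumption: each node is connected to every node of the following word.
    edges = [(u, v)
             for current, following in zip(layers, layers[1:])
             for u in current for v in following]
    return nodes, edges

nodes, edges = build_graph(["time", "flies"])
for node in nodes:
    print(node)
print(len(edges))  # 2
```

One node is generated for “time” and two for “flies,” so the two words alone already yield two candidate paths.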
In
Once a directed graph is generated, the optimal combination of parts of speech can be selected by solving the previously-mentioned optimal path problem. In
The first weight memory unit 44 stores a weight assigned to a node or an edge between two nodes in a directed graph.
When a weight, which is different from the weight computed from the weight assigned to a node or edge stored in the first weight memory unit 44, is assigned to a combination of two or more edges, a combination of three or more nodes, or a combination of two nodes not in a series, the second weight memory unit 45 stores the weight assigned to the combination.
The directed graph acquisition unit 41 acquires a directed graph generated by the directed graph generation unit 34. The directed graph memory unit 46 stores the directed graph acquired by the directed graph acquisition unit 41.
When a directed graph stored in the directed graph memory unit 46 includes a combination of nodes or edges for which an exceptional weight is stored in the second weight memory unit 45, the directed graph modification unit 42 modifies the directed graph so that a path going through the nodes or edges included in the combination is distinguished from other paths. The algorithm for modifying the directed graph is as described above.
The evaluation unit 43 evaluates a path leading from a first node to a second node in a directed graph to which a path is added by the directed graph modification unit 42 based on the weights read out from the first weight memory unit 44 and the second weight memory unit 45. The evaluation unit 43 selects the optimal path among multiple paths leading from the first node to the second node based on the weight.
As explained in
When selecting the optimal path from the first node to a given node, the evaluation unit 43 selects, among one or more edges ending at the node, an edge providing the optimal path from the first node to the node based on both the weights assigned to the edges and the weights of the optimal paths from the first node to the originating nodes of the edges. Based on both the weight of the optimal path from the first node to the originating node of a selected edge and the weight assigned to the selected edge or the given node, the weight of the optimal path from the first node to the given node is computed. The weight of a path may be, for example, the sum of the weights assigned to the edges and nodes included in the path. The weight may also be computed by other arithmetic expressions.
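A minimal sketch of such an evaluation is given below. It assumes that the weight of a path is the sum of its node and edge weights, that a smaller weight is better, and that a weight found in the second store replaces the ordinary weight of the same node. The two dictionaries stand in for the first and second weight memory units, and the graph, the weights, and the names `base` and `evaluate` are hypothetical.

```python
def base(node):
    """Map a duplicated node such as 'C1' back to its original label 'C'."""
    return node.rstrip("0123456789")

def evaluate(edges, order, start, goal, first, second):
    """Select the path with the smallest total weight from start to goal."""
    def weight(key):
        # A weight in the second store takes precedence over the first store.
        if key in second:
            return second[key]
        if isinstance(key, tuple):
            return first.get((base(key[0]), base(key[1])), 0)
        return first.get(base(key), 0)

    best, last = {start: weight(start)}, {start: None}
    for node in order:  # nodes in topological order
        incoming = [(best[u] + weight((u, v)) + weight(v), u)
                    for u, v in edges if v == node]
        if incoming:
            best[node], last[node] = min(incoming)
    path, node = [], goal
    while node is not None:  # track back the stored last nodes
        path.append(node)
        node = last[node]
    return best[goal], path[::-1]

# Modified graph in which C and E have been duplicated for the combination A-C-E.
edges = [("S", "A"), ("S", "B"), ("S", "D"), ("A", "C1"), ("B", "C2"),
         ("C1", "E1"), ("C2", "E2"), ("D", "E2"), ("E1", "T"), ("E2", "T")]
order = ["S", "A", "B", "D", "C1", "C2", "E1", "E2", "T"]

first = {"A": 2, "B": 1, "C": 3, "D": 6, "E": 5,
         ("A", "C"): 1, ("B", "C"): 1, ("C", "E"): 2, ("D", "E"): 2}
second = {"E1": 1}  # exceptional weight for E when preceded by A-C

print(evaluate(edges, order, "S", "T", first, {}))
# (12, ['S', 'B', 'C2', 'E2', 'T'])
print(evaluate(edges, order, "S", "T", first, second))
# (9, ['S', 'A', 'C1', 'E1', 'T'])
```

With the second store empty, the path through B is selected. Once the exceptional weight is stored for E1, the same stepwise selection prefers the path through A and C1.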
The method of the embodiment allows for flexible setting of a condition since a combination of edges or nodes assigned an exceptional weight can be added to a directed graph even after the directed graph is generated. Even when an exceptional condition is provided, the optimal path can be obtained by the Viterbi algorithm. Thus, the amount of calculation and the time required for calculation can be reduced to a large extent. The modification of a path may be performed while the directed graph generation unit 34 generates a directed graph, or it may be performed while the evaluation unit 43 evaluates the weight of a path.
Described above is an explanation based on the embodiments of the present invention. These embodiments are intended to be illustrative only, and it will be obvious to those skilled in the art that various modifications to constituting elements and processes could be developed and that such modifications are also within the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2008-208916 | Aug 2008 | JP | national |