Not Applicable
Not Applicable
The listing main.txt, created on Dec. 1, 2010 with a size of 54,654 bytes, contains an implementation of an enhanced operator precedence parser. The implementation language is Prolog. ‘matrix_next’ contains the top-level function implementing the algorithm. ‘matrix_select_operators’ selects the operator to apply. ‘matrix_apply_operator_one’ applies the selected operator. ‘selector’ is used to select the arguments. ‘matrix_apply_evaluator’ applies the selected operator to the selected arguments.
An operator-precedence parser is disclosed which incorporates enhancements that facilitate analysis of human languages. In extant models for operator precedence parsing (such as the Shunting Yard algorithm), an operator is assigned a priority number. That number determines how the parser will build the structure of the expression. The parser applies the operators in order of precedence. Once an operator is applied, the result is a terminal symbol. Operators are applied highest priority first until a final single value is produced. This is typically used for analyzing arithmetic expressions. The enhancements advanced here extend this approach to handle unrestricted natural languages. The enhancements include allowing the result of applying an operator to be another operator; allowing elements to have one priority as an operator and another priority as an operand; allowing operands to have their priority determined by context; and allowing a series of priorities to be specified for operators. This series of enhancements enables analysis of sentences that are more complex than can typically be handled by declaration-based parsers.
The invention disclosed herein relates to the creation of structured data from plain text, and more particularly to a processor that creates structured data from plain text using an enhanced operator precedence parser.
There is a growing desire to give every person access to the power of computers. This requires computer systems that are able to interact using languages natural to their users. Typical approaches to analyzing the structure of languages use declarative grammar-based approaches, which fail to address the complex linguistic structures that commonly occur.
Time has shown that declarative grammar-based approaches such as unification grammars are not easily extended to effectively handle common problems such as conjunction, incompleteness and multiple languages. This is the case even though a large number of resources have been applied to the declarative grammar approach.
Early in the development of computer science, an alternative approach for analyzing expression structure was defined: operator precedence parsing. Parsers of this type are typically used to convert expressions in infix notation into Reverse Polish notation for evaluation, as used in hand calculators. A well-known implementation is the Shunting Yard algorithm developed by Edsger Dijkstra.
The configuration of an operator precedence parser contains associations between operators and control parameters. A control parameter is a 3-tuple comprising a numeric priority, an argument type and semantics for applying the operator. The argument type specifies how the operator selects its arguments. For example, the ‘+’ operator selects one argument from the left side and one argument from the right side of the operator in the expression; the unary ‘−’ operator selects one argument from the right side of the operator in the expression. The semantic for applying the operator is the definition of how to calculate the result of applying the operator to the arguments.
What follows is a worked example for the operator precedence parser. The table in
The simplest algorithm for implementing an operator precedence parser would be
1. Look up each token in the lexicon and augment it with its control definition in order to create the list of elements. This list is called Es.
2. For each token, look up the control in the lexicon.
3. While there is more than one element in the list Es
4. Emit the list Es as the result
What follows is a worked example for the input “6+5*−2” assuming steps one and two have been performed.
The calculation is now complete.
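The reduction loop of this example can be sketched in a few lines of Python. This is a minimal illustration only, not the Prolog listing: the `Control` class, the `LEXICON` table, its priority numbers, and the `neg` token standing in for the unary ‘−’ are all assumptions made for the sketch, and the unary minus is assumed to have been distinguished from the binary one before the loop starts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Control:
    priority: int        # higher numbers are applied first (assumed convention)
    arg_type: str        # "binary": one argument on each side; "prefix": one on the right
    semantic: Callable   # computes the result from the selected arguments


# Hypothetical lexicon; the priority numbers are illustrative assumptions.
LEXICON = {
    "+": Control(1, "binary", lambda a, b: a + b),
    "*": Control(2, "binary", lambda a, b: a * b),
    "neg": Control(3, "prefix", lambda a: -a),
}


def evaluate(tokens):
    # Steps 1 and 2: replace each operator token by its control; numbers stay as they are.
    es = [LEXICON.get(t, t) for t in tokens]
    # Step 3: apply the highest-priority operator until one element remains.
    while len(es) > 1:
        ops = [i for i, e in enumerate(es) if isinstance(e, Control)]
        if not ops:
            break  # no operator left to apply
        # max() returns the first maximum, so ties are resolved left to right.
        i = max(ops, key=lambda k: es[k].priority)
        op = es[i]
        if op.arg_type == "binary":
            es[i - 1:i + 2] = [op.semantic(es[i - 1], es[i + 1])]
        else:
            es[i:i + 2] = [op.semantic(es[i + 1])]
    # Step 4: emit the list Es.
    return es


print(evaluate([6, "+", 5, "*", "neg", 2]))  # [-4]
```

The sketch rescans the whole list on every pass, which is simple rather than efficient; the Shunting Yard algorithm reaches the same result in a single left-to-right pass.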
This example illustrates the main control structure, which comprises a priority for determining when to apply an operator, a type for determining what to apply the operator to, and a semantic for calculating the result of applying the operator. What follows is an invention that extends the features of the operator precedence parser so that it can handle constructs that arise in natural languages used by humans.
The basic operator precedence parser specifies a priority and a semantic for each operator. The priority is a number that is used to determine the order in which operators are applied. The semantic is a method that provides the steps to perform a calculation on the arguments. In the enhanced operator precedence parser presented here, the priority and semantic specifications are extended. The extended specification is known as the control definition. The control is a list of single control definitions.
This invention extends the control definition for operators as follows. For reference to the simple control definitions see
The priority is extended from being a single number to a pair. There is one priority for the element as an operator and one priority for the element as an operand. When selecting the highest priority operator, the operator priority is used. When the element is being selected as an operand for another operator, its priority as an operand is used. A subsequent example will demonstrate verbs and adverbs being defined as operators. An adverb will take a verb as an operand and will evaluate to a verb-type operator. When the priority as an operator and the priority as an operand are the same, only one number will be shown in the table.
The type is extended to provide operators for selecting arguments from the matrix containing the current set of elements. The type is a pair of selection operators: one for selection from the left of the operator and one for selection from the right of the operator. The selection commands are defined in
After the operator to apply has been chosen based on the priority and its arguments have been selected using the selectors, the evaluator is used to calculate the value of applying the operator to the arguments.
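The three parts of the extended control definition (priority pair, selector pair, evaluator) can be pictured with the following Python sketch. Everything named here is hypothetical: the `Element` fields, the `take_one` and `take_none` selectors, the priority numbers, and the sample words are assumptions for illustration, and the actual selection commands are those defined in the referenced table. The operand priority is only stored in this sketch, since how the selectors consult it is not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Optional


@dataclass
class Element:
    value: Any
    op_priority: Optional[int] = None        # priority as an operator; None: never applied
    operand_priority: int = 0                # priority as an operand (stored only here)
    select_left: Optional[Callable] = None   # selector for the left side
    select_right: Optional[Callable] = None  # selector for the right side
    evaluator: Optional[Callable] = None     # computes the result element


# Hypothetical selectors. Each receives the elements on one side of the
# operator, nearest first, and returns the ones it selects.
def take_none(side: List[Element]) -> List[Element]:
    return []


def take_one(side: List[Element]) -> List[Element]:
    return side[:1]


def adverb_evaluator(op, left, right):
    # The result keeps the verb's control, so it is itself a verb-type operator.
    verb = left[0]
    return replace(verb, value=(verb.value, op.value))


def apply_highest(es: List[Element]) -> List[Element]:
    """Apply the element with the highest operator priority once."""
    ops = [i for i, e in enumerate(es) if e.op_priority is not None]
    i = max(ops, key=lambda k: es[k].op_priority)
    op = es[i]
    left = op.select_left(es[:i][::-1])[::-1]   # restore left-to-right order
    right = op.select_right(es[i + 1:])
    result = op.evaluator(op, left, right)
    return es[:i - len(left)] + [result] + es[i + 1 + len(right):]


runs = Element("runs", op_priority=5, operand_priority=5,
               select_left=take_one, select_right=take_one)
quickly = Element("quickly", op_priority=9, operand_priority=9,
                  select_left=take_one, select_right=take_none,
                  evaluator=adverb_evaluator)

out = apply_highest([runs, quickly])
```

After the call, `out` holds a single element whose value pairs the verb with the adverb and whose operator priority is still that of the verb, which illustrates how the result of applying an operator can itself be an operator.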
The matrix in which the processing is performed can contain any type of element, as long as the semantics can process it. One such type is the linguistic frame, which is shown here for illustrative purposes. Examples of elements that could be in the matrix are phonemes, characters, words, and function calls. Although the examples demonstrate parsing and tokenization, other stages of analysis can be performed in this model as well.
In extant implementations, the linguistic frame is used to represent the structure of the utterance. The linguistic_frame has the form linguistic_frame(Type, Word, Cases). Cases is a list that associates a case name with a value. A sample linguistic_frame would be
linguistic_frame(verb, is, [subject-(the sky), direct_object-blue]).
This would correspond to a sentence such as “The sky is blue”.
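For readers more familiar with record types than with Prolog terms, the same frame can be written as a small Python data class. The class name `LinguisticFrame` and the helper `case` are assumptions introduced for this sketch; only the three fields and the sample values come from the term above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LinguisticFrame:
    type: str                    # e.g. "verb"
    word: str                    # the head word of the frame
    cases: Dict[str, Any] = field(default_factory=dict)  # case name -> value


def case(frame: LinguisticFrame, name: str) -> Any:
    """Return the value filed under a case name, or None if the case is absent."""
    return frame.cases.get(name)


# "The sky is blue"
frame = LinguisticFrame("verb", "is",
                        {"subject": "the sky", "direct_object": "blue"})
print(case(frame, "subject"))  # the sky
```

A case value could itself be another frame, which would let nested structures be represented in the same way.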
The final piece is to update the algorithm that converts the input into a value. The generalized method is as follows.
Input: A sequence of characters.
Output: A sequence of elements
Method:
1. Look up each token in the lexicon and augment it with its control definition in order to create the matrix of elements. This matrix is called Es.
2. Set a list of operators to empty. This is called O.
3. Find the element in Es with the highest operator priority not in O. This is called E.
4. Find the element with the highest operator priority
5. If such an element exists remove the first control definition and replace it in the list Es. Proceed to step 2.
6. Emit the current sequence as the output.
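One possible reading of this method is sketched below in Python. It is an interpretation under stated assumptions, not the listing: the list O is taken to hold operators that were tried and could not be applied in the current pass, a successful application returns control to step 2, and the series of control definitions per element is left out. The `Elem` class, the `absorb_noun_right` operator, the priorities, and the sample words are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Elem:
    value: str
    kind: str
    priority: Optional[int] = None     # operator priority; None for plain operands
    apply: Optional[Callable] = None   # apply(es, i) -> new list, or None on failure


def absorb_noun_right(es: List[Elem], i: int) -> Optional[List[Elem]]:
    # Hypothetical operator: combine with a noun immediately to the right.
    if i + 1 < len(es) and es[i + 1].kind == "noun":
        merged = Elem(es[i].value + " " + es[i + 1].value, "noun")
        return es[:i] + [merged] + es[i + 2:]
    return None  # not applicable in this context


def run(es: List[Elem]) -> List[Elem]:
    while True:
        skipped = set()  # step 2: the list O, kept here as a set of positions
        while True:
            # Step 3: highest operator priority among elements not in O.
            candidates = [i for i, e in enumerate(es)
                          if e.priority is not None and i not in skipped]
            if not candidates:
                return es  # step 6: nothing left to apply
            i = max(candidates, key=lambda k: es[k].priority)
            new_es = es[i].apply(es, i)
            if new_es is not None:
                es = new_es
                break      # applied: proceed to step 2 with O reset
            skipped.add(i)  # could not apply: try the next candidate


result = run([
    Elem("the", "det", 9, absorb_noun_right),
    Elem("blue", "adj", 5, absorb_noun_right),
    Elem("sky", "noun"),
])
print([e.value for e in result])  # ['the blue sky']
```

In this run the determiner has the highest priority but cannot apply on the first pass, so it is placed in O and the adjective is applied instead; on the next pass O is empty again and the determiner succeeds. The sketch assumes every successful application shortens the list, which is what guarantees that the loop ends.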
What follows is a worked example of the enhanced operator precedence parser. Assume the lexicon is shown in the table in
The first iteration of the loop produces the matrix shown in
2. The list O is set to empty
3. The element with highest priority is number C3.
The second iteration of the loop produces the matrix shown in
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is element D6.
3. Find the element in Es with the highest operator priority not in O. This is called element D4.
The third iteration of the loop produces the matrix shown in
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is element E5.
3. Find the element in Es with the highest operator priority not in O. This is called element E9.
E. The result of applying element E9 to element E10 has a control that is the same as that of element E10. This is denoted by the absorb(after) term. The value is the value of element E10. This is denoted by the select(1) term, which selects the Nth argument (here, the first). Note that the value can be more sophisticated and contain information about the determiner. For simplicity this is not shown. The result value is at position F9.
The fourth iteration of the loop produces the matrix shown in
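The effect of the absorb(after) and select(1) terms in this step can be illustrated with a short Python sketch. The function names mirror the terms in the text, but their signatures, the `Element` class, the string placeholders used for controls, and the sample words are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Element:
    value: Any
    control: str  # placeholder standing in for a full control definition


def select(n: int) -> Callable[[List[Element]], Any]:
    """Value term: the result value is that of the n-th argument (1-based)."""
    return lambda args: args[n - 1].value


def absorb_after(right_args: List[Element]) -> str:
    """Control term: the result takes the control of the element after the operator."""
    return right_args[0].control


def apply_determiner(det: Element, right_args: List[Element]) -> Element:
    # The determiner itself contributes nothing to the result in this simplified form.
    return Element(value=select(1)(right_args),
                   control=absorb_after(right_args))


the = Element("the", "determiner-control")  # plays the role of element E9
sky = Element("sky", "noun-control")        # plays the role of element E10
result = apply_determiner(the, [sky])
print(result)
```

Here `result` carries both the value and the control of the noun, so in later iterations it behaves exactly as the noun would have.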
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is element F5.
3. Find the element in Es with the highest operator priority not in O. This is called element F3.
The fifth iteration of the loop produces the matrix shown in
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is element G4.
3. Find the element in Es with the highest operator priority not in O. This is called element G7.
The sixth iteration of the loop produces the matrix shown in
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is element H4.
3. Find the element in Es with the highest operator priority not in O. This is called element H2.
The eighth iteration of the loop produces the matrix shown in
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is element I3.
3. Find the element in Es with the highest operator priority not in O. This is called element I4.
The ninth iteration of the loop produces the matrix shown in
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is called element J3.
The tenth iteration of the loop produces the matrix shown in
2. The list O is set to empty.
3. Find the element in Es with the highest operator priority not in O. This is called element K2.
Further processing is not shown. The fact frame evaluator can be used to map the linguistic structures into data structures that allow for further processing that could for example perform an action or record a fact.
What follows is an example of using the enhanced operator precedence parser to tokenize an input. The parser can tokenize and, at the same time, perform higher-level analysis as in the previous example. The lexicon for the tokenization example is shown in
For the following input, step one is applied where each character is a token
“Jane jumped”
After applying Step 1 of the method the matrix of elements looks like the table shown in
For the first pass, the highest priority operator is element B5. The selector selects elements B1 to B4. The default evaluator uses wrap to group the arguments in a list as the new value. The control for the new value is to replace. The result of the first iteration is shown in
The processing then continues with a lookup in the dictionary. These definitions can be combined with the previous example to produce a system that can process input starting from a bare utterance. What comprises separate tokenization and parsing stages in typical models is thus integrated into a single model.
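The first tokenization pass described above can be sketched in Python as follows. The sketch makes several assumptions that the text does not state: the space character is the only operator, its selector takes the contiguous characters to its left, and the space itself is consumed when the wrapped list replaces the selected elements. The function name `first_pass` is hypothetical, and the dictionary lookup that follows is not shown.

```python
from typing import List, Union

Cell = Union[str, List[str]]


def first_pass(text: str) -> List[Cell]:
    # Step 1: each character becomes one element of the matrix.
    es: List[Cell] = list(text)
    # The first space is taken as the operator to apply; index() raises
    # ValueError if the input contains no space.
    i = es.index(" ")
    # Assumed selector: contiguous characters immediately to the left of the operator.
    start = i
    while start > 0 and es[start - 1] != " ":
        start -= 1
    # "wrap": group the selected arguments in a list as the new value.
    wrapped = es[start:i]
    # "replace": the new value takes the place of the arguments and the operator.
    return es[:start] + [wrapped] + es[i + 1:]


print(first_pass("Jane jumped"))
# [['J', 'a', 'n', 'e'], 'j', 'u', 'm', 'p', 'e', 'd']
```

With the input “Jane jumped”, the space is the fifth element and the selected characters are the first four, matching elements B5 and B1 to B4 in the text.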
A working version of the algorithm would include a more sophisticated control structure. The control structure would allow for alternatives by using a backtracking algorithm similar to that of Prolog. To simplify the presentation, this is not shown. Moreover, the algorithms for backtracking are well known and easily applied.
For each element, properties could be maintained to further characterize the element. These properties could be used in the selection process as well as to maintain semantics. When the enhanced operator precedence model is used for analysis, structures from languages other than English are represented with interoperable definitions. This allows utterances that contain mixed languages to be processed seamlessly. Other layers of definitions could be added to support converting sounds into elements that are then tokenized and further processed. This would provide a seamless model for processing speech into action.