1. Technical Field
The present embodiments relate to systems and methods for efficient memory usage in speech recognition, and more particularly to efficient systems and methods for the compilation of static decoding graphs.
2. Description of the Related Art
The use of static hidden Markov model (HMM) state networks (search graphs) is considered one of the most speed-efficient approaches to implementing synchronous (Viterbi) decoders. The speed efficiency comes not only from the elimination of the graph construction overhead during the search, but also from the fact that global determinization and minimization provide the smallest possible search space.
Determinization and minimization procedures are known in the art and provide a reduction in a final graph for decoding speech. Minimization refers to the process of finding a graph representation that has the minimum number of states. Determinization refers to the process of transforming a graph into an equivalent one in which each label sequence corresponds to a unique state sequence (labels are associated with arcs). The graphs referred to herein are generally search graphs, which indicate a solution or a network of possibilities for a given utterance or speech.
The use of finite state transducers (FSTs) has become popular in the speech recognition community, as they provide a solid theoretical framework for the operations needed for search graph construction. A search graph is the result of the composition
C ∘ L ∘ G      (1)
where G represents a language model, L represents a pronunciation dictionary, and C converts the context-independent phones to context-dependent HMMs. The main problem with direct application of the composition step is that it can produce a non-deterministic transducer, possibly much larger than its optimized equivalent. The amount of memory needed for the intermediate expansion may be prohibitively large given the targeted platform.
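For purposes of illustration only, the following sketch shows a naive composition of two epsilon-free weighted transducers, each represented as a simple Python dictionary. The data layout and all names are assumptions of this example and not actual structures of the embodiments; the sketch merely makes concrete why the intermediate product, whose states are pairs of component states, can greatly exceed its optimized equivalent.

from collections import deque

def compose(t1, t2):
    # Transducers are {"start": s, "arcs": {state: [(in, out, cost, next)]}}.
    # The composed machine ranges over pairs (s1, s2), which is why a direct
    # expansion may need far more memory than the minimized result.
    start = (t1["start"], t2["start"])
    arcs, seen, queue = {}, {start}, deque([start])
    while queue:
        s1, s2 = queue.popleft()
        out = []
        for i1, o1, c1, n1 in t1["arcs"].get(s1, []):
            for i2, o2, c2, n2 in t2["arcs"].get(s2, []):
                if o1 == i2:  # output of the first must match input of the second
                    dest = (n1, n2)
                    out.append((i1, o2, c1 + c2, dest))
                    if dest not in seen:
                        seen.add(dest)
                        queue.append(dest)
        arcs[(s1, s2)] = out
    return {"start": start, "arcs": arcs}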
Many techniques proposed for efficient search graph composition restrict the phone context to triphones, since the complexity of the task grows significantly with the size of the phonetic context used to build the acoustic model, particularly when cross-word context is considered. For large cross-word contexts, auxiliary null states may be employed using a bipartite graph partitioning scheme. In one prior art approximate partitioning method, the most computationally expensive part is vocabulary dependent. Determinization and minimization are applied to the graph in subsequent steps.
Another technique builds the phone-to-state transducer C by incremental application of tree questions, one at a time. The tree can be built effectively only up to a certain context size, unless it is built for a fixed vocabulary. This method still relies on explicit determinization and minimization steps during composition of the search graph.
A system and method for building decoding graphs for speech recognition are provided. A state prefix tree is given for each unique acoustic context. The prefix trees are traversed to select a subtree of arcs and states to be added to a final decoding graph, wherein the states and arcs are added incrementally during the traversal such that the final graph is rendered deterministic and minimal by the construction process itself.
These and other objects, features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
The present disclosure provides an efficient technique for the compilation of static decoding graphs. These graphs can utilize a full word of cross-word context, either left or right. The present disclosure will illustratively describe the use of left cross-word contexts for generating decoding graphs. One emphasis is on memory efficiency, in particular to be able to deploy the embodiments described herein on platforms with limited resources. Advantageously, the embodiments provide an incremental application of the composition process to efficiently produce a weighted finite state acceptor which is globally deterministic and minimized, with the maximum memory needed during the composition being essentially the same as that needed for the final graph. Stated succinctly, the present disclosure provides a system and method which builds a final graph in a way that yields a deterministic and minimized result by virtue of the process itself, and not by employing separate determinization and minimization algorithms.
Desirable properties considered herein include vocabulary independence, maximal memory efficiency, and the ability to trade speed for complexity. By vocabulary independence, it is meant that the vocabulary can be changed without significantly affecting the efficiency of the algorithm. In some situations, the grammar G is constructed just before recognition starts, defining the vocabulary. For example, in dialog systems the grammars are composed dynamically in each dialog state. In another case, the user is allowed to customize the application by adding new words.
A more complex model can be used for greater recognition accuracy, e.g., a wider cross-word context, with a trade-off against the speed of graph building. However, if speed is needed as well, one can use a model with a reduced context size to meet the requirements.
Use of the left cross-word context is described herein; however, the right cross-word context can also be employed, at the increased complexity of right-context cross-word modeling. IBM acoustic models are typically built with an 11-phone context (including the word boundary symbol), which means that within the word the context is ±5 phones wide in each direction, and, since one of the five left positions is occupied by the word boundary symbol, the left cross-word context is at most 4 phones wide.
It should be understood that the elements shown in the FIGS. may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in software on one or more appropriately programmed general-purpose digital computers having a processor and memory and input/output interfaces. In addition, advantageously, in accordance with the teachings herein, memory buffers and memory storage may be provided as ROM, RAM or a combination of both. Each block or blocks may comprise a single module or a plurality of modules for implementing functions in accordance with the illustrative embodiments described herein with respect to the following FIGS.
Referring now to the drawings in which like numerals represent the same or similar elements and initially to
Referring to
The set of left context classes is constructed by simply enumerating all phone k-tuples observed in all lexemes. This is an upper bound, as some phone sequences will have the same left context effect. As the graph is built, those classes with a truly unique context will be automatically found by the minimization step. For this reason, it is preferred to perform the connection of each lexeme arc to its corresponding unique tree root in a separate final step, after all trees for all contexts have been applied to the graph.
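As a minimal sketch of this enumeration, assuming an illustrative lexicon format (a mapping from lexemes to phone sequences) and a context width k:

def left_context_classes(lexicon, k):
    # Upper bound on the distinct left-context effects: the set of k-tuples
    # of final phones over all lexemes. Entries shorter than k contribute
    # shorter tuples here; in practice they would be padded with the word
    # boundary symbol.
    return {tuple(phones[-k:]) for phones in lexicon.values()}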
For state equivalence testing performed during the incremental build, a hash table is preferably employed. The state is represented by a set of arcs, and each arc may be represented by a triple (destination state, label, cost). To minimize the amount of memory used by the hash table, the hash is implemented as a part of the algorithm. In a stand-alone hash implementation, the key value is stored in the hash table for conflict resolution, which would effectively double the amount of memory needed to store the graph. Advantageously, the memory structure provided herein for the graph state representation includes records related to the hashing, i.e., a pointer for the linked list construction and the hash lookup value (the graph state id). In this way, the hashing adds only 8 bytes to each graph state.
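A minimal sketch of this equivalence lookup follows; Python's built-in dict stands in for the intrusive chained hash described above (so the 8-byte overhead argument does not carry over to the sketch), and all names are illustrative:

class FinalGraph:
    def __init__(self):
        self.states = []   # state id -> list of (dest_id, label, cost) arcs
        self.by_key = {}   # canonical arc tuple -> state id

    def merge_state(self, arcs):
        # The local minimization step: reuse an existing equivalent state
        # if one is already in the final graph; otherwise commit a new one.
        key = tuple(sorted(arcs))
        sid = self.by_key.get(key)
        if sid is None:
            sid = len(self.states)
            self.states.append(list(arcs))
            self.by_key[key] = sid
        return sid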
Referring to
In block 214, a check is performed to determine if all leaves have been processed. If all the leaves have been processed, the remaining states and arcs in the state and arc buffers are merged with the final graph in block 234. Otherwise, in block 218, a next leaf is selected from the sorted list of leaves. The selected leaf is merged with the final graph. Then, the state that is the parent node of the selected leaf is selected. In block 220, the level of the selected state is determined and a new arc is created from the selected state (in this case the parent node) to the previously selected state (the child).
In block 222, a check is performed to determine whether the state level buffer includes a waiting state for that level. A waiting state is a conditional state where the outcome of processing other nodes may still affect the disposition of the node in the waiting state. The waiting state is used to determine if any other processing has used a state at the presently reached level in the graph. In other words, has any processing at the parent level been previously performed? If it has, then that state (or node) is in a waiting state. If a waiting state is included, then in block 224, it is determined whether the waiting state is the same as the currently selected state. If the selected state is the same as the waiting state, a new arc is added to the arc level buffer going toward the root of the tree in block 216, and the process returns to block 218 where the next leaf or state is considered.
If the waiting state is not the same as the selected state, then the waiting state and corresponding arcs are merged from the arc level buffer into the final graph in block 228. By virtue of the setup of the prefix tree, the waiting state and the arcs can be committed to the final graph at this early stage, since all possibilities have been considered previously for the waiting state. If the state level buffer does not include a waiting state (from block 222), or the waiting state has been merged with the final graph (block 228), then the selected state is added to the state level buffer as waiting and the corresponding arcs are added to the arc level buffer, in block 226. Processing then continues with block 230.
In block 230, a determination is made as to whether the state is a root of the tree. If it is the root, processing continues with block 214. Otherwise, in block 232, a parent of the selected state is selected and processing returns to block 220.
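The following sketch recasts blocks 214-234 in Python. It assumes illustrative data structures (each leaf carries a pre-order index, each node knows its incoming arc label and cost and can report its root-to-parent chain via path(), and merge_state is the hashed lookup sketched above); it shows the level-buffer logic, not the embodiments' exact implementation.

def build_from_leaves(leaves, graph):
    waiting = []  # stack of [node, arcs], one entry per level on the path
    def collapse(depth):
        # Commit waiting states deeper than `depth` bottom-up, handing each
        # merged state id to its parent as a new arc (blocks 228 and 234).
        sid = None
        while len(waiting) > depth:
            node, arcs = waiting.pop()
            sid = graph.merge_state(arcs)
            if waiting:
                waiting[-1][1].append((sid, node.in_label, node.in_cost))
        return sid
    for leaf in sorted(leaves, key=lambda n: n.index):  # blocks 214 and 218
        path = leaf.path()  # [root, ..., parent of leaf]
        keep = 0            # length of the prefix shared with waiting states
        while (keep < len(waiting) and keep < len(path)
               and waiting[keep][0] is path[keep]):
            keep += 1       # block 224: the waiting state is the same
        collapse(keep)      # block 228: a different state was waiting
        for node in path[keep:]:
            waiting.append([node, []])  # block 226: mark state as waiting
        leaf_id = graph.merge_state(list(leaf.arcs))  # leaf merged first
        waiting[-1][1].append((leaf_id, leaf.in_label, leaf.in_cost))
    return collapse(0)      # block 234: flush everything up to the root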
By traversing the states and arcs in this way, a final graph is constructed incrementally and is, by construction, deterministic and minimized. This is particularly useful in memory-limited applications.
Deterministic acyclic finite state automata can be built with high memory efficiency using this incremental approach. The final graph is not necessarily acyclic (certainly not if it is an n-gram model), but the cyclic graph minimization is not needed assuming that the grammar G is provided in its minimal form.
One distinct feature of the present method and system is that the amount of memory needed to store the graph at any point will not exceed the amount of memory needed for the final graph. It should be understood that the actual graph representation during the composition needs more memory per state than the final representation used during decoding, but it is fair to say that the memory need is O(S + A), where S is the number of states and A is the number of arcs of the final graph.
The efficiency of the present disclosure has been achieved by using finite state acceptors (FSAs) rather than finite state transducers (as in the prior art). Using acceptors rather than transducers makes operations such as determinization and minimization less complex. One concept includes the combination of all steps (composition, determinization, minimization and weight pushing) into a single step.
The method/system described with reference to
A deterministic prefix tree T is constructed which maps HMM state sequences to pronunciation variants of words (lexemes) in G. Each unique arc sequence representing an HMM state sequence is terminated by an arc labeled with the corresponding lexeme.
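A minimal sketch of constructing such a tree, assuming an illustrative lexeme table mapping each lexeme to its HMM state sequence:

def build_prefix_tree(lexemes):
    # Nested-dict trie: arcs are labeled with HMM states, and each unique
    # state sequence is terminated by an arc carrying the lexeme label.
    root = {}
    for lexeme, hmm_states in lexemes.items():
        node = root
        for s in hmm_states:
            node = node.setdefault(s, {})
        node[("lexeme", lexeme)] = {}  # terminal arc labeled with the lexeme
    return root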
All arcs leaving a particular state of G are replaced by a subtree of T with the proper scores assigned to the subtree leaves. The operation which performs this replacement on all states of G is denoted as R_T(G).
The resulting FSA is deterministic. The minimization (including weight pushing) is also included in the subtree selection, so the resulting FSA is minimized as well.
This minimization is done locally, which means that its extent is limited to subtrees leading to the same target states of G. This is due to the fact that the algorithm preserves the states and arcs of G. If a and b are two different states of G, then:
a ≠ b ⇒ L(G_a) ≠ L(G_b) ⇒ L(R_T(G_a)) ≠ L(R_T(G_b)),      (2)
where G_a is the maximal connected sub-automaton of G with start state a, and L(G_a) is the language generated by G_a. In other words, if G is minimized, the algorithm cannot produce a graph that would allow the original states to merge. This has important implications. To minimize the composed graph, only local minimization needs to be performed, e.g., any two states of the composed graph need to be considered for merging only if they lead to the same sets of states of G. This minimization is acyclic and thus very efficient (algorithms with complexity O(N + A) do exist). The subtree selection is applied incrementally to each state of G. As the states of the subtree are processed, they are immediately merged with the final graph in a way that preserves the minimality of the final graph.
It should be mentioned that the minimized FSA may be suboptimal in comparison to its equivalent FST form, since the transducer minimization allows input and output labels to move. While this minimization can still be performed on the constructed graph, it is avoided for practical reasons as it is preferable to place the lexeme labels at the word ends.
The system and method use a post-order tree traversal. Starting at the leaves, each state is visited after all of its children have been visited. When the state is visited, the minimization step is performed, e.g., the state is checked for equivalence with other states which are already a part of the final graph. Two states are equivalent if they have the same number of arcs and the arcs are pair-wise equivalent, i.e., they have the same label, cost and destination state. If no equivalent state is found, then the state is added to the final graph. Otherwise, the equivalent state is used. A hash table may be used to perform the equivalence test efficiently.
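A direct rendering of this equivalence test (the hashed lookup sketched earlier realizes the same predicate through a canonical key):

def states_equivalent(arcs_a, arcs_b):
    # Arcs are (destination state, label, cost) triples; equivalence means
    # the same number of arcs, pair-wise equal after canonical ordering.
    return sorted(arcs_a) == sorted(arcs_b)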
Useful implementations of the post-order processing take into account that only a subset of the tree, defined by the selected leaves corresponding to the active lexemes, needs to be traversed. The node numbering follows a pre-order traversal. The index (number) of each leaf corresponds to one lexeme. Each node also carries information about its distance to the root (tree level).
One aspect of the minimization may include weight pushing. This concept fits naturally into the post-order processing framework in accordance with the embodiments described herein. The costs are initially assigned to the selected leaves. As the states of the prefix tree are visited, the cost is pushed towards the root using an algorithm described in the prior art.
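A minimal sketch of this step in the tropical (min, +) semiring, assuming an illustrative node type whose arcs are (cost, child) pairs and whose selected leaves carry a pre-assigned cost:

def push_weights(node):
    # Post-order: children are resolved first; this state keeps only the
    # residual arc costs, and the common part is returned toward the root.
    if not node.arcs:
        return node.cost  # the score assigned to a selected leaf
    totals = [(cost + push_weights(child), child) for cost, child in node.arcs]
    m = min(t for t, _ in totals)
    node.arcs = [(t - m, child) for t, child in totals]
    return m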
Referring to
Starting with the top leaf, in this case leaf 4, the tree is traversed towards the root (1), and all states along the path are marked as waiting (hexagons). When the second leaf is processed, in
In
The same process is performed in
The upper bound on the amount of memory needed to traverse the tree is proportional to the depth of the tree times the maximum number of arcs leaving one state. The memory in which the tree is stored does not need write access, and neither the memory nor the computational cost of the selection depends directly on the size of the whole tree. In situations where the vocabulary does not change, or when a large vocabulary can be created to guarantee coverage in all situations, the tree can be precompiled and stored in ROM or can be shared among clients through shared memory access. Since ROM is cheaper, the present disclosure provides the ability to mix ROM and RAM memories in a way that can optimize memory efficiency and reduce cost.
In left cross-word context modeling, instead of one prefix tree, a new tree needs to be built for each unique context. The number of unique contexts theoretically grows exponentially with the number of phones across the word boundary. In practice, this is limited by the vocabulary. The number of phones inside a word that can be affected by the left context does not have a significant effect on the complexity of the algorithm.
After a final graph has been determined, the final graph is employed in decoding or recognizing speech for an utterance.
The effect of the context size on the compilation speed has been tested on two tasks. The first task is a grammar (a list of stock names) with 8335 states and 22078 arcs. The acoustic vocabulary has 8 k words and 24 k lexemes. The second task is an n-gram language model (switchboard task) with 1.7M 2-grams, 1.2M 3-grams and 86 k 4-grams, with a vocabulary of 30 k words and 32.5 k lexemes. The compilation time was measured on a LINUX™ workstation with 3 GHz Pentium 4 CPUs and 2.0 GB of memory and is shown in Table 1.
While the efficiency suffers when the context size increases, the computation can be sped up for large contexts by relaxing the vocabulary independence requirement and precomputing the effective number of unique contexts. Given a fixed vocabulary, the number of contexts is limited by the number of unique combinations of the last n phones of all words. However, some of the contexts will have the same cross-word context effect, and for those contexts only one context prefix tree needs to be built. Table 2 compares the limit and effective values of context classes on both tasks. The effective value can be found as the number of tree roots in an expansion of a unigram model. This expansion is in fact a part of any backoff n-gram graph compilation and represents the most time consuming part of the expansion.
A much larger n-gram model was employed to test the memory needs of the process. While keeping the total memory use below 2 GB, a language model was compiled into a graph with 35M states and 85M arcs.
A system and method for memory efficient decoding graph construction have been presented. By eliminating intermediate processing, the memory need of the present embodiments is proportional to the number of states and arcs of the final minimal graph. The method is very computationally efficient for short left cross-word contexts (and an unlimited intra-word context size), but it can also be used to compile graphs for a wide left cross-word context without sacrificing memory efficiency.
Having described preferred embodiments of memory efficient decoding graph compilation system and method (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.