The manner in which humans interact with computing devices is rapidly evolving and has reached the point where human users can access services and resources on the Internet using natural language. Speech recognition software tools continue to improve in terms of the fidelity with which human speech is captured despite the tremendous variation with which the input is delivered. However, there is still considerable work to be done with regard to making such input understandable to machines. That is, in order for a user's verbal request to be fulfilled, not only must the input be accurately and reliably captured, but the semantic meaning the input represents must be accurately and reliably translated to a form with which a machine can work. To the extent that the accuracy and reliability with which the input is entered or captured are compromised, this goal is undermined. For example, if the user is entering the input in a text interface, misspellings (introduced either by the user or by an auto-correction feature of the interface) and grammatical errors can result in the received text representing a semantic meaning radically different from the one intended. In the context of speech recognition, misrecognition of words and phrases can lead to similar results.
This disclosure describes techniques for translating natural language input, either in the form of captured speech or entered text, to a form which is understandable to a computing device and that accurately represents the semantic meaning of the input intended by the user. In addition to matching the words and phrases of the received input to stored words and phrases, the system described herein also takes into account various probabilities associated with the received input that model what the user is likely to have actually said or intended given the received input. These probabilities are used to assign scores to different translation options from which the system can then select the best (or none if appropriate). An example may be illustrative.
Suppose the user asks the system “What is the airspeed velocity of an unladen swallow?” and the speech recognition software captures this input as “what is the nearest tree velocity oven laden swallow.” Conventional approaches to translation of this input might just take this text and attempt a response based on the mistranslation or specific keywords recognized in the translation, potentially resulting in a poor user experience. By contrast, using the techniques described herein, the system can better understand what the user is likely to have said based on the received input, translate the intended semantic meaning accurately, and generate an appropriate response, e.g., “Which do you mean, African or European swallow?”
It should also be noted that, despite references to particular computing paradigms and software tools, the computer program instructions on which various implementations are based may correspond to any of a wide variety of programming languages, software tools and data formats, and may be stored in any type of non-transitory computer-readable storage media or memory device(s), and may be executed according to a variety of computing models including, for example, a client/server model, a peer-to-peer model, on a stand-alone computing device, or according to a distributed computing model in which various functionalities may be effected or employed at different locations. In addition, references to particular protocols herein are merely by way of example. Suitable alternatives known to those of skill in the art may be employed.
According to a particular class of implementations, service 102 is a knowledge representation system for filling information needs (e.g., question answering, keyword searching, etc.) using a knowledge base 110 that stores information as structured data 112. Knowledge base 110 includes speech recognition logic 113 that captures human speech as natural language input, and translation logic 114 which translates between the natural language input and a machine-readable format that is compatible with structured data 112. That is, translation logic 114 translates natural language input (e.g., captured speech or entered text received from device 106-5) to machine-readable queries that are then processed with reference to structured data 112. Translation logic 114 also translates responsive data from structured data 112 and/or its own machine-readable queries requiring further input to natural language for presentation to the user (e.g., on device 106-5). Query processing logic 116 processes machine-readable queries generated by translation logic 114 with reference to structured data 112 to generate responses which may then be translated to natural language by translation logic 114. Knowledge base 110 also includes translation templates 118 that are selected by translation logic 114 to efficiently translate the received natural language input according to the techniques described herein.
It should be noted that implementations are contemplated in which at least some of the functionality depicted as being included in knowledge base 110 of
According to a particular class of implementations, the system in which natural language translation as described herein is supported (e.g., service 102 of
An example of the operation of a specific implementation of natural language translation logic in the context of a knowledge representation system will now be described with reference to the flow diagram of
The text string is selected (204) and an input graph is generated (206) representing the different ways to break up the string for pattern matching. For example, the string “what is the capital of france” can be broken up into [“what is”, “the capital of”, “france”] or [“what is the”, “capital of france”], etc. There are 2^(n−1) distinct breakups or breakup elements for a string having n words. This may be represented in a graph with n(n+1)/2 edges. An example of such a graph for the string “what is the capital of france” is shown in
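By way of illustration only, the following Python sketch (not drawn from any particular implementation; the function name and representation are hypothetical) builds such an input graph by creating one node per word boundary and one edge per contiguous substring, which yields the n(n+1)/2 edges and 2^(n−1) breakups noted above.

```python
def build_input_graph(text):
    """Build an input graph for a tokenized string.

    Nodes are word-boundary positions 0..n; each edge (i, j) is labeled with
    the contiguous substring words[i:j].  An n-word string yields
    n*(n+1)//2 edges, and every path from node 0 to node n corresponds to
    one of the 2**(n-1) possible breakups of the string.
    """
    words = text.split()
    n = len(words)
    edges = {(i, j): " ".join(words[i:j])
             for i in range(n) for j in range(i + 1, n + 1)}
    return n, edges

n, edges = build_input_graph("what is the capital of france")
assert len(edges) == n * (n + 1) // 2          # 21 edges for a 6-word string
# One breakup corresponds to the path 0 -> 2 -> 5 -> 6:
print([edges[(0, 2)], edges[(2, 5)], edges[(5, 6)]])
# ['what is', 'the capital of', 'france']
```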
According to a particular implementation, each translation template includes a pattern made up of fixed strings and variables. The template graph, e.g., a finite state automaton, represents the patterns of the translation templates, i.e., the regular expression portions of the templates. The edges of the template graph represent strings or variables from the templates that can be bound to strings of the input. As will be discussed, the edges of the template graph may be weighted to reflect a variety of probabilities, e.g., the likelihood of reaching a particular understanding via that edge. Terminal nodes of the template graph correspond to templates to be used for translation if reached. Different terminal nodes have different priorities based on the priority of the corresponding template. According to various implementations, a template's priority is represented by a value or score that specifies a priority ordering that may be determined by a number of template characteristics, e.g., how likely the corresponding understanding is; template specificity (e.g., more specific templates might tend to have a higher priority); templates that handle exceptions for templates further down the priority ordering might have a higher priority; etc. The priority score might represent a combination of more than one of these template characteristics. All nodes in the template graph are labeled with the highest priority template that can be reached from that point in the graph. An example of a portion of such a graph representing a single template pattern is shown in
Edges of the template graph are labeled with integers or the label “var.” Integers represent transitions to follow if a particular fixed string is read. The label “var” represents transitions to follow if binding a substring of the input to a variable. In the example depicted in
Matching the input graph to the template graph proceeds by automata intersection. Nodes in the resulting intersection graph are tuples (u,v) of a node (u) of the input graph and a node (v) of the template graph. An edge exists between the nodes (u1,v1) and (u2,v2) in the intersection graph if, and only if, there is an edge in the input graph between (u1) and (u2), there is an edge in the template graph between (v1) and (v2), and the labels on these edges are compatible. Compatibility may mean an exact match, but can include insertion or removal of a word to achieve a match (in which case a penalty may be added). The label "var" on a template graph edge is compatible with any label on an input graph edge, and integer labels on a template graph edge are compatible with labels in the input graph if, and only if, the set of strings corresponding to the integer contains the string on the input graph edge. A node in the intersection graph is a terminal node if, and only if, it is a terminal node in both the template graph and the input graph. The template priorities associated with the nodes of the template graph cause the traversal of the intersection graph to proceed preferentially following edges from nodes that have the highest priority terminal nodes as children. As indicated above, the matching node for a given text string might represent the fact that an appropriate translation template does not exist. In some cases, a "var" edge of a template graph may be "skipped" by, for example, binding the corresponding variable to a placeholder value (e.g., a common or expected value given the context of the variable) or a null value, and/or by modifying the processing of the resulting query to accept a placeholder/null value for that variable.
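The following sketch illustrates, under simplifying assumptions, the intersection construction just described. It represents each graph as a set of labeled edges, treats compatibility as either an exact match against an integer's string set or a "var" wildcard, and omits insertion/removal penalties, skipped variables, and priority-guided traversal; the template and string sets shown are hypothetical.

```python
def intersect(input_edges, template_edges, string_sets):
    """Build the intersection graph of an input graph and a template graph.

    input_edges:    iterable of (u1, u2, substring)
    template_edges: iterable of (v1, v2, label) where label is "var" or an
                    integer index into string_sets
    string_sets:    {int: set of fixed strings that integer stands for}

    Nodes of the result are tuples (u, v); an edge exists iff both source
    graphs have an edge and the labels are compatible ("var" matches any
    substring; an integer matches iff its string set contains the substring).
    """
    edges = []
    for (u1, u2, s) in input_edges:
        for (v1, v2, label) in template_edges:
            if label == "var" or s in string_sets.get(label, set()):
                edges.append(((u1, v1), (u2, v2), s, label))
    return edges

# Hypothetical template "what is the capital of <var>":
string_sets = {0: {"what is", "what's"}, 1: {"the capital of"}}
template_edges = [("t0", "t1", 0), ("t1", "t2", 1), ("t2", "t3", "var")]
input_edges = [(0, 2, "what is"), (2, 5, "the capital of"), (5, 6, "france")]
for edge in intersect(input_edges, template_edges, string_sets):
    print(edge)
```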
Referring again to
The hypothesis with the highest statistical score is selected (218), and if the highest priority matching node corresponds to a translation template (220), the variable bindings for the matching template are determined (222), the semantic checks associated with the template are performed to verify that the variable bindings make sense for that template (224) and, if so, a machine-readable query is generated for processing by the knowledge base (226). If, on the other hand, the selected node represents the decision not to understand the input (220), the lack of understanding is communicated to the user (228) or, alternatively, the input may be ignored.
Some template patterns may include several consecutive variables and few fixed strings of text. This may result in matching of such templates to a large number of input strings. Performing the semantic checks for each variable for every match may represent a performance bottleneck. Therefore, according to a particular implementation, we take advantage of the fact that the failure of a particular variable binding for a given input text breakup element obviates the need to perform related semantic checks for that matching node, e.g., if a particular variable binding will always fail for a given breakup element, semantic checks for other combinations of variables and breakup elements associated with the failed binding need not be run. Further, if there are multiple matching nodes for which the variable binding for a particular breakup element would otherwise need to be checked, the failure of the binding for one of the nodes may be used to reduce the number of semantic checks for any other node for which that variable and breakup element combination arises.
According to a particular implementation, one or more semantic checks for each template are extracted and tested to determine whether a particular binding of a variable will cause the entire set of semantic checks to fail. These “partial” semantic checks have the property that they refer to just a single variable from a pattern, and that if they fail for a particular binding then the overall set of semantic checks will also fail. When we get a match against a template pattern, we take the graph generated by matching nodes against each other from the input and template graphs, and work backwards from the matching terminal node enumerating possible paths to the start. Each of these paths represents a particular way to bind sections of the input string to the variables. When we encounter an edge that represents a variable for which we managed to successfully extract a semantic check with which to test it, we perform the test. If the test passes we continue working backwards to the root. If it fails then we backtrack and do not generate any paths using that edge. We also store the result of testing that semantic check with that binding in a hash map so that if any other paths make the same test we can rule those out quickly too.
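A minimal sketch of this backtracking enumeration with memoized partial semantic checks follows; the graph layout, variable names, and check functions are hypothetical, and only the pruning and caching behavior described above is illustrated.

```python
def enumerate_bindings(parents, terminal, partial_checks, cache=None):
    """Walk backwards from a matching terminal node, enumerating variable
    bindings and pruning any path whose partial semantic check fails.

    parents:        {node: [(parent_node, var_name, bound_substring), ...]}
                    var_name is None for fixed-string edges
    terminal:       the matching terminal node; the root has no parents
    partial_checks: {var_name: callable(substring) -> bool}
    cache:          memo of (var_name, substring) -> result so the same
                    failed binding is never re-checked on another path
    """
    cache = {} if cache is None else cache
    if terminal not in parents:            # reached the root of the match
        return [{}]
    paths = []
    for parent, var, substring in parents[terminal]:
        if var is not None and var in partial_checks:
            key = (var, substring)
            if key not in cache:
                cache[key] = partial_checks[var](substring)
            if not cache[key]:
                continue                   # prune: this binding can never succeed
        for binding in enumerate_bindings(parents, parent, partial_checks, cache):
            new = dict(binding)
            if var is not None:
                new[var] = substring
            paths.append(new)
    return paths

# Hypothetical example: variable "place" must look like a known place name.
parents = {
    "end": [("mid", "place", "france"), ("mid", "place", "tree velocity")],
    "mid": [("root", None, "the capital of")],
}
checks = {"place": lambda s: s in {"france", "paris"}}
print(enumerate_bindings(parents, "end", checks))   # [{'place': 'france'}]
```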
Statistical Scoring Scheme
According to a particular implementation, the statistical scoring scheme employs a Bayesian approach that attempts to identify the most likely input intended by the user given the text received. Assume the user intended input text T (which translates with template U), but the system actually received input text T′. The scoring scheme attempts to maximize the expression P(T, U|T′), i.e., the probability of T and U given the text received, i.e., T′. This expression may be approximated as follows:
P(T, U | T′) = P(T′ | T, U) P(T, U) / P(T′) (1)
∝ P(T′ | T, U) P(T, U) (2)
= P(T′ | T) P(T, U) (3)
= P(T′ | T) P(T | U) P(U) (4)
where P(T′|T) represents the probability that, if the user intended the text T, T′ is actually received; P(T|U) represents the probability of the user intending a particular input text T given that the intended text would translate with a particular template U; and P(U) represents the prior probability that the user intended an input which would/should be translated with template U.
Equation (1) represents the application of Bayes' theorem to the expression we wish to maximize. Expression (2) follows because T′ is a constant for all values of T and U. Expression (3) assumes that the probability of substituting T for T′ is approximately independent of U. Finally, expression (4) applies the chain rule to separate P(T, U) into two components. Note that we are treating the value of U as being determined by T, i.e., we take each hypothesis of what the user intended to say T and run that through the standard translation system to get the template match U that corresponds to that text. As mentioned above, we account for the possibility that a question doesn't translate at all by treating not translating as another logical template with its own prior probability. The problem therefore becomes one of finding, for a given input text T′, the value of T (and therefore also U) that maximizes expression (4).
Log Probabilities
To train the statistical models we used a log of inputs actually received by the system to represent the statistics for the inputs intended by users. We also used the standard approach of taking the logarithm of all probability values before manipulating them. For example, the probability 0.0005 becomes a log probability of −7.601. To multiply together probabilities we can add their corresponding log-probabilities, since:
log(ab) = log(a) + log(b) (5)
So, for example, if probability a is 0.0005 (with a log probability of −7.601) and probability b is 0.0001 (with a log probability of −9.210), the log probability of the product of a and b is simply the sum of the two individual log probabilities, −16.811.
To divide probabilities we subtract one from the other, since:
log(a/b) = log(a·b^{-1}) = log(a) + log(b^{-1}) = log(a) − log(b) (6)
Adding and subtracting probabilities in the log domain requires the exp( ) function, but for these models we don't have to add or subtract probabilities at run-time. Smaller probabilities correspond to more negative log probabilities. The largest possible probability of 1.0 corresponds to a log probability of 0.0, and the smallest possible probability of 0.0 corresponds to a log probability of negative infinity.
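A brief illustration of this log-probability arithmetic, using the example values above:

```python
import math

a, b = 0.0005, 0.0001
log_a, log_b = math.log(a), math.log(b)          # -7.601 and -9.210
log_product = log_a + log_b                      # multiply by adding: -16.811
log_quotient = log_a - log_b                     # divide by subtracting
assert math.isclose(math.exp(log_product), a * b)
assert math.isclose(math.exp(log_quotient), a / b)
print(round(log_a, 3), round(log_b, 3), round(log_product, 3))
```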
Sparse Data and Good-Turing Estimation
A common problem with many types of linguistic data (for example, words occurring in documents or log entries) is that however much training data is read there will always be events that are not seen in the training data. The naive approach to calculating the probability of a word that appears r times in the training data would be to divide r by the total number of words in the training text, N (this is referred to as the "maximum likelihood" estimate). This estimate would not be appropriate for this type of data since it would estimate the probability of any word not seen before as zero. One technique for dealing with this problem is the Good-Turing estimator described in "Good-Turing Frequency Estimation Without Tears," William Gale and Geoffrey Sampson, Journal of Quantitative Linguistics, vol. 2, no. 3 (1995), pp. 217-237, the entirety of which is incorporated herein by reference for all purposes.
The Good-Turing estimate of the probability p0 of seeing a word that we haven't seen before is:
p_0 = n_1 / N (7)
where n1 is the number of words seen just once in the training data, and N is the total number of words in the training data. An intuition for this is that when we saw each of these words they were previously an unseen word, so a good estimate for the probability of seeing a new unseen word is the number of times we saw a word in the training data that was previously unseen. Note that this doesn't tell us the probability of seeing a particular word that we haven't seen before, just the total probability mass that should be set aside for words that weren't in the training data.
Extending the intuition, a good estimate for the fraction of words in a new data stream that we expect to see r times is the fraction of words that were seen r+1 times in the training data, since when we saw them the r+1st time they were words that we had seen just r times. For the probability pr of seeing any of the words that were seen r times in the training data we have:
p_r = (r+1) n_{r+1} / N (8)
Again this gives just the total probability mass that should be assigned to all words that occurred r times, not the probability of a particular word that occurred r times. However, in the case r>0 we know how many of these words there actually are, so we can assign a probability to a particular word:
P(w) = (r+1) n_{r+1} / (N n_r) (9)
where w is a word that occurred r times in the training data.
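For illustration, a minimal sketch of the raw Good-Turing estimates of equations (7) through (9), computed from a toy set of word counts (the counts are hypothetical and far smaller than any realistic training set):

```python
from collections import Counter

def good_turing(word_counts):
    """Raw Good-Turing estimates from a {word: count} mapping.

    Returns (p0, per_word) where p0 is the total probability mass reserved
    for unseen words (eq. 7) and per_word[w] = (r+1) * n_{r+1} / (N * n_r)
    (eq. 9).  Words whose count r has no observed count r+1 get no estimate
    here; smoothing the n_r values (next section) addresses that.
    """
    N = sum(word_counts.values())
    n = Counter(word_counts.values())     # n[r] = number of words seen r times
    p0 = n[1] / N
    per_word = {}
    for w, r in word_counts.items():
        if n[r + 1] > 0:
            per_word[w] = (r + 1) * n[r + 1] / (N * n[r])
    return p0, per_word

counts = {"france": 3, "paris": 1, "capital": 1, "swallow": 2}
p0, probs = good_turing(counts)
print(p0)      # 2/7: two words seen exactly once out of N = 7 tokens
print(probs)   # e.g. P("swallow") = 3 * n_3 / (N * n_2) = 3 * 1 / (7 * 1)
```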
In practice a problem with this estimator is that there will be gaps in the counts that occur in the training data, particularly for high values of r. For example, while there will usually be plenty of words that occurred 1 to 3 times, there might be only one word that occurred 23,452 times, no words that occurred 23,453 times, and then one word that occurred 23,454 times. The Gale and Sampson paper suggests a solution to this which they term the "Simple Good-Turing" approach. We first average out the counts across the zeros, replacing each non-zero count n_r with a smoothed count Z_r obtained by dividing n_r by half the distance between the nearest non-zero counts on either side of r.
Gale and Sampson then take the log of Z_r and the log of r, perform a linear regression to get a line of best fit, and then define a criterion for switching between Good-Turing smoothing based on the Z_r counts directly and smoothing based on the counts estimated from the best-fit line. However, we take a slightly simpler approach.
We expect the values Zr and r to follow approximately the following relationship:
Z_r = A r^b (10)
where A and b are constants that are inferred from the data. We estimate these constants by using a standard function minimization library to choose the values that minimize the expression:
Err(A, b) = Σ_r |A r^b − Z_r| (11)
And then we can estimate P(w) using the inferred relationship:
P(w) = (r+1) A(r+1)^b / (N A r^b)
= (r+1)^{b+1} / (N r^b) (12)
This approach is more accurate for lower values of r than the method proposed by Gale and Sampson, so there is no need for a criterion for switching between estimation methods.
We keep the estimate n1/N for the total probability of all unseen words, and we consider this reliable. However, when we add the probabilities for the other words from equation (12) we may find that the probabilities no longer sum to one. Therefore we scale the probabilities derived from (12) to make sure this condition is satisfied.
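The following sketch illustrates this simpler approach under stated assumptions: scipy.optimize.minimize is used here merely as a stand-in for the unspecified "standard function minimization library," and the Z_r values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def fit_power_law(r_values, z_values):
    """Fit Z_r ~ A * r**b by minimizing sum_r |A * r**b - Z_r| (eq. 11)."""
    r = np.asarray(r_values, dtype=float)
    z = np.asarray(z_values, dtype=float)
    err = lambda params: np.sum(np.abs(params[0] * r ** params[1] - z))
    # Start from a rough log-log regression and refine with Nelder-Mead.
    b0, log_a0 = np.polyfit(np.log(r), np.log(z), 1)
    result = minimize(err, x0=[np.exp(log_a0), b0], method="Nelder-Mead")
    return result.x   # A, b

def smoothed_probability(r, b, N):
    """P(w) for a word seen r times under the fitted relationship (eq. 12)."""
    return (r + 1) ** (b + 1) / (N * r ** b)

# Hypothetical smoothed counts Z_r for a handful of frequencies r:
r_values = [1, 2, 3, 5, 10, 100, 1000]
z_values = [12000, 5300, 3100, 1700, 780, 64, 5]
A, b = fit_power_law(r_values, z_values)
N = 1_000_000
p0 = 12000 / N                         # n_1 / N: mass reserved for unseen words
raw = {r: smoothed_probability(r, b, N) for r in r_values}
# In the full model the per-word values from eq. (12) are rescaled so that the
# seen-word mass plus p0 sums to one.
print(A, b, p0, raw[1])
```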
The Template Probability Model P(U)
We calculate template probabilities by running log questions through the system using only the best speech recognition output and not considering any substitutions, and counting how often each template is matched and how often we are not able to understand inputs. These counts follow a similar distribution to words: there are a few templates that get matched very frequently, a large number that are matched only once or twice even in a large number of log entries, and many that we never see matched at all. Therefore we apply Good-Turing smoothing to estimate the template probabilities, dividing the unseen mass p0 equally between the templates that we know have not been matched. We do not apply Good-Turing smoothing to the probability of not translating the input, since this does not seem to follow the same distribution as the translated template probabilities.
In the scoring scheme we simply take the log of the Good-Turing probability estimate of the probability of U and use that directly:
S(U) = log P_GT(U) (13)
The Text Probability Model P(T|U)
We calculate word probabilities by assuming independence between the words and simply counting the number of times we see different words matched with different templates (or words that are in inputs that are not understood). That is:
P(w_1^n | U) = Π_{1 ≤ i ≤ n} P(w_i | U) (14)
where w_1^n represents the sequence of words w_1, w_2, . . . , w_n. We estimate P(w_i | U) using a "back-off" model: if the word w_i was seen one or more times with the template U then we use the Good-Turing estimate of the probability of that word using the statistics collected from that template. If it was not seen before then we back off to using the statistics collected from all the inputs, after applying a normalization factor to make sure the probabilities sum to one:
P(w_i | U) = P_GT(w_i | U) if c_U(w_i) > 0, (15)
= α_U P_GT(w_i) otherwise,
where P_GT is the Good-Turing probability estimate, c_U(w) is the number of times word w was seen in the training data with template U, and α_U is a normalization factor. α_U is determined by the total probability of all the unseen words in the model conditioned by the template match, and the total probability of all the words in the unconditioned model that cannot be used because they are masked by the probabilities in the more specific model:
α_U = (1 − Σ_{w : c_U(w) > 0} P_GT(w | U)) / (1 − Σ_{w : c_U(w) > 0} P_GT(w)) (16)
Further information regarding this model (referred to as the Katz back-off model) may be found in Estimation of Probabilities From Sparse Data for the Language Model Component of a Speech Recognizer, Katz, S. M., IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401 (1987), the entire disclosure of which is incorporated herein by reference for all purposes.
For the Good-Turing estimate that is not conditioned by the template match PGT(w) we divide the probability mass for the unseen words by an arbitrary number to get an estimate of the probability for a specific word that was not seen anywhere in the training data. Since there's no data from which we can set this value (and it's not clear that it is meaningful anyway) we choose an arbitrary value that results in estimates that are in some sense not too big and not too small.
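A minimal sketch of the back-off computation of equations (15) and (16) follows; the Good-Turing estimates and counts are hypothetical placeholders rather than values produced by any particular training run.

```python
def backoff_alpha(counts_u, p_gt_u, p_gt):
    """Normalization factor alpha_U from eq. (16).

    counts_u: {word: count of word seen with template U}
    p_gt_u:   callable word -> Good-Turing P(word | U)
    p_gt:     callable word -> unconditioned Good-Turing P(word)
    """
    seen = [w for w, c in counts_u.items() if c > 0]
    unseen_mass_u = 1.0 - sum(p_gt_u(w) for w in seen)
    unmasked_mass = 1.0 - sum(p_gt(w) for w in seen)
    return unseen_mass_u / unmasked_mass

def p_word_given_template(word, counts_u, p_gt_u, p_gt, alpha_u):
    """Back-off estimate of P(word | U) from eq. (15)."""
    if counts_u.get(word, 0) > 0:
        return p_gt_u(word)
    return alpha_u * p_gt(word)

# Hypothetical toy estimates:
counts_u = {"capital": 4, "france": 2}
p_gt_u = {"capital": 0.30, "france": 0.20}.get
p_gt = {"capital": 0.05, "france": 0.02, "swallow": 0.01}.get
alpha_u = backoff_alpha(counts_u, p_gt_u, p_gt)        # 0.5 / 0.93
print(p_word_given_template("france", counts_u, p_gt_u, p_gt, alpha_u))   # 0.20
print(p_word_given_template("swallow", counts_u, p_gt_u, p_gt, alpha_u))  # ~0.0054
```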
One problem with the above approach is that by using equation (14) we assign exponentially decreasing probability values to longer utterances. This means that if we insert a word into an utterance we are likely to get a much lower probability score, and if we remove a word we are likely to get a much higher probability score. In practice we use log probabilities when calculating the value in (14):
log P(w_1^n | U) = Σ_{1 ≤ i ≤ n} log P(w_i | U) (17)
To avoid the bias towards shorter sentences, we calculate the average value k of log P(wi|U) in a held out test set, and subtract that value from the total log probability for each word to get a score S(w1n|U) for the text that is in some sense independent of the number of words read:
S(w_1^n | U) = (Σ_{1 ≤ i ≤ n} log P(w_i | U)) − nk (18)
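For illustration, a small sketch of the length-normalized text score of equation (18), using a hypothetical held-out average k:

```python
import math

def text_score(word_probs, k):
    """S(w_1^n | U): total log probability minus n*k (eq. 18), where k is the
    average per-word log probability measured on a held-out test set."""
    return sum(math.log(p) for p in word_probs) - len(word_probs) * k

k = -6.5                                   # hypothetical held-out average
short = text_score([0.001, 0.002], k)      # 2-word hypothesis
long = text_score([0.001, 0.002, 0.0015], k)
print(round(short, 3), round(long, 3))     # lengths no longer dominate the score
```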
The Substitution Model P(T|T′)
In the case of speech recognition alternatives we didn't get confidence scores from our speech recognition systems, so we take the approach of applying a fixed score penalty h_speech for choosing an alternative that is not the best hypothesis:
S_speech(T | T′) = h_speech if T ≠ T′, (19)
= 0 otherwise.
If, on the other hand, confidence scores are associated with the alternative interpretations, these may be used to generate the substitution scores for T given T′.
In the case of spelling correction, we count the number of character edits made, n_char, and the number of sounds-like edits made, n_soundslike, and apply penalties h_char and h_soundslike, respectively, for each:
S_spelling(T | T′) = h_soundslike · n_soundslike + h_char · n_char (20)
It should be noted that implementations are contemplated in which a Hidden Markov Model is trained for edits based on training examples.
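By way of illustration of equations (19) and (20), a minimal sketch with hypothetical penalty values (in practice these would be tuned against logged data):

```python
def speech_substitution_score(hypothesis, received, h_speech=-3.0):
    """Eq. (19): fixed penalty for choosing a non-best recognition alternative."""
    return 0.0 if hypothesis == received else h_speech

def spelling_substitution_score(n_char, n_soundslike,
                                h_char=-1.0, h_soundslike=-2.0):
    """Eq. (20): per-edit penalties for character and sounds-like edits."""
    return h_soundslike * n_soundslike + h_char * n_char

print(speech_substitution_score("unladen swallow", "oven laden swallow"))  # -3.0
print(spelling_substitution_score(n_char=2, n_soundslike=1))               # -4.0
```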
The Final Score and Thresholding
For the final score we compare the score for the hypothesized text and understanding to the score for not making any substitutions and not understanding the input. This makes sure that we don't reject a substitution simply because the words it contains are unlikely, but instead looks at whether we are making the data more likely than it was before by making the changes:
S(T, U | T′) = S(T | T′) + S(T | U) + S(U) − S(T′ | T′) − S(T | U_not-understood) − S(U_not-understood) (21)
where U_not-understood represents the hypothesis that the input should not be understood.
To decide whether to make a substitution or just leave the input as being not understood we threshold this score and only perform the change if the score exceeds the threshold. According to a particular implementation, the selection of the threshold is a manual process based on an evaluation of the quality of results with different scores from a test set.
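A minimal sketch of the relative scoring of equation (21) and the threshold decision follows; all component scores and the threshold are hypothetical, and the baseline substitution term S(T′ | T′) is zero per equation (19) since no substitution is made.

```python
def final_score(s_sub, s_text_u, s_u, s_text_baseline, s_u_baseline):
    """Eq. (21): hypothesis score relative to the no-substitution,
    not-understood baseline (the baseline's substitution score is 0.0)."""
    return (s_sub + s_text_u + s_u) - (0.0 + s_text_baseline + s_u_baseline)

def accept_substitution(score, threshold):
    """Only perform the substitution if the relative score clears the threshold."""
    return score > threshold

# Hypothetical component scores for the "unladen swallow" example:
score = final_score(s_sub=-3.0, s_text_u=-0.5, s_u=-4.0,
                    s_text_baseline=-9.0, s_u_baseline=-1.5)
print(score, accept_substitution(score, threshold=1.0))   # 3.0 True
```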
Model Component Evaluation
We can evaluate the quality of the components P(T|U) and P(U) of the model using a score referred to as perplexity. For example, if we want to calculate the perplexity of a sequence of test words w_1^n using a model P(w), the perplexity is defined as:
PP(w_1^n) = exp(−(Σ_i log P(w_i)) / N) (22)
In general, better models will have lower perplexity scores for held-out test sets.
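A brief illustration of the perplexity calculation of equation (22):

```python
import math

def perplexity(word_probs):
    """Eq. (22): exp of the negative average log probability of the test words."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

uniform_model = [0.01] * 5          # every test word assigned probability 1/100
better_model = [0.05, 0.1, 0.02, 0.08, 0.04]
print(perplexity(uniform_model))    # 100.0: perplexity of a uniform 1/100 model
print(perplexity(better_model))     # lower perplexity indicates a better model
```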
According to some implementations, both the input graph and the template graph may be expanded to include edges that represent multiple speech alternatives (e.g., misspellings or homophones) for particular words or phrases to expand the range of possible inputs and matches. The selection of these alternatives for inclusion may be guided by some of the probabilities used in the statistical scoring. While traversing the graphs, terminal nodes reached via paths including such alternatives could be penalized in a way that is represented in the final statistical scoring. As will be understood, such an approach supports score computation that approximates expression (4) as the graphs are being traversed, i.e., removing U related terms from the equation by replacing P(T|U) with P(T), and replacing P(U) with the probability that any arbitrary template would be matched. This would be an approximation of the score calculated once a terminal node corresponding to a particular U is reached. At the end, the possible matching templates could be prioritized initially by these “accumulated” scores and then reordered using the final scoring.
While the subject matter of this application has been particularly shown and described with reference to specific implementations thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed implementations may be made without departing from the spirit or scope of the invention. Examples of some of these implementations are illustrated in the accompanying drawings, and specific details are set forth in order to provide a thorough understanding thereof. It should be noted that implementations may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to promote clarity. Finally, although various advantages have been discussed herein with reference to various implementations, it will be understood that the scope of the invention should not be limited by reference to such advantages. Rather, the scope of the invention should be determined with reference to the appended claims.
Other Publications
Aho et al., "Efficient String Matching: An Aid to Bibliographic Search," Communications of the ACM, vol. 18, no. 6, Jun. 1975.
Macherey, "Statistical Methods in Natural Language Understanding and Spoken Dialogue Systems," Sep. 22, 2009.
U.S. Appl. No. 13/711,478, filed Dec. 11, 2012, Lilly et al.
U.S. Appl. No. 13/896,078, filed May 16, 2013, Tunstall-Pedoe.
U.S. Appl. No. 13/896,144, filed May 16, 2013, Tunstall-Pedoe.
U.S. Appl. No. 13/896,611, filed May 17, 2013, Tunstall-Pedoe.
U.S. Appl. No. 13/896,857, filed May 17, 2013, Tunstall-Pedoe.
U.S. Appl. No. 13/896,878, filed May 17, 2013, Tunstall-Pedoe.
U.S. Appl. No. 13/899,171, filed May 21, 2013, Tunstall-Pedoe et al.
U.S. Appl. No. 13/925,246, filed Jun. 24, 2013, Holmes.
U.S. Appl. No. 13/925,627, filed Jun. 24, 2013, Tunstall-Pedoe et al.
Mishra, Taniya et al., "Finite-state Models for Speech-based Search on Mobile Devices," Natural Language Engineering 1 (1), (1998), Cambridge University, United Kingdom.
Sagae, K. et al., "Hallucinated N-Best Lists for Discriminative Language Modeling," 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 25-30, 2012, pp. 5001-5004 (paper based on work done as part of a 2011 CLSP summer workshop at Johns Hopkins University).
Katz, S., "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3 (1987), pp. 400-401.
Gale et al., "Good-Turing Frequency Estimation Without Tears," Journal of Quantitative Linguistics, vol. 2, no. 3 (1995), pp. 217-237.
U.S. Appl. No. 14/456,324, filed Aug. 11, 2014, Tunstall-Pedoe.
U.S. Appl. No. 14/828,176, filed Aug. 17, 2015, Overell et al.