1. Field
Exemplary embodiments herein relate to a method and apparatus for performing speech recognition.
2. Description of Related Art
Typically, speech recognition is accomplished through the use of an Automatic Speech Recognition (ASR) engine. An ASR works by obtaining audio of a phrase of one or more spoken words, converting the phrase into several potential textual representations, and assigning a confidence score to each textual representation.
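By way of illustration only, a highly simplified sketch of such an N-best output might look as follows; the structure, names, and scores are assumptions made for the example and are not part of any particular disclosed engine:

```python
# Illustrative sketch only: a hypothetical ASR result in which one utterance
# maps to several candidate transcriptions, each with a confidence score.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str          # one potential textual representation
    confidence: float  # score in [0.0, 1.0]

def recognize(audio: bytes) -> list[Hypothesis]:
    # A real engine would decode the audio; fixed examples stand in here.
    return [
        Hypothesis("recognize speech", 0.92),
        Hypothesis("wreck a nice beach", 0.31),
    ]

best = max(recognize(b"..."), key=lambda h: h.confidence)
print(best.text)  # the highest-confidence textual representation
```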
An ASR can be thought of as an engine and a model. For purposes of this disclosure, a speech engine takes a spoken utterance, compares the utterance to a vocabulary, and matches the utterance to words or phrases in the vocabulary. Speech recognition engines generally require two library files to recognize speech. The first library file is an acoustic model, which is created by taking audio recordings of speech and their associated transcriptions (taken from a speech corpus) and 'compiling' the transcriptions into statistical representations of the sounds that make up each word (through a process called 'training'). The second library file is a language model, sometimes referred to as a grammar file. A language model may be in the form of a file containing the probabilities of sequences of words. A grammar file is a much smaller type of language model file containing sets of predefined combinations of words. Language models are typically used for dictation applications, whereas grammar files are generally used in desktop command and control or telephony interactive voice response (IVR) type applications.
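The contrast between the two library files can be illustrated with a minimal sketch; the probabilities and phrases below are invented for the example:

```python
# Illustrative contrast between the two kinds of language model files
# described above; all values are invented assumptions.

# A (tiny) statistical language model: probabilities of word sequences.
bigram_probs = {
    ("turn", "on"): 0.40,
    ("turn", "off"): 0.35,
    ("turn", "left"): 0.05,
}

# A grammar file: a much smaller set of predefined word combinations.
grammar = {
    "turn on the light",
    "turn off the light",
    "call home",
}

def grammar_accepts(phrase: str) -> bool:
    return phrase in grammar

print(bigram_probs[("turn", "on")])   # dictation-style probability lookup
print(grammar_accepts("call home"))   # command-and-control membership test
```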
Traditionally, both the acoustic model and the language model are static, i.e., they are hard-wired as part of the ASR.
Exemplary embodiments of the present application relate to speech recognition using an ASR having a dynamically programmable language model, a dynamically programmable acoustic model, or both the dynamically programmable language model and the dynamically programmable acoustic model.
An exemplary speech recognition system having such dynamically programmable models is described below.
Exemplary embodiments described herein may increase the accuracy and speed of an ASR by dynamically updating its language model and its acoustic model.
Speech recognition (by a machine) is a very complex problem. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Speech is distorted by background noise, echoes, and electrical characteristics. Accuracy of speech recognition may vary according to: vocabulary size and confusability; speaker dependence vs. independence; isolated, discontinuous, or continuous speech; task and language constraints; read vs. spontaneous speech; and adverse conditions.
Source 120 provides human speech to the system 110. Source 120 may be a live speaker, the public Internet, a data file, etc. Input 130 is a module configured to receive the human speech and, if the speech has not already been digitized, to digitize said speech into a machine-readable form. ASR 140 is a module configured as an automatic speech recognizer to receive the speech in machine-readable form and convert the speech into text. The ASR 140 includes an acoustic model 140a and a language model 140b.
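A simplified sketch of this module chain follows; the class and method names are illustrative assumptions and not the disclosed implementation:

```python
# Sketch of the chain described above: source 120 supplies speech, input 130
# digitizes it if necessary, and ASR 140 converts it to text.

class InputModule:
    """Input 130: digitizes speech into machine-readable form if needed."""
    def process(self, speech):
        if isinstance(speech, bytes):  # already digitized
            return speech
        return self.digitize(speech)   # e.g. sample and quantize an analog signal

    def digitize(self, speech) -> bytes:
        raise NotImplementedError      # hardware- or source-specific

class ASRModule:
    """ASR 140: holds an acoustic model 140a and a language model 140b."""
    def __init__(self, acoustic_model, language_model):
        self.acoustic_model = acoustic_model  # 140a
        self.language_model = language_model  # 140b

    def transcribe(self, audio: bytes) -> str:
        # 1) acoustic model: audio -> likely sound units (e.g. diphones)
        # 2) language model: sound units -> most probable words and phrases
        ...
```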
Acoustic model 140a is a module configured to receive audio recordings of speech, and their text transcriptions, and create statistical representations of the sounds that make up each word. The acoustic model 140a is used by a speech recognition engine to recognize speech. ASR 140 compares the input human speech to the statistical representations of speech contained in the acoustic model 140a to determine the most likely textual translation for said speech. Textual translations of speech by the acoustic model 140a are generally in the form of diphones.
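The notion of comparing input speech to statistical representations can be sketched as follows. Each sound unit is modeled here as a single Gaussian over a one-dimensional feature; real acoustic models (e.g., HMM/GMM or neural networks) are far richer, and all numbers below are invented:

```python
# Toy sketch: score an acoustic feature against per-phone Gaussian statistics
# and pick the most likely sound unit. Values are invented assumptions.
import math

# statistical representation: phone -> (mean, variance) of a feature value
acoustic_model = {"AH": (0.2, 0.05), "S": (0.9, 0.02), "T": (0.7, 0.03)}

def log_likelihood(feature: float, phone: str) -> float:
    mean, var = acoustic_model[phone]
    return -0.5 * (math.log(2 * math.pi * var) + (feature - mean) ** 2 / var)

def best_phone(feature: float) -> str:
    return max(acoustic_model, key=lambda p: log_likelihood(feature, p))

print(best_phone(0.85))  # -> 'S' for this toy model
```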
Language model 140b is a module configured to assign a probability to a sequence of "m" words by means of a probability distribution, i.e., a language model tries to capture the properties of a language and to predict the next word in a speech sequence. Once the acoustic model 140a has created a sequence of phonemes, ASR 140 uses the language model 140b to determine the corresponding words and phrases through various probabilistic models.
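For example, a bigram approximation assigns a probability to an m-word sequence as P(w1..wm) ≈ P(w1) · ΠP(wi | wi-1). A minimal sketch, with all probabilities invented for the example:

```python
# Toy bigram language model: probability of an m-word sequence.

unigram = {"turn": 0.10, "the": 0.20}
bigram = {("turn", "on"): 0.40, ("on", "the"): 0.30, ("the", "light"): 0.05}

def sequence_probability(words: list[str]) -> float:
    prob = unigram.get(words[0], 1e-6)
    for prev, cur in zip(words, words[1:]):
        prob *= bigram.get((prev, cur), 1e-6)  # small floor for unseen pairs
    return prob

# 0.1 * 0.4 * 0.3 * 0.05 = 0.0006
print(sequence_probability(["turn", "on", "the", "light"]))
```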
As noted above, the ASR 140, acoustic model 140a and the language model 140b may be software modules, hardware modules, or a combination of software and hardware modules, whether separate or integrated, working together to perform their associated functions.
The acoustic model 240a can be user selected, i.e., the user selects the desired acoustic model. The acoustic model 240a may instead be automatically selected by onboard software depending on the application. For example, if the application expects a speaker with certain characteristics, e.g., accent, cadence, etc., the application may select an acoustic model optimized for said speaker. Further, the application can select a different acoustic model 240a as needed.
The language model 340b can be user selected, i.e., the user selects the desired language model. The language model 340b may instead be automatically selected by onboard software depending on the application. For example, if the application requires command and control functionality, the application may select a grammar file. Alternatively, if the application expects a speaker with certain characteristics, e.g., a regional diction, or the application is being used in a technical field, the application may select the language model 340b that is likely to produce the highest quality textual representation. Further, the application can select different language models 340b as needed.
Both the acoustic model 440a and the language model 440b can be user selected, i.e., the user selects the desired language and acoustic models. Additionally, either, both, or neither may be automatically selected by software onboard depending on the application. As explained above, if the application expects a speaker with certain characteristics, e.g., accent, cadence, etc., the application may select an acoustic model optimized for said speaker. Similarly, if the application requires command and control functionality, the application may select a grammar file. Alternatively, if the application expects a speaker with certain characteristics, e.g., a regional diction, or the application is being used in a technical field, the application may select the language model 440b that is likely to produce the highest quality textual representation. Further, the application can select different language models 440b as needed.
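A minimal sketch of such run-time selection, in contrast to hard-wired models, might look as follows; the registry contents and selection rules are invented assumptions for the example:

```python
# Sketch: dynamically select (and later swap) an acoustic model and a
# language model based on an application profile. All entries are invented.

ACOUSTIC_MODELS = {"default": "am_generic", "southern_us": "am_southern"}
LANGUAGE_MODELS = {"dictation": "lm_large", "command": "grammar_small",
                   "medical": "lm_medical"}

def select_models(app_profile: dict) -> tuple[str, str]:
    am = ACOUSTIC_MODELS.get(app_profile.get("accent"),
                             ACOUSTIC_MODELS["default"])
    if app_profile.get("command_and_control"):
        lm = LANGUAGE_MODELS["command"]  # small grammar file
    else:
        lm = LANGUAGE_MODELS.get(app_profile.get("domain"),
                                 LANGUAGE_MODELS["dictation"])
    return am, lm

# The application can re-invoke select_models() to swap either model as needed.
print(select_models({"accent": "southern_us", "domain": "medical"}))
```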
This patent application claims priority from U.S. Provisional Patent Application No. 61/835,640 filed on Jun. 17, 2013, in the U.S. Patent and Trademark Office, the disclosure of which is incorporated herein by reference in its entirety.