The present disclosure relates generally to speech recognition, and more particularly to speech recognition techniques for recognizing speech in audio signals of differing bandwidths.
Speech recognition techniques have evolved to the point where they are used in many mobile communication devices, such as cellular phones carried by people or fixed in vehicles. However, the architecture of present techniques is such that a speech recognizer optimized for a wider band voice signal, such as one presented to the speech recognizer from a microphone, does not provide optimum performance when presented with a narrower band voice signal, such as one presented by a Bluetooth device. Present architectures could be optimized for both types of signals, but doing so would require two speech models and nearly double the resources of a single speech recognizer.
The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate embodiments of concepts that include the claimed invention, and explain various principles and advantages of those embodiments.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
The apparatus and method components have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
Referring to
Similar situations may arise in other environments in which speech recognition is performed. For example, a speech recognition system 115 may receive audio that has been passed through a telephone system that restricts the audio to a similarly narrow band, such as from approximately 300 Hz to 3200 Hz. The same speech recognition system 115 may also be intended to process audio that is not bandwidth limited, such as audio conveyed to the speech recognition system 115 through a microphone and audio system as present-day wideband audio extending from below 30 Hz to above 8 kHz.
The speech recognition system 115 accepts the audio signal 111 coming from either type of audio system 110, that is, a narrowband audio signal 111 or a wideband audio signal 111, performs speech recognition using minimized resources and optimal recognition techniques, and presents the results to a user of recognized speech 120. The user of recognized speech 120 may be a function such as a contacts directory, a dialing function, or a memo storage function, to name a few. The speech recognition system 115 and the user of recognized speech 120 may be implemented in a cellular telephone, a game box, a remote control such as a TV remote, or any other communication device that accepts voice audio.
Referring to
The narrowband cepstrum transform 215 performs a conventional cepstrum transform using components of the Fourier transform that are within the narrowband frequency range. The cepstrum transform 215 may be a conventional mel frequency cepstrum transform 215. When a conventional mel frequency cepstrum transform 215 is used, the logarithmic amplitudes of the Fourier transform within the narrow band are mapped onto a conventional mel frequency scale using triangular overlapping windows. A discrete cosine transform is then taken of the logarithmic amplitudes so obtained. The discrete cosine transform coefficients 216, commonly referred to as mel frequency cepstrum coefficients, or MFCCs, of which there are typically 13, are coupled to a speech model 230. First and second time derivatives of each MFCC may be determined, as in conventional speech recognition systems, and included with the MFCCs. When a narrow band signal is being processed, these 39 coefficients form a feature vector for each frame, which is calculated using only the frequency components within the narrow band and is called herein a narrow band feature vector. In certain embodiments the speech model 230 is a hidden Markov model, or HMM, that has been trained as described below. Other Bayesian speech models could be used.
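By way of an illustrative, non-limiting sketch (not part of the original disclosure), the computation of a narrow band feature vector described above might be implemented in Python roughly as follows; the sample rate, FFT size, number of mel filters, and the helper names are assumptions made only for illustration, and the 312 Hz to 3062 Hz narrow band is taken from the embodiment described below.

    import numpy as np
    from scipy.fftpack import dct  # type-II DCT used to obtain cepstral coefficients

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, sample_rate, f_lo, f_hi):
        # Triangular overlapping windows spaced on the mel scale between f_lo and f_hi.
        mel_points = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                fb[i - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):
                fb[i - 1, k] = (right - k) / max(right - center, 1)
        return fb

    def narrowband_mfcc(frame, sample_rate=16000, n_fft=512, n_filters=23, n_ceps=13):
        # 13 MFCCs computed from spectral components inside the narrow band only.
        power_spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # Fourier coefficients 211 (magnitude squared)
        fb = mel_filterbank(n_filters, n_fft, sample_rate, 312.0, 3062.0)
        log_energies = np.log(fb @ power_spectrum + 1e-10)             # logarithmic amplitudes on the mel scale
        return dct(log_energies, type=2, norm='ortho')[:n_ceps]        # MFCCs 216

    def add_derivatives(mfcc_frames):
        # Append first and second time derivatives -> 39-coefficient narrow band feature vectors.
        d1 = np.gradient(mfcc_frames, axis=0)
        d2 = np.gradient(d1, axis=0)
        return np.hstack([mfcc_frames, d1, d2])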
As noted above, the Fourier coefficients 211 are coupled to the out of band transform 220. The out of band transform 220 is set up to have one or more passband filters. Each passband filter selects the Fourier transform coefficients within its passband to generate a band energy parameter for that passband. In certain embodiments each passband filter is triangular in shape. The center of each passband filter is outside the narrowband frequency range. Each edge of each passband filter may overlap another passband filter, or may overlap frequency components that are within but near the edges of the narrowband frequency range. The generation of the band energy parameter for a passband comprises determining log(Er,i/E) for each passband, wherein i is a passband index, Er,i is the energy within passband i, and E is the total energy of the frame. The first and second derivatives are also used, so an energy parameter may comprise three values in certain embodiments. As noted above, one or more energy parameters 221 may be generated since one or more passband filters may be used. In one type of embodiment, the narrowband frequency range is from 312 Hz to 3062 Hz, and there are two triangular passband filters, one having a frequency range from 62 Hz to 312 Hz and another having a frequency range from 3062 Hz to 3968 Hz. The six values for these two parameters may be synchronously combined with the 39 MFCCs for the same frame to form an expanded feature vector, in this case having 45 coefficients for each frame of a wideband audio signal, in accordance with certain embodiments.
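Continuing the illustrative sketch under the same assumptions (the 62-312 Hz and 3062-3968 Hz passbands are taken from the embodiment above; the helper names are hypothetical), the out of band transform 220 and the formation of the 45-coefficient expanded feature vector might look like this:

    import numpy as np

    def triangular_filter(n_fft, sample_rate, f_lo, f_hi):
        # Triangular weighting over FFT bins between f_lo and f_hi, peaking at the center frequency.
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
        center = 0.5 * (f_lo + f_hi)
        w = np.zeros_like(freqs)
        rising = (freqs >= f_lo) & (freqs <= center)
        falling = (freqs > center) & (freqs <= f_hi)
        w[rising] = (freqs[rising] - f_lo) / (center - f_lo)
        w[falling] = (f_hi - freqs[falling]) / (f_hi - center)
        return w

    # Two passbands centered outside the 312-3062 Hz narrow band, per the embodiment above.
    OUT_OF_BAND_PASSBANDS = [(62.0, 312.0), (3062.0, 3968.0)]

    def out_of_band_parameters(power_spectrum, sample_rate=16000, n_fft=512):
        # log(Er,i / E) for each passband i, where E is the total energy of the frame.
        total_energy = np.sum(power_spectrum) + 1e-10
        params = []
        for f_lo, f_hi in OUT_OF_BAND_PASSBANDS:
            w = triangular_filter(n_fft, sample_rate, f_lo, f_hi)
            band_energy = np.sum(w * power_spectrum) + 1e-10
            params.append(np.log(band_energy / total_energy))
        return np.array(params)                                        # passband parameters 221

    def expanded_feature_vectors(mfcc39_frames, oob_frames):
        # Append first and second derivatives of the band parameters, then combine with the
        # 39 MFCC coefficients of the same frames: 39 + 2 passbands x 3 values = 45 coefficients.
        d1 = np.gradient(oob_frames, axis=0)
        d2 = np.gradient(d1, axis=0)
        return np.hstack([mfcc39_frames, oob_frames, d1, d2])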
The one or more parameters are coupled to a switch function 225, which is controlled by a signal 236 that closes the switch 225, coupling the passband parameters 221 to the speech model 230. The passband parameters 221 are coupled to the speech model 230 when a determination has been made that the audio signal 111 is a wideband signal. When the audio signal 111 is instead determined to be a narrowband signal, the signal 236 may be coupled in certain embodiments to the out of band transform 220 to stop it from processing the out of band energy, thereby saving resources such as the energy that otherwise is used to perform the out of band transform and, when the out of band transform is a computer process, the associated computer resources. The control signal is provided by a wide band detector function 235, which may use one or more of signals 211, 221, and 216 to determine when the audio signal 111 is a wideband signal.
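The switch function 225 itself reduces to a simple gate on the control signal 236. A minimal sketch follows, reusing expanded_feature_vectors from the sketch above and assuming the wideband decision is available as a boolean for the utterance (the detector itself is sketched after the next paragraph):

    def assemble_features(mfcc39_frames, oob_frames, wideband_detected):
        # Switch function 225: couple the passband parameters 221 to the speech model 230
        # only when control signal 236 indicates a wideband signal.
        if not wideband_detected:
            # Narrowband case: the out of band transform may be left inactive and only the
            # 39-coefficient narrow band feature vectors are passed to the speech model.
            return mfcc39_frames
        return expanded_feature_vectors(mfcc39_frames, oob_frames)     # 45-coefficient expanded vectors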
Signal 221 comprises the passband parameters determined by filtering and transforming the energy in each passband by the out of band transform 220 according to the formula described above. This may be the only signal needed in certain embodiments to determine whether a wideband signal is present. Clearly, when this signal is used by the wide band detector 235, the out of band transform must remain active, so the coupling of control signal 236 to the out of band transform 220 would not be needed.
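For this option, a hedged illustration of the wide band detector 235 is given below; the decision rule, threshold, and frame fraction are hypothetical tuning choices, not values taken from the disclosure:

    import numpy as np

    def wideband_detector(oob_frames, threshold=-7.0, min_fraction=0.5):
        # Wide band detector 235 operating only on the passband parameters 221.
        # oob_frames holds log(Er,i / E) per frame and passband. A narrowband signal carries
        # almost no energy outside the narrow band, so these log ratios are strongly negative;
        # the signal is treated as wideband when enough frames show appreciable energy in
        # every out-of-band passband.
        energetic = np.all(oob_frames > threshold, axis=1)
        return bool(np.mean(energetic) >= min_fraction)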
Signal 211, which includes the Fourier coefficients of the Fourier transform of the frame, may be used by the wide band detector 235 to evaluate those coefficients that are out of the narrowband frequency range. This is useful when it is concluded, during the design cycle, that the determination of the presence of a wideband signal is accomplished more reliably with some other transform of these Fourier coefficients than the one performed by the out of band transform 220, or is accomplished more reliably with some other transform of the Fourier coefficients in combination with the passband parameters 221.
Input signal 216 may be provided in certain embodiments as an information signal that indicates which type of signal the selected audio system provides: narrow band or wide band. When this input signal 216 is provided, the signals 211 and/or 221 are typically not needed, and the signal 216 can be coupled essentially directly to the switch function 225. In these embodiments, the out of band transform 220 can be deactivated by, for example, the signal 236. In a cellular telephone equipped for Bluetooth as well as direct microphone input, the processing system typically stores a state indicating which of these is the source of the audio that is being speech recognized. This state may be used as the signal 236 in certain embodiments.
Referring now to
Referring to
Referring to
It will be appreciated that, although the embodiments described so far have been described in terms of a narrow band audio signal and a wide band audio signal, the techniques described are easily adapted by one of ordinary skill in the art to a speech recognition system that handles more than two bandwidths of audio signals.
It will be appreciated that some embodiments may comprise one or more generic or specialized processors (or “processing devices”) such as microprocessors, digital signal processors, customized processors and field programmable gate arrays (FPGAs) and unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the methods and/or apparatuses described herein. Alternatively, some, most, or all of these functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of these two approaches could be used.
Moreover, certain embodiments can be implemented as a computer-readable storage medium having computer readable code stored thereon for programming a computer (e.g., comprising a processor) to perform a method as described and claimed herein. Examples of such computer-readable storage mediums include, but are not limited to, a hard disk, a CD-ROM, an optical storage device, a magnetic storage device, a ROM (Read Only Memory), a PROM (Programmable Read Only Memory), an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory) and a Flash memory. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
In the foregoing specification, specific embodiments have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of present teachings. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims as issued.
Moreover, in this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “has”, “having,” “includes”, “including,” “contains”, “containing” or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises, has, includes, or contains a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a”, “has . . . a”, “includes . . . a”, or “contains . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises, has, includes, or contains the element. The terms “substantially”, “essentially”, “approximately”, “about” or any other version thereof are defined as “being close to” as understood by one of ordinary skill in the art, and where they are used to describe numerically measurable items, the term is defined to mean within 15% unless otherwise stated. The term “coupled” as used herein is defined as connected, although not necessarily directly and not necessarily mechanically. A device or structure that is “configured” in a certain way is configured in at least that way, but may also be configured in ways that are not listed.
The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.