Embodiments of the subject matter described herein relate generally to speech recognition systems. More particularly, embodiments of the subject matter relate to speech recognition for potentially incomplete speech data samples.
During use of push-to-talk devices, it is a common occurrence for users to unintentionally shorten (e.g., cut off or “clip”) a message by pressing the push-to-talk button after speech has begun or releasing the push-to-talk button prior to completing an articulated statement. When a user is communicating with a second user (via the push-to-talk device), the second user can often still understand what the first user was saying, even though the second user did not receive the entire message.
When the user is using a push-to-talk device equipped with speech recognition technology, a shortened or clipped message may cause speech recognition algorithms to fail. Additionally, clipping may occur with automatic gain control systems that do not use push-to-talk technology. For example, if a person begins speaking too quietly, the beginning of a command may be clipped. Clips that remove the first part of a message are detrimental to the signal processing algorithms used for speech recognition, including Hidden Markov Models (HMMs). HMMs evaluate each codeword separately and determine the probability of each codeword based on the codeword that preceded it. If the first codeword of an utterance is clipped, the speech recognition system will most likely be unable to recognize what was spoken, leading to poor speech recognition performance.
Accordingly, it is desirable to provide a method for identifying and interpreting clipped speech using speech recognition technology. Furthermore, other desirable features and characteristics will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and the foregoing technical field and background.
Some embodiments of the present invention provide a method for receiving and analyzing data compatible with voice recognition technology. The method receives speech data comprising at least a subset of an articulated statement; executes a plurality of processes to generate a plurality of probabilities, based on the received speech data, each of the plurality of processes being associated with a respective candidate articulated statement, and each of the generated plurality of probabilities comprising a likelihood that an associated candidate articulated statement comprises the articulated statement; and analyzes the generated plurality of probabilities to determine a recognition result, wherein the recognition result comprises the articulated statement.
Some embodiments provide a system for receiving data compatible with speech recognition technology. The system includes a user input module, configured to receive a set of audio data; and a data analysis module, configured to: calculate a plurality of probabilities based on the received set of audio data, each of the calculated plurality of probabilities indicating a statistical likelihood that the set of audio data comprises a candidate word; and determine a speech recognition result, based on the calculated plurality of probabilities.
Some embodiments provide a non-transitory, computer-readable medium containing instructions thereon, which, when executed by a processor, perform a method. In response to a received set of user input compatible with speech recognition (SR) technology, the method: executes a plurality of multi-threaded processes to compute a plurality of probabilities, each of the plurality of probabilities being associated with a respective one of the plurality of multi-threaded processes; compares each of the plurality of probabilities to identify one or more probabilities above a predefined threshold; and presents a recognition result, based on the identified one or more probabilities above the predefined threshold.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A more complete understanding of the subject matter may be derived by referring to the detailed description and claims when considered in conjunction with the following figures, wherein like reference numbers refer to similar elements throughout the figures.
The following detailed description is merely illustrative in nature and is not intended to limit the embodiments of the subject matter or the application and uses of such embodiments. As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any implementation described herein as exemplary is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, there is no intention to be bound by any expressed or implied theory presented in the preceding technical field, background, brief summary or the following detailed description.
The subject matter presented herein relates to methods and apparatus used to interpret received speech data, whether the speech data is a complete or incomplete statement. A statement articulated by a user conveys a set of speech data. The set of received speech data may have been “clipped” or cut off during articulation, or in other words, the received set of speech data may be incomplete due to an omitted portion. The omitted portion may include one or more whole words, phonemes, codewords, or other defined portion of an utterance. A system executes a plurality of signal processing algorithms used for speech recognition, to calculate probabilities associated with: (i) the received speech data being associated with a complete statement, and (ii) the received speech data being associated with an incomplete statement due to a clipped portion.
In the context of this application, the terms “speech recognition” and “voice recognition” are interchangeable. Further, the terms “speech data” and “voice data” are also interchangeable. A sample or set of speech data includes at least one word. One or more words are stored individually, in a system Dictionary. Each word comprises one or more phonemes, which may be defined as any of the perceptually distinct units of sound in a specified language that distinguish one word from another. Phonemes may include, but are not limited to, distinct units of sound associated with the English language. Phonemes provide a phonetic representation of a subset of each word, which may include a portion of the word, up to and potentially including the entire word. Each phoneme may be associated with one or more codewords, or subphonetic representations of portions of a word. Further, words may be referenced using a system Language Model, to retrieve probabilities that individual words and/or word combinations may occur in a received set of speech data.
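By way of non-limiting illustration only, the Dictionary and Language Model structures described above might be represented as sketched below. The words, phoneme spellings, and probability values shown are hypothetical placeholders chosen for illustration and are not taken from any actual embodiment.

```python
# Hypothetical sketch of the Dictionary and Language Model structures described
# above; all words, phoneme spellings, and probability values are illustrative
# placeholders and not contents of any actual embodiment.

# Dictionary: each word maps to its one or more component phonemes.
DICTIONARY = {
    "gear":    ["G", "IH", "R"],
    "down":    ["D", "AW", "N"],
    "heading": ["HH", "EH", "D", "IH", "NG"],
}

# Language Model: bigram probabilities P(next word | previous word).
LANGUAGE_MODEL = {
    ("gear", "down"): 0.60,
    ("gear", "up"):   0.35,
}


def phonemes_for(word):
    """Look up the component phonemes of a word stored in the Dictionary."""
    return DICTIONARY.get(word, [])


def bigram_probability(previous_word, next_word):
    """Return P(next_word | previous_word) from the Language Model, 0.0 if unseen."""
    return LANGUAGE_MODEL.get((previous_word, next_word), 0.0)
```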
Referring now to the drawings,
The speech data recognition system 100 may include, without limitation: a processor architecture 102; a system memory 104; a user interface 106; a signal processing module 108; a system preparation module 110; a parameter module 112; and a data analysis module 114. In practice, an embodiment of the speech data recognition system 100 may include additional or alternative elements and components, as desired for the particular application. For example, additional components such as displays and user input components may be employed without departing from the scope of the present disclosure. For ease of illustration and clarity, the various physical, electrical, and logical couplings and interconnections for these elements and features are not depicted in
The processor architecture 102 may be implemented using any suitable processing system, such as one or more processors (e.g., multiple chips or multiple cores on a single chip), controllers, microprocessors, microcontrollers, processing cores and/or other computing resources spread across any number of distributed or integrated systems, including any number of “cloud-based” or other virtual systems.
The processor architecture 102 is in communication with system memory 104. The system memory 104 represents any non-transitory short or long term storage or other computer-readable media capable of storing programming instructions for execution on the processor architecture 102, including any sort of random access memory (RAM), read only memory (ROM), flash memory, magnetic or optical mass storage, and/or the like. It should be noted that the system memory 104 represents one suitable implementation of such computer-readable media, and alternatively or additionally, the processor architecture 102 could receive and cooperate with external computer-readable media that is realized as a portable or mobile component or application platform, e.g., a portable hard drive, a USB flash drive, an optical disc, or the like.
The user interface 106 accepts information from a user of the speech data recognition system 100, including speech data and information necessary to receive and recognize speech data. User interface 106 may include any means of transmitting user input into the speech data recognition system 100, to include without limitation: a microphone, a push-to-talk or push-to-transmit (PTT) device, a push-to-talk over cellular (PoC) device, or other input device capable of receiving audio data. The user interface 106 may further include a computer keyboard, mouse, touch-pad, trackball, touch-screen device, and/or other input device.
The signal processing module 108 is suitably configured to analyze received speech data to obtain a set of recognized codewords. To accomplish this, the signal processing module 108 can utilize continuous to discrete signal conversion techniques for signal processing (e.g., fast Fourier transforms (FFT), linear predictive coding (LPC), filter banks, etc.) to generate quantized feature vector representations of the received speech data. The signal processing module 108 is also configured to predefine a set number of quantization vectors, or codewords, based on this quantization process. During the quantization process, the signal processing module 108 transforms continuous signals into discrete signals (e.g., codewords).
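A minimal sketch of the quantization step described above is given below, assuming a simple FFT-magnitude feature and a pre-existing codebook of quantization vectors; the frame length, hop size, feature choice, and codebook size are assumptions made for illustration, not a prescribed implementation.

```python
import numpy as np


def extract_features(signal, frame_len=256, hop=128):
    """Split the signal into overlapping frames and compute a simple
    FFT-magnitude feature vector per frame (illustrative feature only)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)


def quantize(features, codebook):
    """Map each continuous feature vector to the index of the nearest
    quantization vector (codeword) in the codebook."""
    return [int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))
            for vec in features]


# Example usage with synthetic audio and a random 64-entry codebook (assumed).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(16000)          # one second at an assumed 16 kHz
    codebook = rng.standard_normal((64, 129))   # 64 codewords; rfft of 256 -> 129 bins
    print(quantize(extract_features(audio), codebook)[:10])
```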
The system preparation module 110 is configured to determine and store a probabilistic relationship between a codeword, recognized by the signal processing module 108, and one of the phonemes associated with a particular language. In certain embodiments, phonemes utilized by the speech data recognition system 100 are associated with the English language. In some embodiments, the speech data recognition system 100 utilizes phonemes associated with a non-English language. Generally, each phoneme is associated with a plurality of codewords. The system preparation module 110 determines the probabilistic relationship between a recognized codeword and a particular phoneme using a plurality of received samples of a particular phoneme.
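One way such a probabilistic relationship might be estimated from labeled training samples is simple relative-frequency counting, as sketched below; the training-sample format and the smoothing constant are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_codeword_phoneme_probabilities(samples, smoothing=1e-6):
    """Estimate P(codeword | phoneme) by relative-frequency counting.

    `samples` is assumed to be an iterable of (phoneme, codeword_sequence)
    pairs, e.g. [("IH", [3, 17, 17, 42]), ("IH", [3, 9]), ...].
    """
    counts = defaultdict(Counter)
    for phoneme, codewords in samples:
        counts[phoneme].update(codewords)

    probabilities = {}
    for phoneme, counter in counts.items():
        total = sum(counter.values())
        probabilities[phoneme] = {
            codeword: (n + smoothing) / (total + smoothing * len(counter))
            for codeword, n in counter.items()
        }
    return probabilities
```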
The parameter module 112 is configured to constrain operation of the speech data recognition system 100 by limiting the interpretations of the received speech data to a set of predefined possibilities retained in system memory 104, generally referred to as a speech data recognition system 100 Dictionary. The Dictionary may include words and/or groups of words, and their corresponding phonemes. Each word in the Dictionary includes one or more “component” phonemes, representing each enunciated sound during articulation of the word. The parameter module 112 can: (i) communicate with the system preparation module 110 to obtain phonemes of a set of received speech data, wherein each phoneme is probabilistically related to a group of received codewords; (ii) compare the phonemes associated with the received speech data with phonemes associated with words stored in the Dictionary; and (iii) limit the candidate words, and their component phonemes, that are further evaluated by the data analysis module 114 (described in more detail below).
The parameter module 112 is further configured to constrain operation of the speech data recognition system 100 by limiting the interpretations of the received speech data contextually, using a Language Model, which is also retained in system memory 104. The Language Model is used to predict the probability of the next word in an utterance, given the previous word spoken. It can be used to identify the probability that a word (and its component phonemes) or a group of words (and their component phonemes) occurs in a set of speech data. The parameter module 112 may identify a limited set of potential words from the Dictionary (and their corresponding phonemes) that may be applicable to the received set of speech data.
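The candidate-limiting role of the parameter module 112 might be sketched as follows, reusing the hypothetical Dictionary and Language Model structures from the earlier sketch; the specific rule shown (match on the first phoneme, then weight each candidate by its Language Model probability) is one illustrative possibility only.

```python
def limit_candidates(first_phoneme, previous_word, dictionary, language_model):
    """Return (word, prior) pairs for Dictionary words whose first phoneme
    matches the first recognized phoneme, weighted by the Language Model."""
    candidates = []
    for word, phonemes in dictionary.items():
        if phonemes and phonemes[0] == first_phoneme:
            prior = language_model.get((previous_word, word), 0.0)
            candidates.append((word, prior))
    return candidates
```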
The data analysis module 114 is suitably configured to determine the probability that a particular string of phonemes (each phoneme associated with one or more codewords) corresponds to a set of received speech data. In certain embodiments, the set of received speech data includes a complete articulated statement, or in other words, a complete set of speech data. In this situation, the data analysis module 114 is configured to determine a probability that a particular string of phonemes corresponds to the set of received speech data. In certain embodiments, the set of received speech data includes an incomplete portion of a complete set of speech data, wherein the complete set of speech data is not received due to an error (e.g., user error, system error, etc.). In this situation, the data analysis module 114 is configured to determine a probability that a particular string of phonemes corresponds to the complete set of speech data.
The data analysis module 114 can execute hidden Markov models (HMMs) to calculate the probability that a sequence of phonemes corresponds to a complete set of speech data, wherein the received set of speech data comprises at least a subset or portion of a complete set of speech data. In certain embodiments, one of the sequence of phonemes is probabilistically related to one or more recognized codewords from a set of received speech data. In some embodiments, the sequence of phonemes may include only recognized phonemes from the set of received speech data. However, in some embodiments, in addition to the recognized phonemes from the set of received speech data, the sequence of phonemes also includes one or more additional phonemes to complete the received set of speech data.
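A minimal forward-algorithm sketch of the HMM probability computation described above is shown below; the state topology, transition matrix, and emission probabilities are assumed for illustration and do not represent trained model parameters.

```python
import numpy as np


def forward_probability(observations, initial, transition, emission):
    """Compute P(observation sequence | model) for a discrete HMM.

    observations: sequence of codeword indices
    initial:      (N,) initial state probabilities
    transition:   (N, N) state transition probabilities
    emission:     (N, M) probability of each of M codewords in each state
    """
    alpha = initial * emission[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ transition) * emission[:, obs]
    return float(alpha.sum())


# Example with an assumed 3-state left-to-right model and 4 possible codewords.
initial = np.array([1.0, 0.0, 0.0])
transition = np.array([[0.6, 0.4, 0.0],
                       [0.0, 0.7, 0.3],
                       [0.0, 0.0, 1.0]])
emission = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.1, 0.7, 0.1, 0.1],
                     [0.1, 0.1, 0.7, 0.1]])
print(forward_probability([0, 1, 2, 2], initial, transition, emission))
```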
In exemplary embodiments, the data analysis module 114 is capable of executing HMMs to calculate the probability that a sequence of phonemes corresponds to a complete set of speech data, as described above. However, in some embodiments, the data analysis module 114 may use other techniques that are capable of temporal pattern recognition, to include neural networks.
The data analysis module 114 is further configured to determine a probability that a particular string of phonemes can be used in a correct word combination applicable to a candidate word; and, when more than one candidate string of phonemes can correspond to a received set of speech data, to compare the probabilities to determine a specified number of options.
The data analysis module 114 is configured to execute a number of processes, each of the processes including at least one Hidden Markov Model (HMM). Each process represents a particular number of potentially omitted phonemes. For example, in one scenario, the speech data recognition system 100 may be configured to perform analysis relating to zero (0) clipped phonemes, one (1) clipped phoneme, and two (2) clipped phonemes. In another scenario, the speech data recognition system 100 may be configured to perform analysis relating to zero (0) clipped phonemes, one (1) clipped phoneme, two (2) clipped phonemes, and three (3) clipped phonemes. A speech data recognition system 100 may be configured to perform analysis for any desired number of clipped phonemes; however, as the assumed number of clipped phonemes (and thus the number of executed processes) increases, the resulting probabilities become progressively less accurate and the processing requirements increase substantially.
Each executed process, associated with a particular number of potentially omitted phonemes, includes one or more Hidden Markov Models (HMMs). Each HMM is executed to determine the probability that a particular string of phonemes corresponds to the set of received speech data. Once executed, the HMMs generate a set of data including a plurality of probabilities, each probability indicating the likelihood that a particular string of one or more phonemes (including known phonemes and unknown candidate phonemes) can be used in a correct word and/or word combination applicable to a candidate articulated statement. Each HMM produces a list of words and/or phrases that were potentially articulated by a user (and consequently, at least partially received by the speech data recognition system 100), and each of the words or phrases on the list is associated with a probability of its occurrence. The resultant probabilities from all HMMs are compared to determine the most likely word or phrase that was spoken, or in other words, a recognition result.
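By way of non-limiting illustration, the comparison across clip hypotheses might proceed as sketched below, where score_hypothesis is a hypothetical helper standing in for the HMM evaluation of one candidate statement under one assumed number of clipped phonemes.

```python
def recognize_with_clip_hypotheses(codewords, candidate_statements,
                                   score_hypothesis, max_clipped=2):
    """Evaluate each candidate statement under 0..max_clipped assumed clipped
    phonemes and return the best (statement, clipped_count, probability).

    `score_hypothesis(codewords, statement, clipped)` is a hypothetical helper
    assumed to run the HMM(s) for that hypothesis and return a probability.
    """
    best = (None, 0, 0.0)
    for statement in candidate_statements:
        for clipped in range(max_clipped + 1):
            probability = score_hypothesis(codewords, statement, clipped)
            if probability > best[2]:
                best = (statement, clipped, probability)
    return best
```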
In practice, the signal processing module 108, the system preparation module 110, the parameter module 112, and the data analysis module 114 may be implemented with (or cooperate with) the processor architecture 102 to perform at least some of the functions and operations described in more detail herein. In this regard, signal processing module 108, the system preparation module 110, the parameter module 112, and the data analysis module 114 may be realized as suitably written processing logic, application program code, or the like.
It should be noted that clips could also occur in situations where Automatic Gain Control is used. In certain embodiments using Automatic Gain Control, the process 200 is continuously “listening” for a user to articulate a set of speech data, and an indication of the point in time at which the process 200 begins to receive speech data is not required. In some embodiments, push-to-talk or keyword technology may also be used. For Automatic Gain Control scenarios, if a first portion of the articulated speech data is spoken quietly or there is an increased amount of audio interference, the speech data may be “clipped”. Here, a portion of the received speech data may not be appropriately received and interpreted, and the received set of speech data is rendered incomplete.
Next, the process 200 executes a plurality of processes to generate a plurality of probabilities based on the received speech data, each of the generated plurality of probabilities comprising a likelihood that an associated candidate articulated statement comprises the articulated statement (step 204). In certain embodiments, the plurality of processes is executed in a multi-threaded fashion, performing the analysis associated with each process simultaneously. Each process may perform analysis for a designated quantity of clipped or omitted speech data, and each process may include one or more Hidden Markov Models (HMMs) corresponding to the quantity of omitted speech data. The generated probabilities are associated with each HMM, including probabilities directly associated with specified quantities of omitted voice data (e.g., omitted strings of codewords).
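A minimal sketch of such multi-threaded execution, using Python's standard concurrent.futures module, appears below; the per-hypothesis scoring function is assumed to exist and is supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def score_hypotheses_concurrently(codewords, hypotheses, score_hypothesis):
    """Run one scoring process per clip hypothesis in parallel threads.

    `hypotheses` is a list of (candidate_statement, clipped_count) pairs, and
    `score_hypothesis` is the assumed per-hypothesis HMM evaluation function.
    Returns a dict mapping each hypothesis to its computed probability.
    """
    with ThreadPoolExecutor() as pool:
        futures = {
            pool.submit(score_hypothesis, codewords, statement, clipped):
                (statement, clipped)
            for statement, clipped in hypotheses
        }
        return {futures[future]: future.result() for future in futures}
```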
The process 200 then analyzes the generated plurality of probabilities to determine a recognition result, wherein the recognition result comprises at least one candidate articulated statement associated with a respective one of the plurality of probabilities indicating that the articulated statement comprises the at least one candidate articulated statement (step 206). Generally, a threshold probability value is designated as a minimum calculated probability indicating that a string of phonemes comprises an articulated statement. In certain embodiments, a specific result is recognized and presented to the user for verification. In some embodiments, more than one result may be recognized. In this case, more than one calculated probability is a value above a predefined threshold.
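The threshold analysis of step 206 might be sketched as follows; the threshold value and the structure of the scored results are assumptions made for illustration.

```python
def select_recognition_results(scored_hypotheses, threshold=0.5):
    """Return every candidate whose probability exceeds the threshold, ordered
    best-first; the caller may present one or several results for verification."""
    above = [(hypothesis, p) for hypothesis, p in scored_hypotheses.items()
             if p > threshold]
    return sorted(above, key=lambda item: item[1], reverse=True)
```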
Next, the process 300 compares a first phoneme of the received set of speech data to one or more candidate words stored in a system dictionary (step 304). An embodiment of step 304 is presented in
Next, the process 400 utilizes stored probability relationships between codewords and associated phonemes to determine a sequence of phonemes associated with the sequence of codewords (step 404). Following system preparation (see embodiment illustrated in
After determining a sequence of phonemes associated with the sequence of received codewords (step 404), the process 400 recognizes a first phoneme of the sequence of phonemes (step 406). Once the first phoneme of the sequence of phonemes has been recognized (step 406), the process 400 compares the first phoneme to a plurality of candidate first phonemes, each of the plurality of candidate first phonemes being associated with a respective candidate word stored in a system dictionary (step 408). The system dictionary includes stored candidate words and, for each candidate word, a plurality of phonemes associated with each stored word. The first determined phoneme is associated with the first codeword, or first group of codewords, in the sequence of received speech data. The first determined phoneme is compared to a first sequential phoneme for a plurality of candidate words stored in the system dictionary.
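Steps 404 through 408 might be sketched as follows; the manner in which the stored probability relationships are queried, and the helper names shown, are assumptions made for illustration. The subsequent comparison against candidate first phonemes can proceed as in the earlier candidate-limiting sketch.

```python
def most_likely_phoneme(codeword_group, codeword_phoneme_probs):
    """Step 404 (sketch): pick the phoneme most probably related to a group of
    received codewords, using stored P(codeword | phoneme) relationships."""
    best_phoneme, best_score = None, 0.0
    for phoneme, probs in codeword_phoneme_probs.items():
        score = 1.0
        for codeword in codeword_group:
            score *= probs.get(codeword, 1e-9)  # small floor for unseen codewords
        if score > best_score:
            best_phoneme, best_score = phoneme, score
    return best_phoneme


def first_phoneme_of_received_speech(codeword_groups, codeword_phoneme_probs):
    """Steps 404-406 (sketch): determine the sequence of phonemes and return
    the first phoneme, which is then compared against candidate first phonemes
    stored in the system dictionary (step 408)."""
    phonemes = [most_likely_phoneme(group, codeword_phoneme_probs)
                for group in codeword_groups]
    return phonemes[0] if phonemes else None
```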
Returning now to
Once the single speech recognition algorithm has been executed (step 308), the resulting calculated probability is compared to a predetermined probability threshold (step 310). When the calculated probability is above the predetermined probability threshold (the “Yes” branch of 310), the process 300 returns a solution (step 312). Here, the solution is the string of phonemes associated with the calculated probability, for which the speech recognition algorithm was executed in step 308. When the calculated probability is not above the predetermined probability threshold (the “No” branch of 310), the process 300 assumes an incomplete set of speech data and executes a plurality of speech recognition algorithms based on a predefined number of omitted phonemes (step 314).
However, when the first phoneme does not match the first phoneme of at least one candidate word in the system dictionary (the “No” branch of 306), the process 300 assumes that the received set of speech data is incomplete and executes a plurality of speech recognition algorithms based on a predefined number of omitted phonemes. An embodiment of step 314 is presented in
Assuming one clipped phoneme, the process 502 compares a first interpreted phoneme to a second phoneme for each word stored in the system dictionary (step 504). If there is no match (the “No” branch of 506), then the process 502 assuming one clipped phoneme ends (or fails), and no probability will be calculated based on the condition of one clipped phoneme. If there are one or more words in the system dictionary that have a second phoneme that matches the first interpreted phoneme from the set of received speech data (the “Yes” branch of 506), then the process 502 recognizes the matching words (step 510). Here, there are X number of matching words, and X may be greater than or equal to one.
After recognizing X matching words from the system dictionary (step 510), the process 502 populates a database with X values, each value corresponding to the first phoneme of one of the matching words (step 512). An embodiment of the concepts involved in steps 504, 506, 510, and 512 is illustrated in
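By way of further non-limiting illustration, steps 504 through 512 might be sketched as follows; representing the populated database as a simple list of (word, candidate first phoneme) pairs is an assumption about its structure.

```python
def one_clipped_phoneme_candidates(first_interpreted_phoneme, dictionary):
    """Assume exactly one clipped phoneme: find dictionary words whose SECOND
    phoneme matches the first interpreted phoneme (steps 504-506), then record
    the first phoneme of each matching word as a candidate for the clipped
    phoneme (steps 510-512)."""
    candidates = []
    for word, phonemes in dictionary.items():
        if len(phonemes) >= 2 and phonemes[1] == first_interpreted_phoneme:
            candidates.append((word, phonemes[0]))
    return candidates  # an empty list means the one-clipped-phoneme process fails
```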
Returning to
Here, the first group 720 includes a maximum of A possibilities of a single phoneme that has been clipped from an utterance. The second group 730 includes a maximum of B possibilities of a series of two phonemes that have been clipped from the utterance. The third group 740 includes a maximum of C possibilities of a series of n phonemes that have been clipped from the utterance. For purposes of this example, the ellipsis may represent a maximum of D possibilities of a series of phonemes that have been clipped from the utterance, wherein the D possibilities include all possibilities assuming that the number of clipped phonemes is more than two clipped phonemes, but less than n clipped phonemes.
Returning to
Returning to
The comparison of the first phoneme of the received set of speech data to one or more candidate words stored in a system dictionary (step 306) is employed for purposes of potentially decreasing processing requirements by eliminating some otherwise necessary sub-processes. However, in certain embodiments of
Next, the process 800 identifies quantization vectors associated with each of the set of overlapping feature vectors (step 804). After identifying quantization vectors associated with each of the set of overlapping feature vectors (step 804), the process 800 recognizes a codeword linked to each quantization vector (step 806). Here, during the quantization process, the process 800 transforms continuous signals into discrete signals (e.g., codewords).
Next, the process 900 recognizes and stores a plurality of codewords, based on the received plurality of speech data samples (step 904). This process is described above with regard to
After recognizing and storing a plurality of codewords (step 904), the process 900 creates and stores a plurality of probability relationships, each of the probability relationships relating a respective one of the plurality of codewords to the particular phoneme (step 906). From the received plurality of speech samples, the process 900 determines the likelihood that a particular codeword appears in a specific phoneme. These probability relationships are computed and then stored for use in speech recognition. Generally, the probability relationships are stored in a list populated with the words used as part of a speech command, each word being associated with its one or more component phonemes.
Techniques and technologies may be described herein in terms of functional and/or logical block components, and with reference to symbolic representations of operations, processing tasks, and functions that may be performed by various computing components or devices. Such operations, tasks, and functions are sometimes referred to as being computer-executed, computerized, software-implemented, or computer-implemented. In practice, one or more processor devices can carry out the described operations, tasks, and functions by manipulating electrical signals representing data bits at memory locations in the system memory, as well as other processing of signals. The memory locations where data bits are maintained are physical locations that have particular electrical, magnetic, optical, or organic properties corresponding to the data bits. It should be appreciated that the various block components shown in the figures may be realized by any number of hardware, software, and/or firmware components configured to perform the specified functions. For example, an embodiment of a system or a component may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look-up tables, or the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices.
When implemented in software or firmware, various elements of the systems described herein are essentially the code segments or instructions that perform the various tasks. The program or code segments can be stored in a processor-readable medium or transmitted by a computer data signal embodied in a carrier wave over a transmission medium or communication path. The “processor-readable medium” or “machine-readable medium” may include any medium that can store or transfer information. Examples of the processor-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette, a CD-ROM, an optical disk, a hard disk, a fiber optic medium, a radio frequency (RF) link, or the like. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic paths, or RF links. The code segments may be downloaded via computer networks such as the Internet, an intranet, a LAN, or the like.
Some of the functional units described in this specification have been referred to as “modules” in order to more particularly emphasize their implementation independence. For example, functionality referred to herein as a module may be implemented wholly, or partially, as a hardware circuit comprising custom VLSI circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, or the like.
Modules may also be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical modules of computer instructions that may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations that, when joined logically together, comprise the module and achieve the stated purpose for the module.
Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, merely as electronic signals on a system or network.
While at least one exemplary embodiment has been presented in the foregoing detailed description, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiment or embodiments described herein are not intended to limit the scope, applicability, or configuration of the claimed subject matter in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing the described embodiment or embodiments. It should be understood that various changes can be made in the function and arrangement of elements without departing from the scope defined by the claims, which includes known equivalents and foreseeable equivalents at the time of filing this patent application.