The present invention relates to a microphone assembly comprising a phoneme recognizer. The phoneme recognizer comprises an artificial neural network (ANN) comprising at least one phoneme expect pattern and a digital processor configured to repeatedly apply one or more sets of frequency components derived from a digital filter bank to respective inputs of the artificial neural network. The artificial neural network is configured to detect and indicate a match between the at least one phoneme expect pattern and the one or more sets of frequency components.
Portable communication and computing devices such as smartphones, mobile phones, tablets etc. are compact devices which are powered from rechargeable battery sources. The compact dimensions and battery source both put severe constraints on the maximum acceptable dimensions and power consumption of microphones and microphone amplification circuits utilized in such portable communication devices.
Voice activity detection (VAD) approaches and acoustic activity detection (AAD) approaches are important components of speech recognition software and hardware of such portable communication devices. For example, speech recognition applications running on an application or host processor, e.g. a microprocessor, of the portable communication device, may constantly scan the audio signal generated by a microphone searching for voice activity, usually with a MIPS-intensive voice activity recognition algorithm. Since the voice activity algorithm is constantly running on the host processor, the power used in this voice detection approach is significant. Microphones disposed in portable communication devices such as cellular phones often have a standardized interface to ensure compatibility with the corresponding interface of the host processor.
In order to enable a voice recognition feature at all times, the power consumption of the overall solution must be small enough to have minimal impact on the total battery life of the portable communication device. As mentioned, this has not occurred with existing devices.
Because of the above-mentioned problems, some user dissatisfaction with previous approaches has occurred. There is a need for microphone assemblies comprising a phoneme recognizer which in addition to recognizing voice activity of the incoming voice or speech signal is capable of recognizing a specific phoneme or a specific sequence of phonemes representing a key word or key phrase.
A first aspect of the invention relates to a microphone assembly comprising a transducer element configured to convert sound into a microphone signal and a housing supporting the transducer element and a processing circuit. The processing circuit comprising:
The transducer element may comprise a capacitive microphone for example comprising a micro-electromechanical (MEMS) transducer element. The microphone assembly may be shaped and sized to fit into portable audio and communication devices such as smartphones, tablets and mobile phones etc. The transducer element may be responsive to impinging audible sound.
The artificial neural network may comprise a plurality of input memory cells such as RAM, registers, FFs, etc., one or more output neurons and a plurality of internal weights disposed in-between the plurality of input memory cells and each of the one or more output neurons. The plurality of internal weights are configured or trained for representing the at least one phoneme expect pattern by a network training session. Likewise, respective connections between the plurality of internal weights and the one or more output neurons are determined during the network training session to define phoneme configuration data for the ANN representing the at least one phoneme expect pattern as discussed in further detail below with reference to the appended drawings.
The digital processor may comprise a state machine and/or a software programmable microprocessor such as a digital signal processor (DSP).
A second aspect of the invention relates to a method of detecting at least one phoneme of a key word or key phrase in a microphone assembly. The method at least comprising:
A third aspect of the invention relates to a semiconductor die comprising the processing circuit according to any of the above-described embodiments thereof. The processing circuit may comprise a CMOS semiconductor die. The processing circuit 105 may be shaped and sized for integration into a miniature MEMS microphone housing or package.
A fourth aspect of the invention relates to a portable communication device comprising a transducer assembly according to any of the above-described embodiments thereof. The portable communication device may comprise an application processor, e.g. a microprocessor such as a Digital Signal Processor. The application processor may comprise a data communication interface compliant with, and connected to, an externally accessible command and control interface of the microphone assembly. The data communication interface may comprise an industry standard data interface such as I2C, USB, UART, Soundwire or SPI. Various types of configuration data of the processing circuit for example for programming or adapting the artificial neural network and/or the digital filter bank may be transmitted from the application processor to the microphone assembly as discussed in further detail below with reference to the appended drawings.
Embodiments of the invention are described in more detail below in connection with the appended drawings in which:
The skilled artisans will appreciate that elements in the appended figures are illustrated for simplicity and clarity. It will further be appreciated that certain actions and/or steps may be described or depicted in a particular order of occurrence while those skilled in the art will understand that such specificity with respect to sequence is not actually required. It will also be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein.
Approaches, microphone assemblies and methodologies are described herein that recognize a particular phoneme and/or recognize a predetermined sequence of phonemes representing a key word or key phrase using a phoneme recognizer. The phoneme recognizer may comprise an artificial neural network (ANN) and a digital filter bank that both can be individually programmable or configurable via an externally accessible command and control interface of the microphone assembly.
As used herein, a “phoneme” is an abstraction of a set of equivalent speech sounds or “phones”. In some embodiments, the microphone assembly detects a particular key word or key phrase by detecting the corresponding sequence of phonemes representing the key word or key phrase. The present microphone assembly may form part of an “always on” speech recognition system integrated in a portable communication device. The present microphone assembly may reduce system power consumption by robustly triggering on the key word or key phrase in a wide range of ambient acoustic interferences and thereby minimize false trigger events caused by the detection of isolated phonemes uttered in an incorrect sequence. Some exemplary embodiments of the present approaches, microphone assemblies and methodologies may be tuned or adapted to different key words or key phrases and in turn tuned to a particular user through configurable parameters as discussed in further detail below. These parameters may be loaded into suitable memory cells of the microphone assembly on request via the configuration data discussed above, for example, using the previously mentioned command and control interface. The latter may comprise a standardized data communication interface such as I2C, UART and SPI.
The processing circuit 105 further comprises a power supply 108, the specialized key word or key phrase recognizer (KWR) 110, a buffer 112, a PDM or PCM interface 114, a clock line 116, a data line 118, a status control module 120, and a command/control interface 122 configured for receiving commands or control signals 124 transmitted from an external application processor of the portable communication device. The structure, features and functionality of the key word recognizer (KWR) 110 are discussed in further detail below. The buffer 112 is configured to temporarily store audio samples of the multi-bit digital signal generated by the analog-to-digital converter 104. The buffer 112 may comprise a FIFO buffer configured to temporarily store a time segment of audio samples corresponding to 100 ms to 1000 ms of the microphone signal. The key word recognizer (KWR) 110 may repeatedly read one or more successive time frames from the buffer 112 and process these to detect the key word or phrase as discussed below in more detail.
The clock line 116 of the PDM or PCM interface 114 receives an external clock signal supplied to the microphone assembly 100 by an external processing device, such as the host processor discussed above. In one aspect, the external clock signal on the clock line 116 is supplied in response to detection of the key word or phrase. The data line 118 is used to transmit the segment of the multi-bit digital signal (i.e. audio samples) stored in the buffer 112 to the host processor, for example encoded as a PCM signal or PCM data stream. The number of audio samples stored in the buffer may correspond to a time period or duration of the microphone signal between 100 ms and 1 second such as between 250 ms and 800 ms. The skilled person will understand that a large storage capacity of the buffer 112 for storage of a large number of audio samples occupies a large memory area on a semiconductor chip on which electronic components and circuits of the microphone assembly are integrated. In one aspect of the invention, the buffer 112 comprises a downsampler reducing the sampling frequency of the incoming audio data stream from a first sampling frequency to a second, and lower, sampling frequency. In this manner, the memory area of the buffer 112 is reduced for a given time period of the microphone signal. The first sampling frequency may for example be 16 kHz and the second sampling frequency 8 kHz. This embodiment of the buffer 112 is discussed in further detail below with reference to
The status control module 120 signals, flags or indicates the detection of the key word or key phrase in the microphone signal to the host processor through a separate and externally accessible pad or terminal 126 of the microphone assembly. The externally accessible pad or terminal 126 may for example be mounted on a certain portion or component of the housing of the assembly. The status control module 120 may be configured to flag the detection of the key word in numerous ways for example by a logic state transition or logic level shift of the associated pad or terminal 126. The host processor may be connected to the externally accessible pad 126 via a suitable input port for reading the status signalled by the pad 126. The input port of the host processor may comprise an interrupt port such that the key word flag will trigger an interrupt routine executing on the host processor and awaking the latter from a sleep-mode or low-power mode of operation. In one embodiment, the status control module 120 outputs a logic “1” or “high” in response to the detection of the key word on the pad 126. The skilled person will understand that other embodiments of the microphone assembly may be configured to signal or flag the detection of the key word or key phrase in the microphone signal to the host processor through the command/control interface 122 discussed below. In the latter embodiment, the key word recognizer 110 may be coupled to the command/control interface 122 such that the latter generates and transmits a specific data message to the host processor indicating a key word detection.
The command/control interface 122 receives data commands 124 from the host processor and may additionally transmit data commands to the host processor in some embodiments as discussed above. The command/control interface 122 may include a separate clock line that clocks data on a data line of the interface. The command/control interface 122 may comprise a standardized data communication interface according to e.g. I2C, USB, UART or SPI. The microphone assembly 100 may receive various types of configuration data transmitted by the host processor. The configuration data may comprise data concerning a configuration and internal weight settings of an artificial neural network (ANN) per phoneme of the key phrase of the key word recognizer 110. The configuration data may additionally or alternatively comprise data concerning characteristics of a digital filter bank of the key word recognizer 110 as discussed in further detail below.
The skilled person will understand that numerous different types of digital filter banks may be used to divide or split the multi-bit/PCM digital signal into the frequency components. In some embodiments, the digital filterbank 301 may comprise a FFT based filter dividing the multibit digital signal into a certain number of linearly spaced frequency bands. In other embodiments, the digital filterbank 301 may comprise a set of adjacent bandpass filters dividing the multibit digital signal into a certain number of logarithmically spaced frequency bands. An exemplary embodiment of the digital filterbank 301 is depicted on
The artificial neural network 400 may comprise 10 or fewer neurons in some embodiments. These ANN specifications provide a compact artificial neural network 400 operating with relatively small power consumption and using a relatively small amount of hardware resources, such as memory cells, making the artificial neural network 400 suitable for integration in the present microphone assemblies. The training of the artificial neural network 400 may be carried out by a commercially available software package such as the Neural Network Toolbox™ available from The MathWorks, Inc. After the training of the artificial neural network 400, the respective phoneme configuration data may be downloaded to the key word recognizer 110 via the command/control interface 122 as respective phoneme expect patterns of the predetermined sequence of phoneme expect patterns. The key word recognizer 110 may therefore comprise a programmable key word or key phrase feature where the sequence of phoneme expect patterns is stored as configuration data in rewriteable memory cells of the artificial neural network 400 such as flash memory, EEPROM, RAM, register files or flip-flops. The key word or key phrase may be programmed into the artificial neural network 400 via data commands comprising the phoneme configuration data. The key word recognizer may receive these phoneme configuration data through the previously discussed command and control interface 122 (please refer to
The sequence of phoneme expect patterns forming the key word or key phrase may alternatively be programmed into the artificial neural network 400 in a fixed or permanent manner for example as a metal layer of a semiconductor mask of the processing circuit 105.
In the following exemplary embodiments of the artificial neural network 400, the key word/phrase to be recognized is ‘OK Google’, but the skilled person will understand that the artificial neural network 400 may be trained to recognize appropriate phoneme expect patterns of numerous alternative key words or phrases using the techniques discussed above.
The upper spectrogram 501 of
The predetermined sequence of individual phonemes for the key phrase ‘OK Google’ is depicted above as the upper spectrogram 501 inside frame 505. In order to recognize the key phrase, the artificial neural network 400 has been trained by multiple speakers, for example pronouncing the key phrase multiple times such as 25 times, and the weights and neuron connections of the artificial neural network 400 are adjusted accordingly to form the sequence of phoneme expect patterns modelling the target or desired sequence of phonemes representing the key word or key phrase. In one embodiment of the artificial neural network 400, the neurons and connections are configured to recognize a single phoneme of the target sequence of phonemes at a time to save computational hardware resources as discussed below. The digital filter bank generates successive sets of normalized power/energy estimates of the frequency components 1-7 for each 10 ms time frame of the multibit digital signal. A current set of normalized power/energy estimates is stored in a FIFO buffer 401 of the artificial neural network 400 as indicated by buffer cells N1(n), N2(n), N3(n) etc. until N7(n) where index n indicates that the set of normalized power/energy estimates belongs to the frequency components of a current time frame. The FIFO buffer 401 also holds a plurality of sets of normalized power/energy estimates of frequency components belonging to the previous time frames of the multibit digital signal where cells N1(n−1), N2(n−1), N3(n−1) etc. illustrate individual normalized power/energy estimates of the time frame immediately preceding the time frame n. Likewise, cells N1(n−2), N2(n−2), N3(n−2) etc. illustrate individual normalized power/energy estimates of the time frame immediately preceding time frame n−1 and so forth for the total number of time frames represented in the FIFO buffer 401.
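By way of illustration only, the generation of one set of normalized power/energy estimates for a single 10 ms time frame may be sketched as follows. The sketch assumes an FFT-type filter bank with seven linearly spaced bands, a 16 kHz sampling frequency and a normalization in which the seven estimates sum to one; the function name `band_powers`, the constants and the normalization rule are assumptions made for the example and are not taken from the described embodiments.

```python
import cmath
import math

SAMPLE_RATE_HZ = 16_000   # assumption: 16 kHz sampling frequency
FRAME_MS = 10             # time frame length from the description
NUM_BANDS = 7             # frequency components 1-7 from the description


def band_powers(frame, num_bands=NUM_BANDS):
    """Return normalized power estimates for linearly spaced bands of one frame."""
    n = len(frame)
    half = n // 2
    # Naive DFT power spectrum for bins 1..half (bin 0, the DC term, is skipped).
    spectrum = []
    for k in range(1, half + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    # Group the bins into equally wide bands and sum the power in each band.
    bins_per_band = len(spectrum) // num_bands
    powers = [
        sum(spectrum[b * bins_per_band:(b + 1) * bins_per_band])
        for b in range(num_bands)
    ]
    total = sum(powers)
    # Hypothetical normalization: estimates sum to 1; a silent frame yields zeros.
    return [p / total if total > 0 else 0.0 for p in powers]


# Example frame: a 2 kHz test tone, 160 samples long (10 ms at 16 kHz).
frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000
frame = [math.sin(2 * math.pi * 2_000 * i / SAMPLE_RATE_HZ) for i in range(frame_len)]
estimates = band_powers(frame)
```

With these assumed parameters each band spans eleven 100 Hz bins, so the 2 kHz test tone falls in the second band and dominates the second estimate. A hardware filter bank would typically use an FFT or a set of bandpass filters rather than the naive DFT shown here.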
One embodiment of the FIFO buffer 401 of the artificial neural network 400 may simultaneously store six sets of normalized power/energy estimates representing respective ones of six successive time frames (including the current time frame) of the multibit digital signal corresponding to a 60 ms segment of the multibit digital signal. The FIFO buffer 401 shows only the three most recent time frames n, n−1 and n−2 for simplicity. The six sets of normalized power/energy estimates held in the FIFO buffer 401, i.e. a total of 6*7=42 normalized power/energy estimates for the present embodiment, are applied to a corresponding number of input cells or memory elements 403 of the artificial neural network 400. The memory elements 403 may comprise flip-flops, RAM cells, register files etc. These six sets of normalized power/energy estimates are compared with a first phoneme expect pattern modelling the first phoneme ‘oυ’ of the target phrase.
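The sliding window of six time frames and its 42 network inputs may, purely as a sketch, be modelled as shown below. The class name `FrameFifo`, the newest-first ordering of the inputs and the single sigmoid output neuron with uniform weights are hypothetical choices for the example; the description does not specify the input ordering, the network topology or the weight values.

```python
from collections import deque
import math

NUM_BANDS = 7     # frequency components per time frame (from the description)
NUM_FRAMES = 6    # time frames held in the FIFO (from the description)


class FrameFifo:
    """Holds the most recent NUM_FRAMES sets of band estimates."""

    def __init__(self):
        zero_frame = [0.0] * NUM_BANDS
        # deque(maxlen=...) discards the oldest frame automatically on append.
        self.frames = deque([zero_frame] * NUM_FRAMES, maxlen=NUM_FRAMES)

    def push(self, estimates):
        if len(estimates) != NUM_BANDS:
            raise ValueError("expected one estimate per band")
        self.frames.append(list(estimates))

    def inputs(self):
        """Flatten to 6 * 7 = 42 network inputs, newest frame first (assumed order)."""
        return [value for frame in reversed(self.frames) for value in frame]


def output_neuron(inputs, weights, bias):
    """Hypothetical single output neuron: weighted sum followed by a sigmoid."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


# Push eight frames; only the six most recent (n = 2..7) remain in the FIFO.
fifo = FrameFifo()
for n in range(8):
    fifo.push([float(n)] * NUM_BANDS)
network_inputs = fifo.inputs()

# Placeholder weights and bias, not trained values.
match_flag = output_neuron(network_inputs, [0.1] * 42, -10.0) > 0.5
```

In a trained network the weights would encode the phoneme expect pattern, so that the output exceeds the threshold only when the 42 inputs resemble the modelled phoneme.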
This first phoneme expect pattern is loaded into the artificial neural network 400 during initialization of the key word recognizer 110. Due to the operation of the FIFO buffer 401, a new set of normalized power/energy estimates of the frequency components, corresponding to a new 10 ms time frame, of the multibit digital signal is regularly loaded into the FIFO buffer 401 while the oldest set of normalized power/energy estimates is discarded. Thereby, the artificial neural network 400 will repeatedly compare the first phoneme expect pattern (‘oυ’) with the successive sets of frequency components, as represented by the respective sets of normalized power/energy estimates, held in the FIFO buffer 401. Once a current sample of the six sets of normalized power/energy estimates N1(n), N2(n), N3(n) etc. held in the memory elements 403 matches the first phoneme expect pattern, the output, OUT, of the artificial neural network 400 changes state so as to flag or indicate the detection of the first phoneme expect pattern. Once the first phoneme has been detected, the key word recognizer 110 proceeds to skip the current, i.e. still first, phoneme expect pattern and load a second phoneme expect pattern into the artificial neural network 400. This may be accomplished by adjusting, or loading new weights into, the network 400 and reconfiguring the respective connections between weights and the neurons. The second phoneme expect pattern corresponds to the second phoneme 'kei of the target phoneme sequence. The switch between the different phoneme expect patterns associated with the target key word is carried out by a digital processor. The digital processor of the present embodiment uses a state machine 600 (refer to
representing the key phrase. The respective phoneme expect patterns or masks associated with the four internal states 601, 603, 605, 607 are illustrated as Mask 1-4 below the internal state symbols 601, 603, 605, 607. During operation of the network, the state machine 600 resides in the first internal state 601 monitoring the microphone signal as illustrated by the “No” repetition arrow 611 until the first phoneme has been detected in the incoming microphone signal. In response to the detection of the first phoneme, the state machine 600 proceeds to the second internal state 603 as illustrated by the “Yes” arrow exiting the first state 601. The state machine 600 thereafter resides in the second internal state 603 monitoring the incoming microphone signal for the second phoneme 'kei as illustrated by the “No” repetition arrow until the second phoneme is detected in the incoming microphone signal. In response to detection of the second phoneme within the incoming microphone signal, the state machine 600 proceeds to the third internal state 605 as illustrated by the “Yes” arrow leading out of the second state 603. However, the state machine 600 may further add a time constraint or time window for the detection of the second phoneme during the second internal state 603 as illustrated by comparison box 613. This time window is helpful to ignore false/unrelated detections of the second phoneme under conditions where a time delay between the first phoneme detection and the second phoneme detection is too long to make the phonemes part of the same key word or key phrase. For example if this time delay is larger than one second or several seconds it suggests that the occurrence of the second phoneme is made in another context than the pronunciation of the key phrase or word. 
In other words, the time constraint or time window ensures the existence of an appropriate timing relationship between the occurrence of the first and second phonemes, or any other pair of successive phonemes of the key phrase, consistent with normal human speech production. The time window thereby verifies that the pair of successive phonemes really is part of the same key word or phrase. The length of the time window associated with the second internal state 603 is X2 as indicated inside comparison box 613. The length of X2 may be less than 500 ms such as less than 300 ms measured from the detection of the first phoneme. Hence, the state machine 600 may be configured to reside in the second internal state 603 for at most the 500 ms time window, e.g. between 0 ms and 500 ms. If the duration, t2, of the second internal state 603 exceeds 500 ms, the result of the time window test carried out in comparison box 613 becomes yes and the state machine reverts or jumps to the first internal state 601 as illustrated by arrow 615. On the other hand, if the second phoneme is detected within the time window X2, the state machine 600 proceeds to the third internal state 605 as mentioned above. The state machine 600 thereafter resides in the third internal state 605 monitoring the incoming microphone signal for the third phoneme 'gu as illustrated by the “No” repetition arrow until either the third phoneme is detected or a second time window constraint, t3, operating similarly to the time window constraint discussed above expires. The length of the second time window, t3, associated with the third internal state 605 may be similar to the length of the time window of the second state discussed above, or it may be different depending on the language specifics of the sought after key phrase or key word.
Hence, the state machine 600 may be configured to reside in the third internal state 605 for at the most the duration of the second time window t3 and revert to the first internal state 601 if the third phoneme remains undetected within the second time window t3 as illustrated by arrow 617. In contrast if the third phoneme is detected within the second time window, the state machine 600 in response proceeds to the fourth internal state 607 as illustrated by the “Yes” arrow leading out of the third state 605.
The state machine 600 thereafter resides in the fourth internal state 607 for a maximum period corresponding to a third time window t4 monitoring the incoming microphone signal for the fourth phoneme “gal” as illustrated by the “No” repetition arrow circling through comparison box 618 until either the fourth phoneme is detected or the third time window expires in a similar manner to the third internal state discussed above. If the fourth phoneme remains undetected within the third time window t4, the state machine 600 in response reverts or jumps to the first internal state 601 as illustrated by arrow 619. Alternatively, if the fourth phoneme is detected within the third time window t4, the state machine 600 determines that the sought after sequence of the four individual phonemes representing the key phrase has been detected. In response, the state machine 600 proceeds to raise the detection flag or indication in step 609 at terminal OUT, thereby signalling the detection of the key phrase. Thereafter, the state machine 600 jumps back to the first internal state 601 once again monitoring the incoming microphone signal and awaiting the next occurrence of the key phrase as illustrated by arrow 621.
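The state transitions and time windows described above may be sketched, under simplifying assumptions, as follows. The class name `KeyPhraseStateMachine`, the use of one common 500 ms window for the second to fourth states and the per-frame `step` interface are hypothetical; in the described embodiments each state may have its own window length and the detection input would come from the output of the artificial neural network loaded with the current mask.

```python
class KeyPhraseStateMachine:
    """Steps through the phoneme expect patterns (masks) one state at a time."""

    def __init__(self, num_phonemes=4, window_ms=500):
        self.num_phonemes = num_phonemes
        self.window_ms = window_ms      # assumption: same window for states 2 to 4
        self.state = 0                  # index of the phoneme currently awaited
        self.elapsed_ms = 0             # time spent in the current state

    def step(self, phoneme_detected, frame_ms=10):
        """Advance by one time frame; return True when the key phrase is flagged."""
        if phoneme_detected:
            self.state += 1
            self.elapsed_ms = 0
            if self.state == self.num_phonemes:
                self.state = 0          # resume monitoring for the next occurrence
                return True
            return False
        if self.state > 0:              # the first state has no time window
            self.elapsed_ms += frame_ms
            if self.elapsed_ms > self.window_ms:
                self.state = 0          # window expired: revert to the first state
                self.elapsed_ms = 0
        return False


def run(sm, gaps_ms):
    """Feed detections separated by the given silent gaps; report whether flagged."""
    flagged = sm.step(True)             # first phoneme
    for gap in gaps_ms:
        for _ in range(gap // 10):
            flagged = sm.step(False) or flagged
        flagged = sm.step(True) or flagged
    return flagged


in_time = run(KeyPhraseStateMachine(), [200, 200, 200])    # all gaps inside window
too_slow = run(KeyPhraseStateMachine(), [200, 700, 200])   # second gap too long
```

In the second run the 700 ms gap exceeds the window, so the state machine reverts to the first state and the remaining two detections are not sufficient to complete the sequence.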
The skilled person will understand that the above-described operation of the state machine 600 leads to a reduced risk of false positive detection events of the key word or key phrase because the state machine monitors and evaluates the time relationships between the individual phonemes representing the key word or phrase and skips the sequence if a particular phoneme is missing in the sequence or has an odd time relationship with a preceding phoneme. In the latter situation, the state machine 600 skips the currently detected sequence of phonemes and reverts to the first internal state monitoring the incoming microphone signal for a valid occurrence of the key word or phrase. This reduced risk of false positive detection events of the key word or key phrase is a significant advantage of the present microphone assembly because it reduces the number of times the host processor is triggered by false key word/phrase detection events. Each such false detection event typically leads to significant power consumption in the host processor because asserting the detection flag typically forces the host processor to switch from the previously discussed sleep-mode or low-power mode of operation to an operational mode for example via an interrupt routine running on the host processor.
The skilled person will understand that other embodiments of the key word recognizer 110 may require only a subset of the individual phonemes, e.g. three of the above-discussed four phonemes, representing the key word or phrase to be correctly detected before the detection of the key word is flagged. This alternative mechanism may increase the success rate of correct detections of the key word because accidentally overlooking a single phoneme of the sequence no longer prevents the detection. On the other hand, this entails an increased risk of triggering a false positive key word detection event.
The skilled person will appreciate that the audio bandwidth of the stored multibit digital signal in the buffer memory is reduced for example to approximately one-half of the original audio bandwidth. This reduced audio bandwidth exists, however, only for the duration of the multibit digital signal held in the buffer memory which may be around 500-800 ms. The multibit digital signal held in the buffer memory comprises inter alia the recognized key word or key phrase (e.g. like “OK Google”) when it is emptied and this key word or key phrase will usually not include any significant amount of high frequency content. Hence, this short moment of reduced audio bandwidth of the multibit digital signal may go essentially unnoticed.
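The trade-off between buffer memory and audio bandwidth may be illustrated by the following arithmetic sketch. The 16-bit sample width, the 500 ms buffer duration used in the example and the pair-averaging downsampler are assumptions made for illustration; the description specifies neither the word length nor the decimation filter.

```python
def buffer_bytes(duration_ms, sample_rate_hz, bits_per_sample=16):
    """Memory needed to hold duration_ms of audio; 16-bit samples are an assumption."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bits_per_sample // 8


def decimate_by_two(samples):
    """Crude 2:1 downsampler: average adjacent pairs as a minimal anti-alias step."""
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]


full_rate = buffer_bytes(500, 16_000)   # 500 ms at 16 kHz: 16000 bytes
half_rate = buffer_bytes(500, 8_000)    # 500 ms at 8 kHz: 8000 bytes
halved = decimate_by_two([1, 3, 5, 7])  # [2.0, 6.0]
```

Halving the sampling frequency from 16 kHz to 8 kHz halves the memory required for the same duration, while the representable audio bandwidth drops from 8 kHz to 4 kHz. A practical downsampler would normally use a proper low-pass decimation filter rather than the simple averaging shown here.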