Priority is claimed to application serial no. 202210319615.5, filed Mar. 29, 2022 in China, the disclosure of which is incorporated in its entirety by reference.
The present disclosure generally relates to a voice detection method, and particularly relates to a method for detecting a wearer's voice using an in-ear audio sensor.
Voice detection, commonly called voice activity detection (VAD), indicates whether a section of sound contains human voice. It is widely used and plays an important role in voice processing systems and devices such as earphones, hearing aids, etc. In terms of the physics of pronunciation, voice and noise are distinguishable because the voice production process makes human sound, especially voiced phonemes, different from most noise. In addition, the intensity of voice in a noisy environment is usually higher than that of pure noise, because noisy voice is the sum of uncorrelated human voice and noise. However, accurately distinguishing voice signals from noise remains an industry challenge, for several reasons: the intensity of some voice signals is weak; the types of noise are variable and not always stationary; and, more difficult still, more than 20% of the unvoiced phonemes in human voice have no harmonic structure, are relatively weak in intensity, and have a spectral structure similar to that of some noise. Therefore, accurate voice detection in a noisy environment has always been a challenging task.
Furthermore, unvoiced sound detection is still a difficult and unsolved problem. The prior art lacks a detection mechanism with a low missed detection rate and a low false alarm rate for classifying unvoiced sound, voiced sound, and various noise scenes in voice detection.
In one aspect, one or more embodiments of the present disclosure provide a method of detecting voice using an in-ear audio sensor. The method includes performing the following processing on each frame of input signals collected by the in-ear audio sensor: calculating a count change value based on at least one feature of an input signal of a current frame, wherein the at least one feature includes at least one of a signal-to-noise ratio, a spectral centroid, a spectral flux, a spectral flux difference value, spectral flatness, energy distribution, and spectral correlations between adjacent frames; adding the calculated count change value to a previous count value of a previous frame to obtain a current count value; comparing the obtained current count value with a count threshold; and determining the category of the input signal of the current frame based on the comparison result, wherein the category includes noise, voiced sound, or unvoiced sound.
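The per-frame flow recited above can be sketched as follows. This is an illustrative skeleton only: `compute_count_change` is a placeholder for the feature-based vote calculation elaborated later in this description, and the energy cutoff used inside it is an arbitrary stand-in, not a value from the disclosure.

```python
def compute_count_change(frame):
    # Placeholder feature logic: vote up for high-energy frames, down otherwise.
    # The 0.01 energy cutoff is an arbitrary illustrative value.
    energy = sum(x * x for x in frame) / max(len(frame), 1)
    delta = 1 if energy > 0.01 else -1
    unvoiced_vote = 0  # the real logic derives this from spectral features
    return delta, unvoiced_vote

def classify_frames(frames, count_threshold=0):
    """Classify each frame as noise, voiced, or unvoiced via a running vote count."""
    labels = []
    count = 0  # previous count value, carried from frame to frame
    for frame in frames:
        delta, unvoiced_vote = compute_count_change(frame)
        count += delta  # current count = previous count + count change value
        if count <= count_threshold:
            labels.append("noise")
        elif unvoiced_vote > count_threshold:
            labels.append("unvoiced")
        else:
            labels.append("voiced")
    return labels
```

The count acts as an accumulated vote: evidence of voice raises it across frames, evidence of noise lowers it, and the comparison with the threshold yields the per-frame category.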
The present disclosure can be better understood by reading the following description of non-limiting implementations with reference to the accompanying drawings.
It should be understood that the following description of the given embodiments is for illustrative purposes only, and not restrictive.
The use of a singular term (such as, but not limited to, “a”) is not intended to limit the number of items. Use of relational terms, such as, but not limited to “top”, “bottom”, “left”, “right”, “upper”, “lower”, “downward”, “upward”, “side”, “first”, “second” (“third”, etc.), “inlet”, “outlet”, etc. are used in written descriptions for clarity when specific reference is made to the drawings and are not intended to limit the scope of the present disclosure or the appended claims, unless otherwise stated. The terms “including” and “such as” are illustrative rather than restrictive, and the word “may” means “may, but does not have to”, unless otherwise stated. Notwithstanding the use of any other language in the present disclosure, the embodiments shown in the drawings are examples given for purposes of illustration and explanation, and not the only embodiment of the subject matter herein.
The present disclosure mainly focuses on voice detection for earphone devices. The earphone device includes at least one in-ear audio sensor. The in-ear audio sensor can be, for example, an in-ear microphone. Usually, the in-ear microphone in the earphone can be widely used as a feedback (FB) microphone for the active noise cancellation (ANC) function.
The voice detection method of the present disclosure uses only the signal received by an in-ear audio sensor. Based on key acoustic features, and in particular on combinations of threshold conditions associated with those features, the input signal is detected through a voting mechanism, so that voiced sound, unvoiced sound, and noise can be detected with high accuracy.
The inventors conducted research on signals captured by the in-ear microphone. When the in-ear microphone is worn correctly, that is, when the in-ear microphone is inserted into the human ear and physically isolated from the environment, received ambient noise is greatly attenuated.
Likewise, airborne human sound is also isolated to a certain degree. However, a human voice signal can also be conducted through bone and tissue, as well as through the Eustachian tube, a small channel that connects the throat to the middle ear. Compared with an air-conducted voice signal, the voice signal received by the in-ear microphone shows higher intensity in an extremely low frequency band (for example, below 200 Hz). In the 200-2500 Hz band, however, the intensity of the signal gradually decreases, and the signal almost disappears in higher frequency ranges. Interestingly, the inventors found that an unvoiced sound signal can propagate through the narrow Eustachian tube, albeit at very weak intensity, even in a high frequency band above 6000 Hz.
Based on a comprehensive analysis of sound signals received by the in-ear audio sensor, such as the in-ear microphone, the inventors further summarized the features of voiced sound and unvoiced sound and compared them with various types of noise. Specifically, in the in-ear channel, unvoiced and voiced sound signals differ from noise signals, as summarized below.
Voiced Sound:
Unvoiced Sound:
Noise: any sound that does not belong to the speech of the earphone wearer.
Interference: external sound, the speech of non-wearers, or human voice played by other devices; if such sound leaks into and is picked up by the in-ear audio sensor, it is treated as noise here.
It is worth noting that voiced and unvoiced sound may be superimposed with, and polluted by, noise, so some noise-cancellation preprocessing is required.
The present disclosure provides a causal voice detection method using only an in-ear microphone. Starting from at least one acoustic feature, the method detects voiced sound, unvoiced sound, and noise in voice by means of a combined threshold method, in which all thresholds are based on different categories of acoustic features and together form the combined threshold conditions for voice detection. For example, the acoustic features include, but are not limited to, a signal-to-noise ratio, a spectral centroid, a spectral flux, a spectral flux difference value, spectral flatness, energy distribution, and spectral correlations between adjacent frames. The formulas of several of these features are given below for easy understanding.
Where k represents the frame index; f_n represents the center frequency of the n-th frequency bin in the spectrum; x(k), n̂(k), and X_r(f_n, k) represent the time signal of the k-th frame, the estimated noise floor, and the spectrum value of the received signal at f_n, respectively; f(|x(k)|) represents the total energy or amplitude peak value of the k-th frame; and n_i^s and n_i^e represent the indices of the frequency bins at the beginning and end of the i-th frequency band, respectively.
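For concreteness, the named features can be computed along the following lines. This is an assumed sketch: the exact definitions, band edges, and normalizations used in the disclosure may differ, and the `frame_features` function name and its return keys are illustrative.

```python
import numpy as np

def frame_features(x, x_prev, noise_floor, fs=16000):
    """Illustrative per-frame acoustic features for VAD (assumed definitions)."""
    X = np.abs(np.fft.rfft(x))
    X_prev = np.abs(np.fft.rfft(x_prev))
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)  # center frequency f_n of each bin
    p = X ** 2
    eps = 1e-12
    # SNR: frame energy relative to the estimated noise floor n̂(k)
    snr_db = 10 * np.log10(np.sum(x ** 2) / (np.sum(noise_floor ** 2) + eps) + eps)
    centroid = np.sum(f * p) / (np.sum(p) + eps)  # spectral centroid
    flux = np.sum((X - X_prev) ** 2)              # spectral flux vs. previous frame
    flux_diff = np.sum(X) - np.sum(X_prev)        # signed spectral flux difference
    # Spectral flatness: geometric mean over arithmetic mean of the power spectrum
    flatness = np.exp(np.mean(np.log(p + eps))) / (np.mean(p) + eps)
    # Energy distribution: fraction of energy below 1250 Hz (example band edge)
    low_band = f < 1250.0
    energy_below_1250 = np.sum(p[low_band]) / (np.sum(p) + eps)
    corr = np.corrcoef(X, X_prev)[0, 1]           # spectral correlation, adjacent frames
    return dict(snr_db=snr_db, centroid=centroid, flux=flux, flux_diff=flux_diff,
                flatness=flatness, energy_below_1250=energy_below_1250, corr=corr)
```

For a low-frequency tone, for instance, the centroid stays low, the flatness is small (a peaky, harmonic-like spectrum), and nearly all energy falls below 1250 Hz.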
As shown in
In some embodiments, in order to improve classification accuracy, the method further combines the previous count value of the previous frame to determine the count value of the current frame. The previous count value represents the number of votes for the probability of voice in the previous frame signal, and the current count value represents the number of votes for the probability of voice in the current signal. For example, in S304, the calculated count change value of the current frame is added to the previous count value of the previous frame to obtain the current count value.
Next, in S306, the current count value obtained in S304 is compared with a count threshold. In S308, according to the comparison result, the category of the input signal of the current frame is determined, that is, whether the current frame is voiced sound, unvoiced sound, or noise.
The method shown in
In S402, preprocessing may be performed on a sound signal received through the in-ear microphone. In some embodiments, high-pass filtering may be performed on the received signal first to filter out DC components and low-frequency noise floor. In some embodiments, mild noise cancellation processing (for example, using a minimum tracking method) may also be performed on the signal to eliminate the part of external noise leaked to the in-ear audio sensor. For example, in order to reduce stationary noise that mainly occurs in a low frequency band, noise cancellation by multiband spectral subtraction can be performed. It should be noted that since both noise and unvoiced sound are relatively weak, noise estimation should avoid overestimation to prevent weak unvoiced sound from being severely damaged. The preprocessing in S402 may be preprocessing on the current frame.
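A minimal sketch of the preprocessing of S402, assuming a one-pole DC-blocking high-pass filter and a sliding-window minimum-tracking noise estimate. Both the filter coefficient and the window length are illustrative choices, not values from the disclosure; the deliberately conservative minimum estimate reflects the warning above that overestimating noise would damage weak unvoiced sound.

```python
import numpy as np

def dc_block(x, r=0.995):
    """One-pole DC-blocking high-pass filter (illustrative coefficient r)."""
    y = np.zeros_like(x, dtype=float)
    prev_x = prev_y = 0.0
    for i, xi in enumerate(x):
        y[i] = xi - prev_x + r * prev_y  # removes DC and low-frequency floor
        prev_x, prev_y = xi, y[i]
    return y

def minimum_tracking_noise(frame_powers, window=8):
    """Noise floor per frame as the minimum frame power over a sliding window,
    a conservative estimate that avoids overestimating the noise."""
    floors = []
    for i in range(len(frame_powers)):
        lo = max(0, i - window + 1)
        floors.append(min(frame_powers[lo:i + 1]))
    return floors
```

The tracked floor can then feed a multiband spectral subtraction stage to attenuate stationary low-frequency noise before feature extraction.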
In S404, the estimated SNR of the current frame is compared with an SNR threshold, and the spectral flatness is compared with a corresponding spectral flatness threshold. If the estimated SNR of the current frame is greater than or equal to the SNR threshold, and the spectral flatness is less than or equal to the corresponding spectral flatness threshold, the method proceeds to S406, and calculation of a first count change value is performed. If the estimated SNR of the current frame is less than the SNR threshold, or the spectral flatness is greater than the corresponding spectral flatness threshold, the method proceeds to S408, and calculation of a second count change value is performed.
In some embodiments, in S406, the calculation of the first count change value may include S4062: calculating an addend and S4064: calculating a subtrahend based on a combined threshold condition. The first count change value of the current frame can be obtained based on the calculated addend and subtrahend.
In one example, the combined threshold condition associated with the addend may include combined threshold conditions associated with SNR and spectral flatness. For example, the combined threshold condition may be that the SNR is greater than a minimum SNR and the spectral flatness is less than a certain threshold. If the combined threshold condition is satisfied, the addend is calculated based on the value of the estimated SNR. For example, when the combined threshold condition is satisfied, different addends are obtained depending on the value of the estimated SNR.
In another example, the combined threshold condition associated with the subtrahend may include a plurality of combined threshold conditions associated with at least one of energy distribution, spectral flatness, and spectral centroid. For example, the combined threshold condition for count decrease associated with energy distribution and spectral flatness may define the following conditions: more than 90% of signal energy is distributed below 1250 Hz, and in each frequency band, such as 100-1250 Hz, 1250-2500 Hz, 2500-5000 Hz, and 5000-7500 Hz, the spectral flatness is very high. For example, the combined threshold condition for count decrease associated with energy distribution and spectral flatness may further define the following conditions: more than 95% of signal energy is distributed in 300-1250 Hz, and the spectral flatness of a frequency band below 300 Hz is very high. For example, the combined threshold condition associated with energy distribution and spectral centroid may define the following condition: energy is distributed in the high frequency part, for example, the spectral centroid being above 4000 Hz. It is worth noting that the present disclosure explains, only by taking an example, the principle of the combined threshold conditions, rather than exhaustively or specifically limiting the combined threshold conditions. Those skilled in the art can realize through the principle of the combined threshold conditions disclosed in the present disclosure that the combined threshold condition for count decrease can be formed based on at least one of features listed above such as SNR, spectral centroid, spectral flux, spectral flux difference value, spectral flatness, energy distribution, and spectral correlations between adjacent frames. The subtrahend can be calculated based on the above at least one combined threshold condition for count decrease.
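The addend and subtrahend calculations of S4062 and S4064 might be sketched as follows. The band edges, percentages (90%, 95%, 4000 Hz), and the structure of the conditions follow the examples above; the SNR cutoffs and the per-band flatness cutoff of 0.5 are assumed placeholders, not values from the disclosure.

```python
def count_increase(snr_db, flatness, snr_min=6.0, flatness_max=0.3):
    """Addend per S4062: vote up when SNR exceeds a minimum and flatness is low,
    with a larger addend for a larger estimated SNR (all cutoffs assumed)."""
    if snr_db > snr_min and flatness < flatness_max:
        return 2 if snr_db > 2 * snr_min else 1
    return 0

def count_decrease(feats):
    """Subtrahend per S4064: example combined threshold conditions for count decrease."""
    sub = 0
    # >90% of energy below 1250 Hz AND high spectral flatness in every band
    if feats["energy_below_1250"] > 0.90 and all(f > 0.5 for f in feats["band_flatness"]):
        sub += 1
    # >95% of energy in 300-1250 Hz AND high flatness below 300 Hz
    if feats["energy_300_1250"] > 0.95 and feats["flatness_below_300"] > 0.5:
        sub += 1
    # energy concentrated at high frequency: spectral centroid above 4000 Hz
    if feats["centroid"] > 4000.0:
        sub += 1
    return sub
```

Each satisfied noise-like condition adds one vote against voice; the first count change value is then the addend minus the subtrahend.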
Thus, in S406, based on the addend calculated in S4062 and the subtrahend calculated in S4064, the first count change value is obtained.
Next, in S410, the current count value is calculated. For example, the first count change value calculated in S406 is added to the previous count value of the previous frame to obtain the current count value.
Next, in S412, whether the current count value is greater than a count threshold is determined. If the current count value is greater than the count threshold, the method proceeds to S414, and the input signal of the current frame is determined as voiced sound. If the current count value is less than or equal to the count threshold, the method proceeds to S416, and the input signal of the current frame is determined as noise. The count threshold can be preset. For example, it can be set to 0.
In addition, in some embodiments, the magnitude of the current count value (that is, the magnitude of the number of votes) may also correspond to different probability values, so as to be used for determining the probability that voice is contained.
In addition, in some embodiments, for example, in S418, it is determined whether the subtrahend calculated in S4064 is greater than the count threshold. If the subtrahend is greater than the count threshold, the input signal of the current frame is determined as voice hangover. Voice hangover refers to brief pauses between voice elements or syllables. If the input signal of the current frame is determined as voice hangover, the voice determination for the current frame continues the determination of the previous frame (for example, it is determined as unvoiced sound or voiced sound). By introducing this determining mechanism for voice hangover, the present disclosure classifies the situations in voice detection in more detail, thereby improving the fineness and efficiency of voice detection while avoiding the unnecessary operations that would be caused by treating a very short pause between syllables as noise.
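Steps S410 through S418 might be combined as in the following sketch. The hangover rule is applied here only when the count does not indicate voice, which is one plausible reading of the flow; names and default threshold are illustrative, not the literal claimed implementation.

```python
def decide_first_branch(prev_count, addend, subtrahend, prev_label, count_threshold=0):
    """Sketch of S410-S418: update the running count, then classify the frame."""
    delta = addend - subtrahend       # first count change value (S406)
    count = prev_count + delta        # current count value (S410)
    if count > count_threshold:       # S412 -> S414
        return count, "voiced"
    if subtrahend > count_threshold:  # S418: brief pause between syllables
        return count, prev_label      # voice hangover keeps the previous decision
    return count, "noise"             # S416
```

For example, a frame whose subtrahend exceeds the threshold during a short inter-syllable pause inherits the previous frame's voice label instead of being marked as noise.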
In some other embodiments, in S408, the calculation of the second count change value may include S4082: calculating a voiced sound addend value, S4084: calculating an unvoiced sound addend value, and S4086 calculating the subtrahend. Based on the voiced sound addend value calculated in S4082, the unvoiced sound addend value calculated in S4084, and the subtrahend calculated in S4086, the second count change value of the current frame may be obtained.
In one example, S4082: calculating a voiced sound addend value may include: calculating the voiced sound addend value based on the combined threshold condition for voiced sound. The combined threshold condition for voiced sound may include a plurality of combined threshold conditions associated with at least one of energy distribution, spectral flatness, spectral centroid, and spectral flux. For example, the combined threshold condition for voiced sound associated with energy and spectral flatness may define the following conditions: high energy and low spectral flatness (with a harmonic structure). For example, the combined condition associated with energy distribution may define the following condition: energy attenuates as the frequency increases and substantially disappears at above 2500 Hz. It is worth noting that the present disclosure explains, only by taking an example, the principle of the combined threshold condition for voiced sound, rather than exhaustively or specifically limiting the combined threshold condition for voiced sound. Those skilled in the art can realize through the principle of the combined threshold condition for voiced sound disclosed in the present disclosure that the combined threshold condition for voiced sound can be formed based on at least one of the features of voiced sound listed above such as SNR, spectral centroid, spectral flux, spectral flux difference value, spectral flatness, energy distribution, and spectral correlations between adjacent frames.
In one example, S4084: calculating an unvoiced sound addend value may include: calculating the unvoiced sound addend value based on the combined threshold condition for unvoiced sound. The combined threshold condition for unvoiced sound may include a plurality of combined threshold conditions associated with at least one of energy distribution, spectral flatness, spectral centroid, and spectral flux. For example, the combined threshold condition for unvoiced sound associated with energy distribution and spectral flatness may define the following conditions: a wideband signal, uniform energy distribution in each frequency band, large total spectral flatness, and high spectral flatness in each frequency band. The combined threshold condition for unvoiced sound associated with energy distribution, spectral flux, and spectral flatness can also define the following conditions: at the beginning of voice, the energy is concentrated in a frequency band of 2500-7500 Hz, and the spectral flatness is relatively high at 2500-5000 Hz and 5000-7500 Hz, with increased energy compared to the previous frame (i.e., the spectral flux difference value being positive). It can be understood that the present disclosure explains, only by taking an example, the principle of the combined threshold condition for unvoiced sound, rather than exhaustively or specifically limiting the combined threshold condition for unvoiced sound. Those skilled in the art can realize through the principle of the combined threshold conditions for unvoiced sound disclosed in the present disclosure that the combined threshold condition for unvoiced sound can be formed based on at least one of the features of unvoiced sound listed above such as SNR, spectral centroid, spectral flux, spectral flux difference value, spectral flatness, energy distribution, and spectral correlations between adjacent frames.
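The voiced-sound and unvoiced-sound addend calculations of S4082 and S4084 might be sketched as follows. The conditions mirror the examples above (harmonic structure with low flatness, energy vanishing above 2500 Hz, wideband flat spectra, onset energy in 2500-7500 Hz with positive flux difference); all numeric cutoffs are assumed placeholders.

```python
def voiced_addend(feats):
    """Sketch of S4082: voiced-sound conditions (cutoffs assumed)."""
    add = 0
    # high energy with low spectral flatness, i.e. a harmonic structure
    if feats["energy"] > 1.0 and feats["total_flatness"] < 0.2:
        add += 1
    # energy decays with frequency and essentially vanishes above 2500 Hz
    if (feats["energy_above_2500"] < 0.05
            and feats["band_energy_ratio"][0] > feats["band_energy_ratio"][1]):
        add += 1
    return add

def unvoiced_addend(feats):
    """Sketch of S4084: unvoiced-sound conditions (cutoffs assumed)."""
    add = 0
    # wideband signal: roughly uniform energy per band, high flatness overall and per band
    if (feats["total_flatness"] > 0.5
            and all(f > 0.5 for f in feats["band_flatness"])
            and max(feats["band_energy_ratio"]) < 0.6):
        add += 1
    # onset case: energy concentrated in 2500-7500 Hz, high flatness there,
    # and more energy than the previous frame (positive spectral flux difference)
    if (feats["energy_2500_7500"] > 0.5
            and feats["flatness_2500_5000"] > 0.5
            and feats["flatness_5000_7500"] > 0.5
            and feats["flux_diff"] > 0):
        add += 1
    return add
```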
In another example, S4086: calculating a subtrahend may include: calculating the subtrahend based on at least one combined threshold condition for count decrease. The specific process of calculating the subtrahend in S4086 may be similar to that of the calculation of the subtrahend in S4064, and the details are omitted here.
Thus, in S408, based on the voiced sound addend value calculated in S4082, the unvoiced sound addend value calculated in S4084, and the subtrahend calculated in S4086, the second count change value is obtained.
Next, in S422, the current count value is calculated. For example, the second count change value calculated in S408 is added to the previous count value of the previous frame to obtain the current count value.
Next, in S424, whether the current count value is greater than the count threshold is determined. The count threshold can be preset. For example, it can be set to 0. If the current count value is less than or equal to the count threshold, the method proceeds to S426, and the input signal of the current frame is determined as noise. If the current count value is greater than the count threshold, the method proceeds to S428. In S428, it is further determined whether the unvoiced sound addend value calculated in S4084 is greater than the count threshold. If the unvoiced sound addend value is greater than the count threshold, the method proceeds to S430, and the input signal of the current frame is determined as unvoiced sound. If the unvoiced sound addend value is less than or equal to the count threshold, the method proceeds to S432, and the input signal of the current frame is determined as voiced sound.
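The decision flow of S422 through S432 might be sketched as follows; names and the default threshold of 0 are illustrative, following the example value given above.

```python
def decide_second_branch(prev_count, voiced_add, unvoiced_add, subtrahend,
                         count_threshold=0):
    """Sketch of S422-S432: combine both addends and the subtrahend into the
    second count change value, then classify the frame."""
    delta = voiced_add + unvoiced_add - subtrahend  # second count change (S408)
    count = prev_count + delta                      # current count value (S422)
    if count <= count_threshold:                    # S424 -> S426
        return count, "noise"
    if unvoiced_add > count_threshold:              # S428 -> S430
        return count, "unvoiced"
    return count, "voiced"                          # S432
```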
In some embodiments, the magnitude of the current count value (that is, the magnitude of the value of the votes) and/or the magnitude of the voiced sound addend value may also correspond to different probability values, so as to be used for determining the probability of voice.
In addition, in some embodiments, in S434, it is determined whether the subtrahend calculated in S4086 is greater than the count threshold. If the subtrahend is greater than the count threshold, the method proceeds to S436, and the input signal of the current frame is determined as voice hangover.
The method shown in
Further, in the method of
In some other embodiments, the voice detection method of the present disclosure further includes a method for further correcting the detection result, for example, a method for correcting noise misjudgment by using time-domain features. In an example, if at least one second combined threshold condition is satisfied, the determining result of the signal of the current frame is corrected as noise, wherein the at least one second combined threshold condition includes a combined threshold condition associated with signal energy distribution and spectral correlations between adjacent frames. For example, if the high-frequency part of the signal has a high spectral correlation with that of the previous frame, the determining result of the signal can be corrected as noise. In another example, if multiple continuous frames are determined as an unvoiced sound signal, the determining result of the signal may be modified as noise. The accuracy and robustness of the voice detection method and system of the present disclosure can be further improved by further correcting the voice detection result.
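The two correction rules might be sketched as a post-processing pass over the frame labels. The correlation threshold and the maximum plausible run of unvoiced frames are assumed placeholder values, not values from the disclosure.

```python
def correct_labels(labels, high_freq_corrs, corr_threshold=0.9, max_unvoiced_run=20):
    """Post-correction sketch: (1) a high spectral correlation between the
    high-frequency parts of adjacent frames suggests stationary noise, and
    (2) an implausibly long run of unvoiced frames is relabeled as noise."""
    corrected = list(labels)
    for i, (lab, c) in enumerate(zip(labels, high_freq_corrs)):
        if lab != "noise" and c > corr_threshold:
            corrected[i] = "noise"  # rule 1: noise misjudged as voice
    run_start, run_len = None, 0
    for i, lab in enumerate(corrected + ["noise"]):  # sentinel flushes the final run
        if lab == "unvoiced":
            if run_len == 0:
                run_start = i
            run_len += 1
        else:
            if run_len > max_unvoiced_run:  # rule 2: too many consecutive unvoiced frames
                for j in range(run_start, run_start + run_len):
                    corrected[j] = "noise"
            run_len = 0
    return corrected
```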
Clause 1. In some embodiments, a method for detecting voice using an in-ear audio sensor, comprising:
performing the following processing on each frame of input signals collected by the in-ear audio sensor:
Clause 2. The method according to any one of the above-mentioned clauses, wherein each feature has one or more threshold conditions associated therewith, and wherein the calculating a count change value based on at least one feature of an input signal of a current frame comprises:
Clause 3. The method according to any one of the above-mentioned clauses, further comprising: determining whether the estimated signal-to-noise ratio of the current frame is greater than or equal to a signal-to-noise ratio threshold and the spectral flatness is less than or equal to a spectral flatness threshold; and
Clause 4. The method according to any one of the above-mentioned clauses, wherein the performing calculation of a first count change value comprises:
Clause 5. The method according to any one of the above-mentioned clauses, wherein the performing calculation of a second count change value comprises:
Clause 6. The method according to any one of the above-mentioned clauses, further comprising:
Clause 7. The method according to any one of the above-mentioned clauses, further comprising:
Clause 8. The method according to any one of the above-mentioned clauses, further comprising:
Clause 9. The method according to any one of the above-mentioned clauses, further comprising:
Clause 10. The method according to any one of the above-mentioned clauses, further comprising:
Clause 11. The method according to any one of the above-mentioned clauses, further comprising: if at least one second combined threshold condition is satisfied, correcting the determining result of the signal of the current frame as noise, wherein the at least one second combined threshold condition comprises a combined threshold condition associated with the signal energy distribution and the spectral correlations between adjacent frames.
Clause 12. The method according to any one of the above-mentioned clauses, further comprising: if an input signal of continuous multiple frames is determined as unvoiced sound, modifying the determining result of the input signal of the continuous multiple frames as noise.
Clause 13. In some embodiments, a computer-readable medium, on which computer-readable instructions are stored, when executed by a computer, the computer-readable instructions realizing any one of the methods according to clauses 1-12.
Clause 14. In some embodiments, a system, comprising a memory and a processor, the memory storing computer-readable instructions, when executed by the processor, the computer-readable instructions realizing any one of the methods according to clauses 1-12.
Any one or more of the processor, memory, or system described herein includes computer-executable instructions that can be compiled or interpreted from computer programs created using various programming languages and/or technologies. Generally speaking, a processor (such as a microprocessor) receives and executes instructions, for example, from a memory, a computer-readable medium, etc. The processor includes a non-transitory computer-readable storage medium capable of executing instructions of a software program. The computer-readable medium can be, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof.
The description of the implementations has been presented for the purposes of illustration and description. Appropriate modifications and changes of the implementations can be implemented in view of the above description or can be obtained through practical methods. For example, unless otherwise indicated, one or more of the methods described may be performed by a combination of suitable devices and/or systems. The method can be performed in the following manner: using one or more logic devices (for example, processors) in combination with one or more additional hardware elements (such as storage devices, memories, circuits, hardware network interfaces, etc.) to perform stored instructions. The method and associated actions can also be executed in parallel and/or simultaneously in various orders other than the order described in this disclosure. The system is illustrative in nature, and may include additional elements and/or omit elements. The subject matter of the present disclosure includes all novel and non-obvious combinations of the disclosed various methods and system configurations and other features, functions, and/or properties.
As used in this application, an element or step listed in the singular form and preceded by the word “one/a” should be understood as not excluding a plurality of said elements or steps, unless such exclusion is indicated. Furthermore, references to “one implementation” or “an example” of the present disclosure are not intended to be interpreted as excluding the existence of additional implementations that also incorporate the recited features. The present invention has been described above with reference to specific implementations. However, those of ordinary skill in the art will appreciate that, without departing from the broad spirit and scope of the present invention as set forth in the appended claims, various modifications and changes can be made thereto.
Number | Date | Country | Kind |
---|---|---|---|
202210319615.5 | Mar 2022 | CN | national |