The present invention relates to voice optimization of audio and more precisely to a method and a device for providing voice optimization in noisy environments.
Portable electronic equipment is used by virtually everybody, everywhere. For instance, a mobile phone is carried at all times and can be used to make calls or listen to audio. The audio listened to may be music, but podcasts and audiobooks are becoming increasingly common. With the increased use and portability of electronic devices for communication and entertainment, the risk that the audio is consumed in a noisy environment is increased. With music, a noisy environment may be nothing more than a nuisance, but when it comes to listening to speech audio, the noisy environment may make the speech unintelligible over the noise.
Speech intelligibility of speech audio depends on the signal-to-noise ratio, in this case the ratio between the speech audio and the noise. Historically, speech intelligibility is improved by modifying the signal-to-noise ratio. The brute-force approach is to amplify the voice signal such that it is intelligible over the noise; needless to say, this approach may damage the hearing of the person listening to the speech audio. Another approach, if headphones are used, is to decrease the noise by forming the headphones to attenuate outside noise or by utilizing active noise cancellation. The noise attenuation will depend on the acoustic design and the fit of the headphones on the user. Active noise cancellation requires significant processing power, with increased material cost and energy consumption as a result.
From the above, it is understood that there is room for improvements.
An object of the present invention is to provide a new type of voice optimization which is improved over prior art and which eliminates or at least mitigates the drawbacks discussed above. More specifically, an object of the invention is to provide a method and an audio device that improve the intelligibility of speech or voiced audio in noisy environments. These objects are achieved by the technique set forth in the appended independent claims with preferred embodiments defined in the dependent claims related thereto.
In a first aspect, a method of increasing speech intelligibility of an audio stream comprising speech audio is presented. The method is performed in real-time by an audio device and comprises detecting an ambient noise, estimating an internal noise based on the ambient noise, and determining a voice filter based on the estimated internal noise and the audio stream. The method further comprises applying the voice filter to the audio stream to provide a target audio stream, and outputting the target audio stream to one or more transducers, thereby generating an internal sound of the audio device. In addition to this, the method comprises detecting the internal sound of the audio device, and the step of determining the voice filter is further based on the detected internal sound and comprises subtracting the estimated internal noise from the detected internal sound to provide a true audio stream. The method further comprises updating the voice filter based on a difference between the target audio stream and the true audio stream.
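By way of illustration only, the following minimal Python sketch shows how these steps relate within one real-time frame. Every function body, the frame size and the flat 12 dB primary-path attenuation are placeholder assumptions, not taken from the claims; the masking-based filter design is detailed later in the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256                                       # samples per frame (assumed)

def estimate_internal_noise(ambient):
    # Placeholder for the estimating step: the disclosure uses an RNN to
    # model the primary path; a flat 12 dB attenuation is a stand-in.
    return ambient * 10 ** (-12 / 20)

def determine_voice_filter(internal_noise, audio_frame):
    # Placeholder for the determining step: the real filter is
    # masking-based; unity gain keeps the sketch short.
    return np.ones_like(audio_frame)

audio_frame = rng.standard_normal(N)          # audio stream (one frame)
ambient = 0.1 * rng.standard_normal(N)        # detected ambient noise
internal_noise = estimate_internal_noise(ambient)
g = determine_voice_filter(internal_noise, audio_frame)
target = g * audio_frame                      # target audio stream
internal_sound = target + internal_noise      # stand-in for detected internal sound
true_audio = internal_sound - internal_noise  # subtraction -> true audio stream
update_error = target - true_audio            # drives the voice filter update
```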
In one variant, the step of determining the voice filter further comprises comparing the estimated internal noise to one or more masking thresholds, and updating the voice filter based on the comparing. This is beneficial as it provides an energy- and computationally efficient way of determining whether the audio is masked by the noise or not.
In one variant, said one or more masking thresholds are calculated by performing a critical band analysis of the audio stream. The critical band analysis comprises auditory masking by frequency spreading. This is beneficial as it increases the accuracy of the masking thresholds.
In one variant, the method further comprises filtering the audio stream to compensate for a hearing profile associated with a user of the audio device. This is beneficial as the speech intelligibility is further increased and optimized for the user.
In one variant, the step of determining the voice filter is done after the filtering, such that the determining is based on an audio stream compensated for a hearing profile associated with a user of the audio device. This is beneficial as the same voice filter algorithm may be used regardless of user, and the computational effort may be reduced as some compensation is already applied through the hearing profile.
In one variant, the step of determining the voice filter further comprises determining a playback phon based on a playback volume, and the step of updating the voice filter is further based on an equal loudness contour associated with the determined phon. This is beneficial as the speech intelligibility will change over volume, but not evenly across all frequencies, and compensating for this increases speech intelligibility regardless of playback volume.
In one variant, the step of determining the playback phon is further based on the internal sound. This is beneficial as it gives an accurate reading of the actual sound pressure level experienced by the user.
In one variant, the step of determining the voice filter further comprises smoothing a gain of the voice filter in frequency by convolution using a frequency window function. This is beneficial as it removes unwanted differences between adjacent groups of frequencies.
In one variant, the step of determining the voice filter further comprises averaging the gain of the voice filter using an exponentially weighted moving average comprising one or more weighting parameters. This is beneficial as it removes unwanted variations of the gain over time.
In one variant, the step of determining the voice filter further comprises applying a configurable mixing setting to select the degree to which the voice filter is to be applied to the audio stream. This is beneficial as it makes the amount of improvement customizable and the user may select the desired amount of compensation.
In one variant, the step of estimating internal noise is implemented by one or more Recurrent Neural Networks, RNN. The use of an RNN is beneficial as it allows for an accurate and efficient way of estimating the internal noise.
In one variant, the ambient noise is detected by an external microphone operatively connected to the audio device. This is beneficial as it provides an accurate measure of the ambient noise.
In one variant, the ambient noise is limited to a maximum audio bandwidth of up to 10 kHz, preferably up to 8 kHz. This is beneficial as it further decreases the computational complexity of the present method.
In one variant, the method further comprises applying Active Noise Cancellation, ANC, to the audio stream after applying the voice filter to the audio stream. This is beneficial as the noise of the internal sound is further reduced.
In a second aspect, an audio device is presented. The audio device comprises one or more transducers, at least one internal microphone arranged to detect an internal sound at an ear cavity of a user, and a processing module operatively connected to the internal microphone, to said one or more transducers and to an external microphone. The processing module is configured to perform the method of the present invention.
In one variant, the external microphone is comprised in the audio device. This is beneficial as the data from the microphone is readily available to the processing module.
In a third aspect, an audio system for increasing speech intelligibility in real-time is presented. The system comprises a portable electronic device operatively connected to an audio device and configured to transfer an audio stream comprising speech audio to the audio device, wherein the audio device is the audio device according to the present invention.
In one variant, an ambient noise is sensed by an external microphone comprised in the electronic device, and the electronic device is further configured to transfer the ambient noise sensed by the external microphone to the audio device. This is beneficial as additional noise data may be provided to the audio device by the external microphone of the electronic device. Alternatively or additionally, the audio device may be configured without an external microphone, thus decreasing the cost of the audio device.
In a fourth aspect, a computer program product is presented. The computer program product is configured to, when executed by a processing module, cause the processing module to perform the method of the present invention.
Embodiments of the invention will be described in the following; references being made to the appended diagrammatical drawings which illustrate non-limiting examples of how the inventive concept can be reduced into practice.
Hereinafter, certain embodiments will be described more fully with reference to the accompanying drawings. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete, and will fully convey the scope of the invention, such as it is defined in the appended claims, to those skilled in the art.
The term “coupled” is defined as connected, although not necessarily directly, and not necessarily mechanically. Two or more items that are “coupled” may be integral with each other. The terms “a” and “an” are defined as one or more unless this disclosure explicitly requires otherwise. The terms “substantially,” “approximately,” and “about” are defined as largely, but not necessarily wholly what is specified, as understood by a person of ordinary skill in the art. The terms “comprise” (and any form of comprise, such as “comprises” and “comprising”), “have” (and any form of have, such as “has” and “having”), “include” (and any form of include, such as “includes” and “including”) and “contain” (and any form of contain, such as “contains” and “containing”) are open-ended linking verbs. As a result, a method that “comprises,” “has,” “includes” or “contains” one or more steps possesses those one or more steps, but is not limited to possessing only those one or more steps.
The transducer(s) 35 of the audio device 30 are configured to generate sound directed to an ear cavity of a user of the audio device 30. The audio device 30 is provided with one or more internal microphones 36 arranged to measure the sound generated by the transducers 35. The sound is preferably measured at the ear cavity of the user when the audio device 30 is used by the user. Preferably, one internal microphone 36 is provided to measure the sound generated by each of the transducers 35.
The audio system 1 is further provided with one or more external microphones 5. The external microphone 5 is external to the audio device 30 and may be any suitable microphone 5 operatively connected to the processing module 32 of the audio device 30. The external microphone 5 may be comprised in the audio device 30; when, e.g., the audio device 30 is a headset, the external microphone may be arranged to detect a voice of the user of the audio device 30. Alternatively, or additionally, the external microphone 5 may be comprised in the portable electronic device 10 when, e.g., the portable electronic device 10 is a mobile terminal 10.
Turning now to
With reference to
In some embodiments of the method 100, it further comprises filtering 105 the audio stream 20 to compensate for a hearing profile HL(fk) of a user of the system 1. This is beneficial since hearing disabilities and/or impairments of the user are compensated for in addition to the speech intelligibility improvement of the method. Preferably, the hearing profile HL(fk) compensation is applied to the audio source signal 20 prior to applying or determining 130 the voice filter 50, such that the voice filter 50 is determined based on an audio stream 20 compensated for hearing disabilities and/or impairments of the user. This is beneficial since it effectively removes differences between users, and the same methodology for determining 130 the voice filter 50 may be used for all users. Additionally, since the compensation with respect to the hearing capabilities only affects the audio stream 20, it will, in most cases, directly improve the speech intelligibility. Further to this, the hearing profile compensation can be accounted for in the determination of the voice filter 50, and no pre-processing is required until the voice filter 50 is applied 140. The relation between the hearing profile HL(fk) and the ambient noise 40 may be important, as the processing to increase speech intelligibility may, for some users, have reduced effect with respect to the hearing capabilities of the user if the user's hearing profile HL(fk) is not considered.
It should be mentioned that in audio systems 1 or audio devices 30 utilizing noise cancellation techniques, such as active noise cancellation, ANC, the noise cancellation is preferably applied to the audio stream 20 after the voice filter 50. This is beneficial since noise cancellation decreases the noise level but, at the same time, may cause the audio signal to become distorted. The degree of distortion depends on the configuration of the noise cancellation and the tuning and/or calibration of the noise cancellation technique.
The target audio stream 20′ may become distorted or otherwise negatively impacted by e.g. the digital-to-analogue converter, the transducer operation and the position of the audio device 30 on the user. Consequently, it is beneficial to detect the internal sound 37 presented at the eardrum of the user after removal of noise by the noise cancellation, compare the true audio stream 37′, i.e. the internal sound 37 after removal of internal noise 40′, to the target audio stream 20′, and act to minimize the differences.
As already indicated, the audio system 1 is an audio system 1 with real-time constraints. The audio stream 20 is received as digital samples, either on a per-sample or a per-frame basis. The collection of samples into frames can be done elsewhere or as a part of the system 1, e.g. by the electronic device 10. The audio stream 20 comprises a collection of N samples with sample rate Fs samples per second; these are formed into an audio signal frame with frame (time) index l. The audio stream 20 can be mono or stereo.
The voice filter 50 is preferably based on psychoacoustic masking and comprises a psychoacoustic model derived from a speech intelligibility index or equivalent, e.g. the Articulation Index, the Speech Transmission Index or Short-Term Objective Intelligibility, and the theory of tonal masking of noise. A frequency gain of the voice filter 50 is calculated such that the internal noise 40′ is masked by the target audio stream 20′; this will be explained in more detail in other sections of this disclosure.
The estimated internal noise 40′ may be provided in a number of different ways, and the input to the estimation 120 of the internal noise 40′ is at least the ambient noise 40 detected 110 by one of the external microphones 5. The ambient noise 40 is provided as a microphone signal by the external microphone and is preferably represented in a frame-wise construction substantially equal to that of the audio stream 20. The microphone signal may also be a stereo signal; such a signal is typically termed a dual microphone signal. A dual microphone signal comprises two independent microphone signals formatted as a single stereo signal. As previously explained, there may be several external microphones 5 in the system 1, and the step of estimating 120 the internal noise 40′ may comprise determining to use e.g. only one microphone signal of all microphone signals provided by the external microphones 5. A decision pertaining to which external microphone to use may be based on e.g. the highest signal level, proximity to the transducer etc. All external microphones 5 may be processed separately in order to obtain the ambient noise 40 from each of the processed external microphones 5. The external microphones 5 may be processed to obtain a stereo signal, and even processed to obtain a direction of the ambient noise 40 such that each transducer 35 may be associated with a different ambient noise 40.
As the skilled person will understand after digesting the teachings herein, there may, based on resource management and optimization of available processing power in the real-time audio system 1, be a necessary trade-off whether or not to utilize several external microphones 5 for detecting 110 the ambient noise 40. This trade-off may be application dependent. When the audio device 30 is e.g. a pair of headphones 30, and if enough processing capabilities exist, two separate estimations 120 of the internal noise 40′ are feasible, one for a left ear of a user and one for a right ear of the user. However, if e.g. processing capabilities are insufficient or if stringent requirements on current consumption exist, a rational assumption may be that the ambient noise 40 is substantially equal for the left and the right ear, and the same internal noise 40′ may be utilized for both ears.
The external microphones 5 may be sampled with a different sample rate compared to the audio signal. Given that the important frequency range for voice communication is up to 8 kHz, the external microphones 5 may be bandlimited to a maximum bandwidth of 10 kHz, or preferably 8 kHz. The lower bandwidth reduces processing load, memory load and current consumption. The maximum bandwidth may be reduced even further to additionally reduce the processing load, the memory load and the current consumption, but the maximum bandwidth has to be traded off against requirements on the ambient noise 40.
It should be noted, as the skilled person will appreciate after reading this disclosure, that the external microphones 5 will produce a signal comprising both the ambient noise 40 and additional sound sources. Only the ambient noise 40 is relevant, which means that e.g. echo originating from the transducer and near-end talk from a listener engaging in a conversation are beneficial to exclude from the signal produced by the external microphones 5. This is beneficial since it reduces the risk that the additional sound sources are misclassified as ambient noise 40.
One solution to avoid misclassifying the additional sound as ambient noise is to use one or more noise estimation techniques, e.g. higher-order statistics, Cepstrum analysis, auto-regressive modelling, or non-parametric methods such as the Welch spectrum and the minimum variance method. Typically, for minimum-effort implementations, the method 100 may stop the detection 110 of ambient noise 40 and/or the estimating 120 of the internal noise 40′ if additional sound sources are detected by the external microphone(s) 5. The distinction between background noise and voice sources may be solved using e.g. a voice activity detector, VAD.
In one embodiment of the present invention, the internal noise 40′ is estimated 120 by a Recurrent Neural Network, RNN. This will be explained in more detail in other sections of the disclosure, but one benefit is that the complexity of implementation and configuration of e.g. estimating 120 the internal noise 40′, detection of additional sounds, voice detection etc. is exchanged for training and operating the RNN, which is well described in the theory of machine learning.
Regardless of how the internal noise 40′ is estimated 120, a representation of the internal noise 40′ comprises average energy values Ev(b), b = 1 . . . NB, of the internal noise 40′ only, for each auditory filter band, or critical band b. The concept of critical bands b is explained in the following sections.
As previously explained, the frequency gain of the voice filter 50 is calculated such that the internal noise 40′ is masked by the target audio stream 20′. In order to accomplish this, the audio stream 20 is represented either in frequency or as a critical band b representation. This may be accomplished by dividing the audio stream 20 into sub-frames, allowing for up to e.g. 50% overlap with a previous sub-frame. The sub-frames may be windowed using a suitable window function, e.g. Hamming, Hanning, triangular windows etc. A power spectrum Px(k) = |X(k)|² is calculated using the sub-framed time domain data and a fast Fourier transform implementation, FFT, where k is a frequency-bin index. The resolution of the frequency transform is preferably selected based on the sample rate Fs and a sub-frame size. Typically, a trade-off between resolution and resource demand is required.
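As a concrete illustration of the sub-framing, windowing and power spectrum computation, consider the following numpy sketch; the sample rate, sub-frame size and the choice of a Hanning window are assumptions rather than requirements of the method:

```python
import numpy as np

Fs = 16000          # sample rate (assumed)
N = 512             # sub-frame size (assumed)
hop = N // 2        # 50% overlap with the previous sub-frame
x = np.random.default_rng(1).standard_normal(Fs)  # stand-in audio stream

window = np.hanning(N)                # Hanning; Hamming/triangular also work
frames = [x[i:i + N] * window for i in range(0, len(x) - N + 1, hop)]
X = np.fft.rfft(frames[0])            # FFT of one windowed sub-frame
Px = np.abs(X) ** 2                   # power spectrum Px(k) = |X(k)|^2
```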
The quantities described in the frequency domain are all represented in sound pressure level, SPL, such that Px(k)/N is the power spectral density in SPL per frequency-bin index k, referenced to a free-field reference point located at the listener's ear cavity. The conversion from digital signals to sound pressure level is done by an appropriate frequency dependent scaling, one scaling frequency function per microphone 5, 36 and one scaling frequency function per transducer 35. The scaling functions are predetermined and/or configurable and preferably stored in a memory operatively coupled to the processing module 32. The scaling functions can be considered a calibration step conducted once during the design or configuration of the audio device 30. Typically, but not limited to, a scaling function for a microphone 5, 36 consists of one scale value per frequency-bin index k and can be estimated from a microphone frequency response. A scaling frequency function for a transducer 35 would correspond to a transducer frequency response including scaling due to e.g. the distance to a reference point, typically the ear of the listener. By way of example, for a pair of headphones 30, the scaling frequency function for the audio stream 20 would be based on a frequency response of the transducer 35 referenced to an ear-reference-point, ERP.
A cochlear model divides the audio stream 20 into NB frequency bands, each frequency band representing one critical band b. The number of critical bands NB can be set according to the desired resolution in the frequency domain, making it possible to directly control the granularity at which the audio stream 20 can be adjusted by the voice filter 50. As the skilled person will understand, there is a trade-off between frequency resolution and resource demand; increasing the resolution requires a larger cochlear model and therefore both a higher computational effort and a more complex implementation. The inventors behind this disclosure have found that 20 frequency bands, NB = 20, is a reasonable choice regarding frequency resolution and computational complexity. Without loss of generality, the division into critical bands b may be made using an equivalent rectangular bandwidth, ERB, scale and a gamma-tone filter bank. Other scales and filter types may be utilized to correctly provide the cochlear model. For a general signal, for each critical band b, an averaged energy ex(b) = (1/N)Σk Px(k)|Fb(k)|² is calculated using the power spectrum, where Fb(k) is the frequency response of the filter associated with the critical band b in the gamma-tone filter bank. The energy in each band is represented as a vector ēx = [ex(1), . . . , ex(NB)]T.
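A sketch of the critical band analysis could look as follows. The Gaussian band responses are a deliberate simplification standing in for the gamma-tone magnitude responses |Fb(k)|², and the 100 Hz to 8 kHz range is an assumption:

```python
import numpy as np

def erb_space(f_lo, f_hi, nb):
    # Center frequencies equally spaced on the ERB-rate scale.
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv(np.linspace(erb(f_lo), erb(f_hi), nb))

Fs, N, NB = 16000, 512, 20
f = np.arange(N // 2 + 1) * Fs / N
fc = erb_space(100.0, 8000.0, NB)

# Stand-in |F_b(k)|^2: Gaussian bumps of ~1 ERB width instead of true
# gamma-tone responses (an assumption made to keep the sketch short).
bw = 24.7 * (4.37e-3 * fc + 1.0)        # ERB bandwidth per band
F2 = np.exp(-0.5 * ((f[None, :] - fc[:, None]) / bw[:, None]) ** 2)

Px = np.abs(np.fft.rfft(np.random.default_rng(2).standard_normal(N))) ** 2
e_x = (F2 @ Px) / N                      # e_x(b) = (1/N) Σ_k Px(k)|F_b(k)|^2
```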
In the following, the psychoacoustic property of masking will be explained in some further detail, and in particular how a first signal can mask a second signal, resulting in the second signal not being perceived. For explanatory purposes, assume that the audio stream 20 is approximated by a tonal signal while the ambient noise 40 is a wideband noise; the theory of tone masking noise then applies.
A masking threshold T(b) associated with the critical band b is calculated based on the critical band analysis of the audio stream 20 which may comprise masking in frequency, simultaneous frequency masking by spreading, and temporal masking by gain smoothing.
The critical band analysis is applied in order to obtain a critical band representation ēs of the audio stream 20 as ēs = [es(1), . . . , es(NB)]. Note that this means that the masking threshold T(b) as described above will, in embodiments where compensation 105 according to a hearing profile HL(fk) is applied first, take hearing impairments of the user into account.
The simultaneous frequency masking by spreading mentioned above may be described by a spreading function SF that models frequency spreading. The spreading function SF may be given by: SFdB(x) = 15.81 + 7.5(x + 0.474) − 17.5·√(1 + (x + 0.474)²) dB, where x has the unit of Barks and SFdB(x) is expressed in dB. It should be noted that frequency spreading is a convolution in the critical band domain and may be represented by a convolutional kernel matrix
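Assuming, for illustration, that adjacent critical bands are spaced roughly one Bark apart, the spreading function and such a convolutional kernel might be realized as:

```python
import numpy as np

def sf_db(x):
    # SF_dB(x) = 15.81 + 7.5(x + 0.474) - 17.5*sqrt(1 + (x + 0.474)^2) dB
    return 15.81 + 7.5 * (x + 0.474) - 17.5 * np.sqrt(1.0 + (x + 0.474) ** 2)

NB = 20
# Kernel entry (i, j) spreads energy from band j to band i, with the band
# distance i - j taken in Barks (1 band ~ 1 Bark is an assumption here).
d = np.subtract.outer(np.arange(NB), np.arange(NB)).astype(float)
C = 10 ** (sf_db(d) / 10)          # power-domain spreading kernel

e_s = np.ones(NB)                  # critical band energies of the audio stream
spread = C @ e_s                   # simultaneous masking by frequency spreading
```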
The audio system 1 is preferably configured to compute at least one masking threshold T to be used in the voice optimization, i.e. in improving speech intelligibility. Depending on the application, e.g. headphones 30, there can be two or more masking thresholds
Based on the masking threshold
Voice optimization in this context refers to the process of calculating a frequency dependent gain, represented by a vector
The amplification of the audio stream 20 in each critical band b by √(gopt(b)) results in a voice optimized signal, i.e. the target audio stream 20′.
Note that the disclosure described herein performs simultaneous masking in frequency that is included in the model (matrix
As previously stated, the audio stream 20, ēs, is used to calculate the masking threshold T(b), against which the estimated internal noise 40′ is compared and deemed masked or not. The presentation of the voice optimized signal 20′, the target audio stream 20′, at the user's ear cavity via the transducer 35 may be impacted by several frequency dependent components, among which perhaps the most important is the position and fit of the headphones 30, with the result that the presented version of the target audio stream 20′ does not have the expected frequency content. The perceived optimized voice signal ês is part of the microphone signal 37 along with the estimated internal ambient noise 40′.
As opposed to ANC or noise reduction techniques, where the noise is measured by a combination of the external microphone 5 and the internal microphone 36, in the present invention the internal noise 40′, ēv, is estimated based on the external noise 40. Hence, the presented voice optimized signal ês can be estimated by, but not limited to, e.g. subtracting ēv from the internal sound 37. One further benefit of this is that it enables real-time adjustment of the voice optimization processing such that ês will converge towards ēs in some pre-defined measure, e.g. root-mean-square, and this can then account for e.g. a changing position or degree of fit of the headphones 30 and make the method 100 more robust and resilient.
In one non-limiting example, the amplification may be such that a gain is calculated which, when applied to the audio stream 20, would correspond to a resulting masking threshold
Using the desired goal function
W is a diagonal weighting matrix; the main diagonal in this example is populated by the frequency band weighting as given by a speech intelligibility index³. The optimization utilizes the frequency spreading explicitly in the process of deciding which frequencies shall be amplified or attenuated, with an importance weighting over frequencies.

² Boyd, S., & Vandenberghe, L. (2009). Convex Optimization. Cambridge University Press.

³ ANSI S3.5-1997 (1997). Methods for Calculation of the Speech Intelligibility Index. ANSI.
In another non-limiting example, the weighting matrix W may be populated based on the error between the target audio stream 20′ provided to the transducer 35 and the corresponding detected internal sound 37 provided by the internal microphone 36 after the internal noise 40′ has been removed, i.e. the internal noise 40′ estimated 120 is subtracted from the signal provided by the internal microphone 36. A suitable weighting matrix W may in this case be based on an error in the frequency domain, preferably in the auditory band domain, even more preferably such that appropriate weight values are in the range of [0-1], and most preferably normalized to e.g. a root-mean-square value of unity. This may e.g. be done using the correlation coefficient between ês and ēs, where high correlation, i.e. signals being highly similar, corresponds to a low weight, i.e. no focus on the error at this frequency band, and vice versa.
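A sketch of how such a weighting matrix W could be populated is given below. The per-band similarity measure is an assumption, as the disclosure leaves the exact correlation measure open:

```python
import numpy as np

def weighting_from_error(e_s, e_s_hat):
    """Populate the diagonal of W from per-band similarity between the
    target (e_s) and perceived (e_s_hat) band energies.

    The similarity measure is a simple stand-in for the correlation
    coefficient mentioned in the text: high similarity -> low weight,
    and the diagonal is normalized to an RMS value of one.
    """
    e_s, e_s_hat = np.asarray(e_s, float), np.asarray(e_s_hat, float)
    sim = np.minimum(e_s, e_s_hat) / np.maximum(e_s, e_s_hat)   # in [0, 1]
    w = 1.0 - sim                                               # error focus
    w /= np.sqrt(np.mean(w ** 2)) + 1e-12                       # RMS ~ 1
    return np.diag(w)

W = weighting_from_error([1.0, 0.5, 0.2], [0.9, 0.1, 0.2])
```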
The optimal gain
Observe that Hg(k) is updated once for every new frame l of the audio stream 20 and the internal noise 40′. Typically, an audio frame is relatively short; the number of samples of the audio frame may be less than the equivalent of a 100 ms duration. The human ear, on the other hand, has an integration time of up to 100-300 ms. In addition to this, when applying adaptive frequency adjustment, the listener must experience stability in the tonal balance of the audio stream 20; failure to accomplish this may result in discomfort for the user. Another aspect is the frequency variation in the voice optimized signal, i.e. the target audio stream 20′. Adjacent frequency bands must not differ too much in degree of adjustment, or an annoying sensation may arise. All of these properties are subjective and, after reading this disclosure, known to the skilled person.
The inventors behind this disclosure have realized that the gain of the voice filter 50 may be processed in order to mitigate the above subjective effects. In one embodiment, the gain of the voice filter 50 is smoothed in frequency by convolution with a frequency window function, e.g. a triangular window or similar, to assert that isolated frequency-bins have neither too high amplification nor too high attenuation compared to adjacent frequency-bins, i.e. the variation in gain between frequency-bins is limited. In one embodiment, the window can be set to a typical value of [0.15, 0.7, 0.15], i.e. after convolution the resulting gain in each frequency band consists of a 15/15 percent ratio of the adjacent bands and a 70 percent ratio of the current band. Typically, it may be unwise to include more than 3-5 critical bands in such a convolution operation, since each critical band is more independent of adjacent bands the further apart they are in frequency. In another additional or alternative embodiment, the gain of the voice filter 50 is averaged using an exponentially weighted moving average with weighting parameter Ti. The weighting parameter Ti may e.g. be selectable by the user or set to a fixed value corresponding to e.g. the integration time of the human ear, i.e. Ti = 0.3. This will effectively also slow down the update rate and therefore allow the user's hearing to adjust to the frequency coloration.
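The two gain post-processing steps might be sketched as follows; mapping the weighting parameter Ti to an EWMA coefficient via the frame period is one possible interpretation, not prescribed by the text:

```python
import numpy as np

def smooth_gain(g, window=(0.15, 0.70, 0.15)):
    # Convolve the per-band gain with the frequency window so that no band
    # deviates too much from its neighbours (15/70/15 percent mix).
    w = np.asarray(window)
    return np.convolve(g, w / w.sum(), mode="same")

def ewma(g_prev, g_new, frame_dt, Ti=0.3):
    # Exponentially weighted moving average with time constant Ti ~ the
    # integration time of the human ear (0.3 s); frame_dt is the frame period.
    alpha = 1.0 - np.exp(-frame_dt / Ti)
    return (1.0 - alpha) * g_prev + alpha * g_new

g = smooth_gain(np.array([1.0, 4.0, 1.0, 0.5, 2.0]))
g = ewma(np.ones(5), g, frame_dt=0.016)
```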
As is evident to the skilled person after learning of the method 100 according to the present disclosure, the method 100 is concerned with processing (filtering, changing) the audio stream 20 to increase speech intelligibility of the audio stream 20. The method 100 does not, in its most general form, comprise adding any inverse of the ambient noise 40 or estimated internal noise 40′. The method 100 rather changes (filters, processes) the audio stream 20 by the voice filter 50. The voice filter 50 is a filter adapted to increase speech intelligibility; it is not concerned with removing the ambient noise 40 or the estimated internal noise 40′, but rather with adapting the audio stream 20 to increase intelligibility of speech comprised in the audio stream 20 when listened to in noisy environments. Filtering the speech audio of the audio stream 20 such that its intelligibility is increased will change the frequency content of the speech audio. In other words, a voice of a person uttering the speech audio may appear foreign or distorted after being subjected to the voice filter, but the intelligibility of the speech is increased.
It should be emphasized that the gain adjustment of the voice filter 50 will be distributed over all frequencies of a source signal bandwidth due to the simultaneous masking and the energy constraint. Hence, at some frequencies, where the noise is masked, the audio source signal may be attenuated and vice versa. This phenomenon is illustrated in
If applicable, a hearing impairment compensation may be described as, but is not limited to, a filter in either the time or frequency domain that counteracts or mitigates the hearing impairments of the user. Hearing impairments may be described by the hearing profile HL(fk), a frequency function where fk indicates a set of discrete frequencies (typically a set of 5-7 frequencies is used), in units of Hearing Level dB per frequency, dB HL. The hearing profile HL(fk) is equal or equivalent to, but is not limited to, an audiogram that is the result of an audiology exam where a tone audiogram was conducted. No impairment corresponds to 0 dB HL, and increasing values, i.e. values larger than 0, indicate a hearing impairment or deficiency. The creation of a compensation to mitigate the hearing impairments is described later. In one embodiment, the hearing impairment compensation is defined in the frequency domain by a frequency function HHI(k), i.e. a filter constituting the compensation due to the user's hearing profile HL(fk), and can be applied 105 to the audio stream 20 before the voice optimization. Alternatively, it can be included in the voice optimization 130 without prior processing of the audio stream 20. As previously shown, HHI(k) can be grouped into a critical band representation and applied as a frequency-bin scaling to ēs, whereby it will be included in the resulting optimal gain Hg(k). A final frequency amplitude adjustment of the voice filter 50 is given by HHI(k)Hg(k) = Hvo(k). This means that, in low noise conditions, the voice enhancement provided by the voice filter 50 due to the ambient noise 40 may be a unity gain voice filter 50 over all frequencies, since the hearing impairment compensation provides sufficient speech intelligibility improvement.
In one embodiment, the user may, via e.g. a mixing setting, select the degree m to which the speech compensation to mask the ambient noise 40 shall be applied, for each frequency-bin and with m ∈ [0, . . . , 1], such that:
20 log10(|Hvo(k)|) = (1−m)·20 log10(|H(k)|) + m·20 log10(|Hg(k)|) dB, where m = 0 corresponds to no frequency adjustments due to the background noise. Noteworthy is that the phase response is preserved as is, i.e.
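A small sketch of this mixing in the log-magnitude domain; here h_base stands in for H(k) in the equation above and is assumed to be the frequency response before any noise-dependent adjustment:

```python
import numpy as np

def mix_gain(h_base, h_g, m):
    # |Hvo| interpolated in the log-magnitude (dB) domain; phase untouched.
    hvo_db = ((1.0 - m) * 20 * np.log10(np.abs(h_base))
              + m * 20 * np.log10(np.abs(h_g)))
    return 10 ** (hvo_db / 20)

h = mix_gain(np.ones(4), np.array([2.0, 1.5, 0.8, 1.0]), m=0.5)
```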
In addition to compensating for the hearing profile HL(fk) of the user, the present invention may optionally be combined with a volume dependent compensation, VDC. As explained, the audio stream 20 comprises a frequency spectrum, and this frequency spectrum of the audio signal will be perceived differently at different playback sound pressure levels. This can be seen when comparing equal loudness contours 300, see
A number of equal loudness contours 300 are shown in
In
Additionally, or alternatively to the hearing profile filtering 105, the method 100 may further comprise determining 136 a playback phon based on a playback volume 15 of the audio stream and/or the detected internal sound 37.
As stated above, if the internal noise 40′ is below the masking threshold T for all frequencies, then no voice enhancement is necessary due to background noise. This can be the result of e.g. low ambient noise environments, or of the hearing impairment compensation having resulted in a signal level where the corresponding threshold T is above the internal ambient noise 40′.
In the following embodiment, the hearing impairment compensation is calculated based on the user's hearing profile HL(fk). Utilizing what is known as the Speech Intelligibility Index Count-the-Dots Audiogram form, a further optimization problem may be formed. Given a hearing profile HL(fk), it is desired to provide a filter HHI(k) that adjusts the hearing threshold to maximize a standardized Articulation Index, AI, or Speech Intelligibility Index, SII, as given by the count-the-dots audiogram. The AI is calculated as the number of dots below the hearing profile HL(fk) when the hearing profile HL(fk) is plotted on the defined diagram; this is illustrated by the dashed line in
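Conceptually, the AI computation reduces to counting audible dots. In the sketch below the dot positions are random placeholders, since the real positions are defined by the published count-the-dots form; on the audiogram's inverted level axis, a dot plotted below the profile line has a level above threshold and is therefore audible:

```python
import numpy as np

rng = np.random.default_rng(6)
# Placeholder dots: (frequency Hz, level dB HL) pairs standing in for the
# dots of the published form (hypothetical positions, for illustration only).
dots = np.column_stack([rng.uniform(250, 8000, 100),
                        rng.uniform(0, 60, 100)])

def articulation_index(hl_freqs, hl_values, dots):
    # AI ~ fraction of audible dots: a dot whose level exceeds the
    # interpolated hearing threshold plots "below" the profile line.
    hl_at_dot = np.interp(dots[:, 0], hl_freqs, hl_values)
    return np.mean(dots[:, 1] > hl_at_dot)

ai = articulation_index([250, 500, 1000, 2000, 4000, 8000],
                        [10, 10, 20, 30, 40, 45], dots)
```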
In one embodiment, this can be described as an optimization formulation where a set of gains hHI(k) as a function of frequency-bins is optimized to amplify/attenuate and to redistribute the energy of the audio stream such that the intelligibility index is maximized,
The filter HHI(k) = hHI(k) is consequently created from the necessary gains at frequencies k, such that the total energy change in the target audio stream 20′ caused by the resulting filter is equal to γ. In one embodiment, γ = 1 corresponds to an energy redistribution as exemplified in
The VDC is, as mentioned above, in one optional embodiment implemented using a pre-calibrated table, making it a stationary method which does not change the filter HHI(k) based on the dynamics of the audio stream 20. Consequently, the filter HHI(k) is only updated if the volume setting, i.e. the playback volume 15 of the audio device 30, is changed. In one embodiment, the pre-calibrated table contains scaling factors a(k) for each volume step and frequency-bin that are applied to HHI(k), e.g. a(k)HHI(k). The size of the pre-calibrated table depends on the number of volume steps and the number of frequency-bins used in the calculations.
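A sketch of such a stationary VDC lookup is shown below; the table values and dimensions are placeholders, since a real table would be produced by calibration of the specific audio device 30:

```python
import numpy as np

# Hypothetical pre-calibrated VDC table: one row of scaling factors a(k)
# per volume step, one column per frequency-bin (placeholder values).
vdc_table = np.array([
    [1.00, 1.00, 1.00, 1.00],   # volume step 0
    [1.10, 1.05, 1.00, 0.95],   # volume step 1
    [1.25, 1.10, 1.00, 0.90],   # volume step 2
])

def apply_vdc(h_hi, volume_step):
    # Stationary VDC: only re-applied when the playback volume changes.
    return vdc_table[volume_step] * h_hi

h = apply_vdc(np.array([1.0, 1.2, 0.9, 1.0]), volume_step=2)
```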
In one embodiment, the VDC computes two hearing compensations in the frequency domain, one for each ear. In an additional embodiment, it may be configured to combine the left ear compensation and the right ear compensation in order to provide a single hearing compensation suitable for both ears. This is important when playback utilizes several transducers and the sound from each transducer can physically reach each ear of the listener.
In another embodiment of an implementation of the VDC, which may very well be combined with the embodiment above, the feedback microphone signal is used, making it a dynamic method that updates the filter HHI(k) based on the levels of the target audio stream 20′ as it is played by the transducer 35. This approach calculates the compensation more often. In order to avoid sharp transients in the audio signal, the update rate of the filter HHI(k) may be kept low, about 0.25-1 Hz.
Although the count-the-dots audiogram is provided under the assumption that the signal or speech level is 60 dB SPL, the result and method 100 herein may very well be correctly scaled with the sound pressure level. As a non-limiting example, a 10 dB increase in volume, corresponding to 10 dB SPL, would correspond to the dotted line of
Other methods of scaling the results and adjusting for the hearing impairments based on the count-the-dots audiogram and the teachings herein should now be apparent to a person skilled in the art.
As previously mentioned, the inventors behind this disclosure have realized that the internal noise 40′ may be accurately modelled based on the ambient noise 40 by means of machine learning. The internal noise 40′ is known from ANC technology to be the external noise 40 filtered by the primary (acoustic) path, where the latter describes the impact on the external noise 40 when propagating from the exterior of the headphones 30 into the ear cavity. The primary path is the important (unknown) noise transfer function that must be found in real-time and at a high degree of accuracy for the ANC to operate correctly, i.e. for the ANC technology to form the correct anti-noise that cancels (and therefore attenuates) the internal noise 40′. The real-time and accuracy requirements on ANC technology typically dictate that ANC technology is executed in dedicated hardware.
For the present invention, the real-time and accuracy requirements when estimating the internal noise 40′ are much lower compared to ANC technology, as will be described later. Further to this, no dedicated hardware is necessary. Many of the same aspects, known to a person skilled in the art, still exist, e.g. the estimation of the internal noise 40′ must exclude echo from the near-end audio stream when rendered by the transducers 35, and instead of complex real-time adaptive filtering and calibrations (as for ANC), a neural network is used to model the primary path, including the separation between noise, echo and near-end talker.
The neural network of a preferred embodiment is an RNN. RNNs are typically based on Long Short-Term Memory, LSTM, or Gated Recurrent Units, GRU. In general, a feature vector and an output vector of the RNN may be selected in numerous ways, and the selection of the two, the size of the RNN, the training data quality and the training of the RNN will determine the RNN's ability to output the desired output data given the input, i.e. the feature data.
The size of the RNN is set by the training result and the resource constraints posed by the implementation in a real-time audio system. Generally, the size, i.e. the number of units and the number of hidden layers, of the RNN is a design selection, as is the choice of the number of points in an FFT calculation. A typical size of the RNN is e.g. 200-300 units with 3-4 hidden layers. The computational demand of the RNN can be decreased e.g. by selecting a lower-order RNN and accepting an increased error on the output, and/or by skipping and structured pruning of RNN units.
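For concreteness, a GRU-based estimator in this size range could be declared as follows. PyTorch is an illustrative choice, and the feature dimension of 42 and the NB = 20 band outputs are assumptions carried over from elsewhere in this disclosure:

```python
import torch
import torch.nn as nn

class NoiseEstimator(nn.Module):
    """Sketch of a GRU-based internal-noise estimator.

    Sizes follow the text (a few hundred units, 3-4 hidden layers);
    feature and output dimensions are assumed, not prescribed.
    """
    def __init__(self, n_features=42, hidden=256, layers=3, nb=20):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, num_layers=layers,
                          batch_first=True)
        self.out = nn.Linear(hidden, nb)   # Ev(b) per critical band

    def forward(self, features):
        h, _ = self.rnn(features)          # (batch, frames, hidden)
        return self.out(h)                 # (batch, frames, NB)

est = NoiseEstimator()
ev = est(torch.randn(1, 100, 42))          # 100 sub-frames of features
```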
It should be noted that the absolute level of the signal provided by the external microphone 5 is important for the application. Hence, an approach to data augmentation may be adopted in which each training example is pre-filtered with a randomly generated second-order filter, thereby training the RNN to be robust against frequency variations due to external microphone 5 frequency response tolerances and external microphone 5 placement variations. The levels of the individual signals in the training examples are preferably varied for robustness against level variations.
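One way to realize this augmentation, sketched with scipy; the pole and zero radii and the ±12 dB level range are assumed values:

```python
import numpy as np
from scipy.signal import lfilter

def random_biquad(rng):
    # Random stable second-order filter: a conjugate pole pair and a
    # conjugate zero pair inside the unit circle, standing in for
    # microphone response and placement tolerances.
    r_p, r_z = rng.uniform(0.2, 0.9, 2)
    th_p, th_z = rng.uniform(0, np.pi, 2)
    b = np.poly([r_z * np.exp(1j * th_z), r_z * np.exp(-1j * th_z)]).real
    a = np.poly([r_p * np.exp(1j * th_p), r_p * np.exp(-1j * th_p)]).real
    return b, a

def augment(example, rng):
    b, a = random_biquad(rng)
    gain = 10 ** (rng.uniform(-12, 12) / 20)   # level variation (assumed range)
    return gain * lfilter(b, a, example)

x_aug = augment(np.random.default_rng(3).standard_normal(16000),
                np.random.default_rng(4))
```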
It is apparent that the above RNN description is but one non-limiting example provided as one working example of the RNN. A person skilled in the art can surely, after digesting the teachings herein, devise a different example, varying the feature set, the output set and/or the training, e.g. data set, loss function, optimization process etc.
Feature extraction for the RNN may be guided by speech recognition theory, and the following embodiments are to be considered non-exhaustive examples of feature extractions that may be combined with one another in any order or set imaginable.
In one embodiment, the features comprise a discrete cosine transform of the logarithm of the energy per critical band of the signal provided by the microphone 5. In one embodiment, the features comprise the spectrum of the signal provided by the microphone 5, represented in critical bands as described in the present disclosure. In a further embodiment, the features further comprise an average energy of the entire sub-frame. In one embodiment, the features comprise a delta change in amplitude between the current sub-frame and the previous sub-frame cepstral logarithmic coefficients covering up to at least 600 Hz, preferably up to at least 1000 Hz, thereby including, with high certainty, the vocal fundamentals of typical voices. In one embodiment, the features comprise a binary signal stating whether a non-noise source signal is active or not, e.g. a simple level detector stating whether the user is talking or echo from the transducer is present.
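The listed features could be combined into one vector per sub-frame along the following lines; this is an illustrative combination, and the activity threshold is an assumption:

```python
import numpy as np
from scipy.fft import dct

def features_from_band_energies(e_mic, e_prev, frame):
    """One feature vector per sub-frame (illustrative combination).

    e_mic: critical band energies of the external microphone signal,
    e_prev: the previous sub-frame's energies (for the delta feature),
    frame: the time-domain sub-frame. The 1e-3 activity threshold is
    an assumption, not taken from the disclosure.
    """
    log_e = np.log(e_mic + 1e-12)
    cepstral = dct(log_e, type=2, norm="ortho")    # DCT of log band energies
    delta = log_e - np.log(e_prev + 1e-12)         # inter-frame delta
    avg_energy = np.mean(frame ** 2)               # whole sub-frame energy
    active = float(avg_energy > 1e-3)              # simple level detector
    return np.concatenate([cepstral, delta, [avg_energy, active]])

f = features_from_band_energies(np.ones(20), 0.5 * np.ones(20),
                                np.zeros(256))
```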
The outputs of the RNN typically comprise the average energy values Ev(b), b = 1 . . . NB, of the internal noise 40′ for each auditory filter band. In one embodiment, the outputs of the RNN also comprise, in addition to the above, a binary signal indicating noise or not-noise, a second binary signal indicating low or high level, and a third binary signal indicating that a near-end signal is active. Additional signals like the binary signals described might not be directly used in the algorithm that calculates the optimal voice filter 50, but a person skilled in the art will easily recognize that relevant outputs such as those described can help to obtain a better result when training the RNN.
In training the RNN, one important aspect is the generalization capabilities of the RNN, such that it will operate correctly even in conditions not used during the training. Therefore, training examples are preferably made up of a combination of background noise, e.g. full-size car at 100 km/h and Cafeteria⁵, echo signal and near-end talk, with varying levels and filtrations as stated above. Echo signal and near-end talk signal are preferably independent, i.e. the same utterances do not exist simultaneously.

⁵ ETSI. (2012). Speech and multimedia Transmission Quality (STQ); Speech quality performance in the presence of background noise; Part 1: Background noise simulation technique and background noise database 202 396-1. ETSI.
In one embodiment, the ground truth of the training is constituted by the noise-only power per auditory filter, based on the spectrum of the noise signal only, divided into auditory filters, and referenced (measured) in the headphone-and-ear cavity. This will therefore include the primary path, i.e. from outside the headphone to inside the headphone-and-ear cavity. This is important since the headphones have at least a high-frequency attenuation of the noise due to the acoustic seal when worn (depending on the headphone type: in-ear, over-ear or on-ear). If the headphones also have Active Noise Cancellation (typically operative in the frequency range of 150-900 Hz), the noise outside the headphones is much different from the noise inside the cavity.
A system that can facilitate the presentation of noise, near-end speech and echo and at the same time record the internal noise 40′ (ground truth) is industry standard, and the process is fully automated. In one non-limiting example, a scenario is started where background noise is rendered from a multi-speaker setup in a measurement chamber; the headphones under test are situated on a head-and-torso simulator that records the internal noise as the signal that reaches the microphone placed in each ear-simulator. Simultaneously, the background noise is recorded by the external microphone 5 on the headphones 30. After the scenario is completed, each signal is conditioned, time adjusted and converted into either a feature set or a ground truth set.
In summary, the usage of machine learning and recurrent neural networks in the modelling of the internal noise 40′ results in a noise estimation in an auditory band model, the removal of near-end talk and echo without using complex voice activity detection or echo cancellation, and a model of the primary path from outside the headphone to inside the headphone-and-ear cavity.
Several detailed implementations of the different aspects of the voice filter 50 are presented throughout this disclosure. Regardless of how the voice filter 50 is determined, the voice filter 50, described above as Hvo(k), is applied 140 to the digital source signal, the audio stream 20. There may be a multitude of voice filters 50, each voice filter 50 providing a target audio stream 20′ to be rendered on a transducer 35. There are several approaches to processing the audio stream 20, and the person skilled in the art will see, after reading this disclosure, several others than the following two examples. The voice filter 50 may, in one embodiment, be applied 140 by e.g. conversion of the frequency function to a finite impulse response filter; if the phase response is less important, it can be a symmetric impulse response filter resulting in linear phase. The voice filter 50 may, in another embodiment, be applied 140 by multiplication in the frequency domain using an overlap-and-add method to avoid circular convolution when multiplying two frequency functions.
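A sketch of the second approach, a windowed overlap-and-add application of the voice filter in the frequency domain; the window type, sizes and the omission of explicit zero-padding are simplifying assumptions:

```python
import numpy as np

def apply_voice_filter_ola(x, Hvo, N, hop):
    """Apply the voice filter by frequency-domain multiplication with
    windowed overlap-and-add; Hvo has N//2 + 1 bins matching an rfft
    of length N. Zero-padding against circular effects is omitted for
    brevity and would be added in a production implementation."""
    win = np.hanning(N)
    y = np.zeros(len(x) + N)
    for i in range(0, len(x) - N + 1, hop):
        Xf = np.fft.rfft(x[i:i + N] * win)
        y[i:i + N] += np.fft.irfft(Xf * Hvo, n=N)
    return y[:len(x)]

x = np.random.default_rng(5).standard_normal(4096)
y = apply_voice_filter_ola(x, np.ones(257), N=512, hop=256)
```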
In one preferred embodiment, the audio device 30 comprises at least one voice filter 50 per transducer 35.
In one embodiment, the voice filter 50 providing the target audio stream 20′ is energy normalized. This may lead to high peak amplitudes in the time domain signal. In a further embodiment, the target audio stream 20′ is attenuated to ensure that the signal amplitude of the target audio stream 20′ is not too high for the final signal format. The signal amplitude may then be transformed into the correct format without distortion using e.g. a standard limiter or Dynamic Range Controller, DRC. It should be noted that no additional processing apart from controlling the signal amplitude is required. The limiter and DRC can be other components of the digital audio system and are preferably included for the sake of hearing safety.
With reference to
In one embodiment of the method 100, the step of determining 130 the voice filter 50 comprises subtracting 132 the estimated internal noise 40′ from the detected internal sound 37. This will provide a true audio stream 37′ that is what the target audio stream 20′ actually sounds like at the ear of the user. Based on a difference between the target audio stream 20′ and the true audio stream 37′, it is consequently possible to update 138 the voice filter 50. This effectively creates a control loop making it possible to ensure that the target audio stream 20′ is actually what the user hears. This is beneficial as it enables the voice filter to be updated based on e.g. how the audio device is worn by the user and how well the audio device fits at the ear of the user.
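A minimal sketch of the resulting control loop in the critical band domain; the multiplicative update and the step size mu are assumptions, since the disclosure specifies the principle rather than a particular update rule:

```python
import numpy as np

def update_gain(g, e_target, e_true, mu=0.1):
    """Update the per-band voice filter gain from the difference between
    the target (e_target) and true (e_true) band energies; mu is an
    assumed step size. A multiplicative update keeps the gain positive."""
    err_db = 10 * np.log10((e_target + 1e-12) / (e_true + 1e-12))
    return g * 10 ** (mu * err_db / 20)

e_internal = np.array([1.2, 0.8, 0.6])       # detected internal sound (bands)
e_noise = np.array([0.2, 0.1, 0.1])          # estimated internal noise
e_true = e_internal - e_noise                # true audio stream (subtraction)
g = update_gain(np.ones(3), np.array([1.1, 0.9, 0.4]), e_true)
```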
In one embodiment of the method 100, the step of determining 130 the voice filter 50 comprises comparing 134 the estimated internal noise 40′ to one or more masking thresholds T. This substantially corresponds to comparing 134 the densely dashed line, the estimated internal noise 40′, to the dashed line, the masking threshold T, of
In one embodiment of the method 100, it is configured to compensate for the playback volume 15 of the audio device 30 such as described in reference to
The present invention will, in addition to solving the previously presented problem, provide increased speech intelligibility substantially regardless of how a user of the audio device 30 chooses to carry the audio device 30. Typically, the transducers 35 of an audio device are configured to work at a certain load. This load is in the form of the air cavity between the user and the transducer 35. If the audio device 30 is e.g. a pair of closed headphones, the air cavity is formed by the headphones 30 being carried firmly and tightly about the outer ear of the user. However, as not all ears are the same, and not all users carry their audio devices 30 in the same way, the load of the transducer 35, and thereby the sound of the audio device 30, will differ between users. The present invention solves also this problem by detecting 160 the internal sound 37, which will differ depending on how the audio device 30 is worn.
Number | Date | Country | Kind
---|---|---|---
2150611-8 | May 2021 | SE | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/SE2022/050461 | 5/11/2022 | WO |