This application claims priority to EP Application Serial No. 15150040 filed Jan. 2, 2015, the disclosure of which is hereby incorporated in its entirety by reference herein.
The disclosure relates to a sound zone arrangement with speech suppression between at least two sound zones.
Active noise control may be used to generate sound waves or “anti-noise” that destructively interferes with non-useful sound waves. The destructively interfering sound waves may be produced through a loudspeaker to combine with the non-useful sound waves in an attempt to cancel the non-useful noise. Combination of the destructively interfering sound waves and the non-useful sound waves can eliminate or minimize perception of the non-useful sound waves by one or more listeners within a listening space.
An active noise control system generally includes one or more microphones to detect sound within an area that is targeted for destructive interference. The detected sound is used as a feedback error signal. The error signal is used to adjust an adaptive filter included in the active noise control system. The filter generates an anti-noise signal used to create destructively interfering sound waves. The filter is adjusted so that the destructively interfering sound waves optimize cancellation according to a target within a certain area called a sound zone or, in the case of full cancellation, a quiet zone. Closely disposed sound zones in particular, as in vehicle interiors, make it more difficult to optimize cancellation, i.e., to establish acoustically fully separated sound zones, particularly in terms of speech. In many cases, a listener in one sound zone may be able to listen to a person talking in another sound zone although the talking person does not intend or desire that another person participate. For example, a person on the rear seat of a vehicle (or on the driver's seat) may want to make a confidential telephone call without involving another person on the driver's seat (or on the rear seat). Therefore, a need exists to optimize speech suppression between at least two sound zones in a room.
A sound zone arrangement includes a room including a listener's position and a speaker's position, a multiplicity of loudspeakers disposed in the room, a multiplicity of microphones disposed in the room, and a signal processing module. The signal processing module is connected to the multiplicity of loudspeakers and to the multiplicity of microphones. The signal processing module is configured to establish, in connection with the multiplicity of loudspeakers, a first sound zone around the listener's position and a second sound zone around the speaker's position, and to determine, in connection with the multiplicity of microphones, parameters of sound conditions present in the first sound zone. The signal processing module is further configured to generate in the first sound zone, in connection with the multiplicity of loudspeakers, and based on the determined sound conditions in the first sound zone, speech masking sound that is configured to reduce common speech intelligibility in the second sound zone.
A method for arranging sound zones in a room including a listener's position and a speaker's position with a multiplicity of loudspeakers disposed in the room and a multiplicity of microphones disposed in the room includes establishing, in connection with the multiplicity of loudspeakers, a first sound zone around the listener's position and a second sound zone around the speaker's position, and determining, in connection with the multiplicity of microphones, parameters of sound conditions present in the first sound zone. The method further includes generating in the first sound zone, in connection with the multiplicity of loudspeakers, and based on the determined sound conditions in the first sound zone, speech masking sound that is configured to reduce common speech intelligibility in the second sound zone.
Other systems, methods, features and advantages will be or will become apparent to one with skill in the art upon examination of the following detailed description and figures. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention and be protected by the following claims.
The system may be better understood with reference to the following description and drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like referenced numerals designate corresponding parts throughout the different views.
For example, multiple-input multiple-output (MIMO) systems allow for generating, in any given space, virtual sources or reciprocally isolated acoustic zones, in this context also referred to as “individual sound zones” (ISZ) or just sound zones. Creating individual sound zones has attracted greater attention not only because of the possibility of providing different acoustic sources in diverse areas, but especially because of the prospect of conducting speakerphone conversations in an acoustically isolated zone. For the distant (or remote) speaker of a telephone conversation this is already possible using present-day MIMO systems without any additional modifications, as these signals already exist in electrical or digital form. The signals produced by the near speaker (the speaker present in the room), however, present a greater challenge, as these signals must be received by a microphone and stripped of music, ambient noise (also referred to as background noise) and other disruptive elements before they can be fed into the MIMO system and passed on to the corresponding loudspeakers.
At this point the MIMO systems, in combination with the loudspeakers, produce a wave field which generates, at specific locations, acoustically illuminated (enhanced) zones, so-called bright zones, and in other areas, acoustically darkened (suppressed) zones, so-called dark zones. The greater the acoustic contrast between the bright and dark zones, the more effective the cross talk cancellation (CTC) between the particular zones will be and the better the ISZ system will perform. Besides the aforementioned difficulties involving extracting the near-speaker's voice signal from the microphone signal(s), an additional problem is the time available for processing the signal, in other words: the latency.
Ideal conditions may be assumed, for example, when the near-speaker uses a mobile telephone and talks directly into the microphone, and when loudspeakers are positioned in the headrests at places where the near-speaker's voice signal should not be audible or, at the very least, understandable. Even then, the distance between the near-speaker and the listener in a luxury-class vehicle is approximately x≦1.5 m which, at a sound velocity of c=343 m/s at a temperature of T=20° C., results in a maximum processing time of approximately t≦4.4 ms. Within this time span everything must be completed; that means the signal must be received, processed and reproduced.
Even the latency that arises over a Bluetooth Smart Technology connection is at t=6 ms already considerably longer than the available processing time. When headrest loudspeakers are employed, an average distance from the speakers to the ears of approximately x=0.2 m can be assumed, and even here a signal processing time of only t<4 ms is available, which may be regarded as a sufficient, but at any rate critical amount of time. And even if enough processing time were at hand to isolate the voice signal from the microphone of the near-speaker and to feed it into a MIMO system, this would not make it possible to accomplish the given task.
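The timing figures above can be checked with a short calculation. The following Python sketch is illustrative only; it assumes that the available processing time equals the acoustic travel time over the direct path from the near-speaker to the listener minus the travel time from the headrest loudspeaker to the ear, which is one plausible reading of the figures given. The function names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # sound velocity at 20 degrees C, as stated in the text


def acoustic_delay_ms(distance_m: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Travel time of sound over distance_m, in milliseconds."""
    return 1000.0 * distance_m / c


def processing_budget_ms(talker_to_listener_m: float,
                         loudspeaker_to_ear_m: float,
                         c: float = SPEED_OF_SOUND_M_S) -> float:
    """Assumed budget: direct-path delay minus loudspeaker-to-ear delay."""
    return (acoustic_delay_ms(talker_to_listener_m, c)
            - acoustic_delay_ms(loudspeaker_to_ear_m, c))


direct_ms = acoustic_delay_ms(1.5)             # direct path of 1.5 m
budget_ms = processing_budget_ms(1.5, 0.2)     # headrest loudspeaker 0.2 m from the ear
bluetooth_latency_ms = 6.0                     # latency figure quoted in the text

print(f"direct path delay: {direct_ms:.2f} ms")
print(f"processing budget: {budget_ms:.2f} ms")
print(f"Bluetooth latency exceeds budget: {bluetooth_latency_ms > budget_ms}")
```

Under these assumptions the direct path delay is about 4.4 ms and the remaining budget is just under 4 ms, so a 6 ms transmission latency alone already exceeds it.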
Basically, the overall performance, i.e., the degree and also the bandwidth of the CTC of a MIMO system, depends on the distance from the loudspeakers to the areas into which the desired wave field should be projected (e.g., ear positions). Even when loudspeakers are positioned in the headrests, which in reality probably represents one of the best options, i.e., the shortest possible distance from the loudspeakers to the ears, it is only possible to achieve a maximum CTC bandwidth of f≦2 kHz. This means that, even under the best of conditions and assuming sufficient cancellation of the near-speaker's voice signal at the driver's seat, a MIMO or ISZ system can be expected to provide a cancellation bandwidth of only ≦2 kHz.
However, a voice signal that lies above this frequency still typically possesses so much energy, or informational content, that even speech that is restricted to frequencies above this bandwidth can easily be understood. In addition to this, the natural acoustic masking generally brought about by the ambient noise in a motor vehicle, e.g. road and motor noise, is hardly effective at frequencies above 2 kHz. If looked at realistically, the attempt to achieve a sufficient CTC between the loudspeaker and the ambient space in which a voice should be rendered, at the very least, incomprehensible by using an ISZ system would not be successful.
The approach described herein provides for projecting a masking signal of sufficient intensity and spectral bandwidth into the area in which the telephone conversation should not be understood for the duration of the call, so that at least the voice signal of the near-speaker (sitting, for example, on the driver's seat) cannot be understood. Both the near-speaker's voice signal and the voice signal of the distant speaker may be used to control the masking signal. In addition, another sound zone may be established around a communications terminal (such as a cellular telephone) used by the speaker in the vehicle interior. This additional sound zone may be established in the same or a similar manner as the other sound zones. Regardless of which signal (or signals) is used to control the (electrical) masking signal, the employed signal should in no case cause disturbance at the position of the near-speaker; he or she should be left completely, or at least to the greatest extent possible, undisturbed by or unaware of the (acoustic) masking sound based on the masking signal. However, the masking signal (or signals) should be able to reduce speech intelligibility to a level where, for example, a telephone conversation in one sound zone cannot be understood in another sound zone.
Speech Transmission Index (STI) is a measure of speech transmission quality. The STI measures some physical characteristics of a transmission channel, and expresses the ability of the channel to carry across the characteristics of a speech signal. STI is a well-established objective measurement predictor of how the characteristics of the transmission channel affect speech intelligibility. The influence that a transmission channel has on speech intelligibility may be dependent on, for example, the speech level, frequency response of the channel, non-linear distortions, background noise level, quality of the sound reproduction equipment, echoes (e.g., reflections with delays of more than 100 ms), the reverberation time, and psychoacoustic effects (such as masking effects).
More precisely, the speech transmission index (STI) is an objective measure based on the weighted contribution of a number of frequency octave bands within the frequency range of speech. Each frequency octave band signal is modulated by a set of different modulation frequencies to define a complete matrix of differently modulated test signals in different frequency octave bands. A so-called modulation transfer function, which defines the reduction in modulation, is determined separately for each modulation frequency in each octave band, and subsequently the modulation transfer function values for all modulation frequencies and all octave bands are combined to form an overall measure of speech intelligibility. It also has been recognized that there is a benefit in moving from subjective evaluation of the intelligibility of speech in a region toward a more quantitative approach which, at the very least, provides a greater degree of repeatability.
A standardized quantitative measure of speech intelligibility is the Common Intelligibility Scale (CIS). Various machine-based methods such as Speech Transmission Index (STI), Speech Transmission Index Public Address (STI-PA), Speech Intelligibility Index (SII), Rapid Speech Transmission Index (RASTI), and Articulation Loss of Consonants (ALCONS) can be mapped to the CIS. These test methods have been developed for use in evaluating speech intelligibility automatically and without any need for human interpretation of the speech intelligibility. For example, the Common Intelligibility Scale (CIS) is based on a mathematical relation with STI according to CIS=1+log (STI). It is understood that the common speech intelligibility is sufficiently reduced if the level is below 0.4 on the common intelligibility scale (CIS).
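The relation between STI and CIS can be illustrated numerically. The sketch below assumes that the logarithm in CIS=1+log (STI) is the base-10 logarithm and that STI lies in the range 0 < STI ≤ 1; the function names are hypothetical and serve only to show what the 0.4 limit implies in terms of STI.

```python
import math

CIS_LIMIT = 0.4  # intelligibility is regarded as sufficiently reduced below this value


def cis_from_sti(sti: float) -> float:
    """Map an STI value to the Common Intelligibility Scale (base-10 log assumed)."""
    if not 0.0 < sti <= 1.0:
        raise ValueError("STI is expected in the range (0, 1]")
    return 1.0 + math.log10(sti)


def sti_from_cis(cis: float) -> float:
    """Inverse mapping of cis_from_sti."""
    return 10.0 ** (cis - 1.0)


def is_sufficiently_masked(sti: float, limit: float = CIS_LIMIT) -> bool:
    """True when the CIS value derived from the STI falls below the limit."""
    return cis_from_sti(sti) < limit


sti_limit = sti_from_cis(CIS_LIMIT)
print(f"CIS < {CIS_LIMIT} corresponds to STI < {sti_limit:.3f}")
```

Under this assumption a CIS limit of 0.4 corresponds to an STI of roughly 0.25.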
Referring to
The signal processing module 104 includes, for example, a MIMO system 110 that is connected to the multiplicity of loudspeakers 102, the multiplicity of microphones 103, the masking noise mn(n), and a useful signal source such as a stereo music signal x(n) providing stereo signal source 111. MIMO systems may include a multiplicity of outputs (e.g., output channels for supplying output signals to a multiplicity of groups of loudspeakers) and a multiplicity of (error) inputs (e.g., recording channels for receiving input signals from a multiplicity of groups of microphones, and other sources). A group includes one or more loudspeakers or microphones that are connected to a single channel, i.e., one output channel or one recording channel. It is assumed that the corresponding room or loudspeaker-room-microphone system (a room in which at least one loudspeaker and at least one microphone is arranged) is linear and time-invariant and can be described by, e.g., its room acoustic impulse responses. Furthermore, a multiplicity of original input signals such as the useful (stereo) input signals x(n) may be fed into (original signal) inputs of the MIMO system. The MIMO system may use, for example, a multiple error least mean square (MELMS) algorithm for equalization, but may employ any other adaptive control algorithm such as a (modified) least mean square (LMS), recursive least square (RLS), etc. Useful signal(s) x(n) may be filtered by a multiplicity of primary paths, which are represented by a primary path filter matrix on its way from one of the multiplicity of loudspeakers 102 to the multiplicity of microphones 103 at different positions, and provides a multiplicity of useful signals d(n) at the end of the primary paths, i.e., at the multiplicity of microphones 103. In the exemplary arrangement shown in
The signal processing module 104 further includes, for example, an acoustic echo cancellation (AEC) system 112. In general, acoustic echo cancellation can be attained, e.g., by subtracting an estimated echo signal from the useful sound signal. To provide an estimate of the actual echo signal, algorithms have been developed that operate in the time domain and that may employ adaptive digital filters processing time-discrete signals. Such adaptive digital filters operate in such a way that the network parameters defining the transmission characteristics of the filter are optimized with reference to a preset quality function. Such a quality function is realized, for example, by minimizing the average square errors of the output signal of the adaptive network with reference to a reference signal. Other AEC modules are known that are operated in the frequency domain. In the exemplary arrangement shown in
AEC module 112 receives output signals MicL(n,k) and MicR(n,k) of two microphones 103a and 103b of the multiplicity of microphones 103, wherein these particular microphones 103a and 103b are arranged in the vicinity of two particular loudspeakers 102a and 102b of the multiplicity of loudspeakers 102. The loudspeakers 102a and 102b may be disposed in the headrests of a (vehicle) seat in the room (e.g., the interior of a vehicle). The output signal MicL(n,k) may be the sum of a useful sound signal SL(n,k), a noise signal NL(n,k) representing the ambient noise present in the room 101 and a masking signal ML(n,k) representing the masking signal based on the masking noise signal mn(n). Accordingly, the output signal MicR(n,k) may be the sum of a useful sound signal SR(n,k), a noise signal NR(n,k) representing the ambient noise present in the room 101 and a masking signal MR(n,k) representing the masking signal based on the masking noise signal mn(n). AEC module 112 further receives the stereo signal x(n) and the masking signal mn(n), and provides an error signal E(n,k), an output (stereo) signal PF(n,k) of an adaptive post filter within the AEC module 112 and a (stereo) signal M̃(n,k) representing the estimate of the echo signal(s) of the useful signal(s). It is understood that ambient/background noise includes all types of sound other than the speech sound to be masked, so that ambient/background noise may include noise generated by the vehicle, music present in the interior and even speech sound of other persons who do not participate in the communication in the speaker's sound zone. It is further understood that no further masking sound is needed if the ambient/background noise provides sufficient masking.
The signal processing module 104 further includes, for example, a noise estimation module 113, noise reduction module 114, gain calculation module 115, masking modeling module 116, and masking signal calculation module 117. The noise estimation module 113 receives the (stereo) error signal E(n,k) from AEC module 112 and provides a (stereo) signal Ñ(n,k) representing an estimate of the ambient (background) noise. The noise reduction module 114 receives the output (stereo) signal PF(n,k) from AEC module 112 and provides a signal S̃(n,k) representing an estimate of the speech signal as perceived at the listener's ear positions. Signals M̃(n,k), S̃(n,k) and Ñ(n,k) are supplied to the gain calculation module 115, which is also supplied with a signal I(n) and which supplies the power spectral density P(n,k) of the near speaker's speech signals as perceived at the listener's ear positions, based on the signals M̃(n,k), S̃(n,k) and Ñ(n,k), to the masking modeling module 116. As an alternative or in addition to the masking model, a common intelligibility model may be used. The masking modeling module 116 provides a signal G(n,k) which represents the masking threshold of the power spectral density P(n,k) of the estimated near speaker's speech signals as perceived at the listener's ear positions, exhibiting the magnitude frequency response of the desired masking signal. By combining signal G(n,k) with a white noise signal wn(n), which is provided by white noise source 105 and which delivers the phase frequency response of the desired masking signal, the masking signal mn(n) is generated in masking signal calculation module 117 and is then, inter alia, provided to the MIMO system 110. The signal processing module 104 further includes, for example, a switch control module 118, which receives the output signals of the multiplicity of microphones 103 and a signal DesPosIdx, and which provides the signal I(n).
In a room, which, in the present example, is the cabin of a motor vehicle, a multitude of loudspeakers are positioned, together with microphones. In addition to the existing system loudspeakers, (acoustically) active headrests may also be employed. The term “Active Headrest” refers to a headrest into which one or more loudspeakers and one or more microphones are integrated such as the combinations of loudspeakers and microphones described above (e.g., combinations 217-220). The loudspeakers positioned in the room are used, i.a., to project useful signals, for example music, into the room. This leads to the formation of echoes. Again, “echo” refers to a useful signal (e.g. music) that is received by a microphone located in the same room as the playback loudspeaker(s). The microphones positioned in the room record useful signals as well as other signals, such as ambient noise or speech. The ambient noise may be generated by a multitude of sources, such as road traction, ventilators, wind, the engine of the vehicle or it may consist of other disturbing sound entering the room. The speech signals, on the other hand, may come from any passengers present in the vehicle and, depending on their intended use, may be regarded either as useful signals or as sources of disruptive background noise.
The signals from the two microphones integrated into the headrests and positioned in regions in which a telephone call should be rendered unintelligible must first of all be cleansed of echoes. For this purpose, in addition to the aforementioned microphone signals, corresponding reference signals (in this case useful stereo signals such as music signals, and the generated masking signal) are fed into the AEC module. As output signals the AEC module provides, for each of the two microphones, a corresponding error signal EL/R(n, k) from the adaptive filter, an output signal PFL/R(n, k) of the adaptive post filter, and the echo signal M̃L/R(n, k) of the useful signal (e.g. music) as received by the corresponding microphone.
In the noise estimation module 113 the (ambient) noise signal ÑL/R(n, k) present at each microphone position is estimated based on the error signals EL/R(n, k). In the noise reduction module 114 a further reduction of ambient noise is carried out based on the output signals of the adaptive post filters PFL/R(n, k), which also suppress what is left of the echo and part of the ambient noise. The output of the noise reduction module 114 is then an estimate of the speech signal S̃(n, k) coming from the microphones that has been largely cleansed of ambient noise. Using the thus obtained isolated estimates of the useful signal's echo signal M̃L/R(n, k), the background noise signal ÑL/R(n, k) and the speech signal S̃(n, k) as found in the area in which the conversation is to be rendered unintelligible, together with the signal I(n) (which will be discussed in greater detail further below), the power spectral density P(n,k) is calculated in the gain calculation module 115. On the basis of these calculations, the magnitude frequency response value of the masking signal G(n,k) is then calculated. The power spectral density P(n,k) should be configured to ensure that a masking signal is only generated when the near or distant speaker is active and only in the spectral regions in which conversation is taking place. In principle, the power spectral density P(n,k) could also be used directly to generate the frequency response value of the masking signal G(n, k); however, because of the high, narrowband dynamics of this signal, this could result in a signal that does not possess sufficient masking qualities. For this reason, instead of using the power spectral density P(n,k) directly, its masking threshold G(n,k) is used to produce the magnitude frequency response value of the desired masking signal.
In the masking model module 116, the input signal, which is the power spectral density P(n,k), is used to calculate the masking threshold of the masking signal G(n,k) on the basis of the masking model implemented there. The high narrowband dynamic peaks of the power spectral density P(n,k) are clipped by the masking model, as a result of which the masking in these narrow spectral regions becomes insufficient. To compensate for this, a spread spectrum is generated for the masking signal in the spectral area surrounding these spectral peaks, which once again intensifies the masking effect locally, so that, despite the fact that this limits the dynamics of the masking signal, its effective spectral width is enhanced. A thus generated, time and spectral variant masking signal exhibits a minimum bias and is therefore met with greater acceptance by users. Furthermore, in this way the masking effect of the signal is enhanced.
In the masking signal calculation module 117 the white-noise phase frequency response of the white noise signal wn(n) is superimposed over the existing magnitude frequency response of the masking signal G(n,k), producing a complex masking signal which can then be converted from the spectral domain into the time domain. The end result of this is the desired masking signal mn(n) in the time domain, which, on the one hand, can be projected through the MIMO system into the corresponding bright zone and, on the other hand, must be fed into the AEC module as an additional reference signal, in order to cancel out the echo it causes in the microphone signals and to prevent feedback problems.
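The combination of a given magnitude response with a white-noise phase can be sketched in a few lines of Python using NumPy. This is a minimal illustration for a single block, not the implementation of module 117: the function name, the one-sided spectrum layout and the block length are assumptions, and a practical system would additionally need windowing and overlap-add between consecutive blocks.

```python
import numpy as np


def synthesize_masking_block(magnitude, rng):
    """Return one time-domain block whose magnitude spectrum equals `magnitude`
    and whose phase is taken from a white-noise block.

    magnitude: one-sided magnitude response G (bins 0 .. n_fft/2), assumed layout.
    rng:       numpy random Generator acting as the white noise source wn(n).
    """
    magnitude = np.asarray(magnitude, dtype=float)
    n_fft = 2 * (magnitude.size - 1)

    # White noise supplies only the phase frequency response.
    white = rng.standard_normal(n_fft)
    phase = np.angle(np.fft.rfft(white))

    # Complex masking spectrum: desired magnitude, white-noise phase.
    spectrum = magnitude * np.exp(1j * phase)

    # Back to the time domain.
    return np.fft.irfft(spectrum, n=n_fft)


rng = np.random.default_rng(0)
G = np.linspace(1.0, 0.0, 129)        # hypothetical magnitude response, 129 bins
mn_block = synthesize_masking_block(G, rng)
```

Because the phase of a real noise block is 0 or π at the DC and Nyquist bins, the synthesized block is real-valued and its magnitude spectrum reproduces G.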
The switch control module 118 receives all microphone signals present in the room as its input signals and, based on these, furnishes at its output the time variant, binary weighted signal I(n). This signal indicates whether (I(n)=1) or not (I(n)=0) the estimated speech signal S̃(n,k) originates from the desired position DesPosIdx, which in this case is the position of the near speaker. Only when the thus estimated position of the source of speech corresponds to the known position of the near speaker DesPosIdx, assumed by default or choice, will a masking signal be generated; otherwise, i.e., when the estimated speech signal S̃(n,k) contained in the microphone signal originates from another person in the room, the generation of a masking signal will be prevented. Of course, data from seat detection sensors or cameras could also be evaluated, if available, as an alternative or additional source of input. This would simplify the process considerably and make the system more resistant against potential errors when detecting the signal of the near speaker.
Referring to
As shown in
The upper right section of
Each loudspeaker contributes to the microphone signal and the echo signal included therein in that the signals broadcast by the loudspeakers are received by each of the microphones after being filtered with a respective room impulse response (RIR) and superimposed over each other to form a respective total echo signal. For example, the average RIR of the left channel signal xL(n) of the stereo signal x(n) from the respective loudspeaker to the left microphone can be described as:
and for the left channel signal xL(n) of the stereo signal x(n) from the respective loudspeaker to the right microphone as:
Accordingly, the average RIR of the right channel signal xR(n) of the stereo signal x(n) from the respective loudspeaker to the right microphone can be described as:
and for the right channel signal xR(n) of the stereo signal x(n) from the respective loudspeaker to the left microphone as:
Additionally, masking signal mn(n) generates an echo which is also received by the two microphones.
A typical situation, in which a speaker sits on one of the rear seats and a listener sits on one of the front seats and the listener should not understand what the speaker on the rear seat says and masking sound is radiated from loudspeakers in the headrest of the listener's seat, is depicted in
and the average RIR
The following description is based on the assumption that the speaker sits on the right rear seat and the listener on the left front seat (driver's seat), wherein the listener should not understand what the speaker says. Any other constellations of speaker and listener positions are applicable as well. Under the above circumstances the total echo signals EchoL(n) and EchoR(n) received by the left and right microphones are as follows:
EchoL(n)=xL(n)*
EchoR(n)=xL(n)*
wherein “*” is a convolution operator.
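The superposition described above, in which each source signal is convolved with its room impulse response and the results are summed at a microphone, can be sketched as follows. The dictionary keys, the function name and the toy impulse responses are hypothetical and chosen only for illustration; they are not the averaged RIRs of the arrangement.

```python
import numpy as np


def total_echo(sources, rirs):
    """Sum of the convolutions of each source signal with its RIR to one microphone.

    sources: dict mapping a source name to its time signal.
    rirs:    dict mapping the same names to the RIR from that source to the microphone.
    """
    n_out = max(len(sources[name]) + len(rirs[name]) - 1 for name in sources)
    echo = np.zeros(n_out)
    for name, signal in sources.items():
        contribution = np.convolve(signal, rirs[name])  # "*" in the text
        echo[: contribution.size] += contribution
    return echo


# Toy data: three uncorrelated inputs (left channel, right channel, masking signal).
sources = {
    "xL": np.array([1.0, 0.0, 0.0]),
    "xR": np.array([0.0, 1.0, 0.0]),
    "mn": np.array([0.0, 0.0, 2.0]),
}
rirs_to_left_mic = {
    "xL": np.array([0.5, 0.25]),
    "xR": np.array([1.0]),
    "mn": np.array([0.1, 0.1]),
}
echo_left = total_echo(sources, rirs_to_left_mic)
```

The same function applied with the RIRs to the right microphone would yield the corresponding echo for that microphone.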
In case of K=3 uncorrelated input signals xL(n), xR(n) and mn(n) and I=2 microphones (in the headrest), K·I=6 different independent adaptive systems are established, which may serve to estimate the respective RIRs
The echoes of the useful signal as recorded by the left microphone, which outputs signal mL(n), and the right microphone, which outputs signal mR(n), serve as first output signals of the AEC module 300 and can be estimated as follows:
{tilde over (m)}L(n)=xL(n)·
{tilde over (m)}R(n)=xL(n)·
The error signals eL(n), eR(n) serve as second output signals of the AEC module 300 and can be calculated as follows:
eL(n)=MicL(n)−(xL(n)*
eR(n)=MicR(n)−(xL(n)*
From the above equations it can be seen that the error signals eL(n) and eR(n) ideally contain only potentially existing noise or speech signal components. The error signals eL(n) and eR(n) are supplied to the post filter module 409, which outputs third output signals pfL(n) and pfR(n) of the AEC module 300 which can be described as:
pfL(n)=eL(n)*pL(n), and
pfR(n)=eR(n)*pR(n).
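The relation between microphone signal, estimated echoes, error signal and post filter output can be illustrated with a small simulation. The sketch below assumes idealized conditions in which the estimated RIRs equal the true ones, so that the error signal reduces to the speech component; all names and signal lengths are hypothetical.

```python
import numpy as np


def error_signal(mic, references, rir_estimates):
    """Microphone signal minus the sum of the estimated echoes.

    Each reference signal is assumed to have the same length as `mic`.
    """
    echo_estimate = np.zeros(mic.size)
    for x, h in zip(references, rir_estimates):
        echo_estimate += np.convolve(x, h)[: mic.size]
    return mic - echo_estimate


def apply_post_filter(error, coefficients):
    """Convolve the error signal with the post filter coefficients."""
    return np.convolve(error, coefficients)[: error.size]


rng = np.random.default_rng(1)
n = 64
speech = rng.standard_normal(n)                              # near-speaker component
references = [rng.standard_normal(n) for _ in range(3)]      # stand-ins for xL, xR, mn
rirs = [0.1 * rng.standard_normal(8) for _ in range(3)]      # toy RIRs to one microphone

mic = speech + sum(np.convolve(x, h)[:n] for x, h in zip(references, rirs))
e = error_signal(mic, references, rirs)          # ideal case: RIR estimates are exact
pf = apply_post_filter(e, np.array([1.0]))       # pass-through post filter for illustration
```

With exact RIR estimates, `e` contains only the speech component; in practice residual echoes remain and the post filter coefficients are adapted to suppress them.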
The adaptive post filter 409 is operated to suppress potential residual echoes present in the error signals eL(n) and eR(n). The residual echoes are convolved with coefficients pL(n) and pR(n) of the post filter 409, which serves as a type of time invariant, spectral level balancer. In addition to the coefficients pL(n) and pR(n) of the adaptive post filter the adaptive step size
Input signals $X_k(e^{j\Omega}, n)$:
$X_k(e^{j\Omega}, n) = \mathrm{FFT}\{x_k(n)\}$,
wherein
$x_k(n) = [x_k(nL-N+1), \ldots, x_k(nL+L-1)]^T$,
$x_k(n) = [x_0(n), x_1(n), x_2(n)] = [mn(n), x_L(n), x_R(n)]$,
L is the block length, N is the length of the adaptive filter, M=N+L−1 is the length of the fast Fourier transformation (FFT), k=[0, . . . , K−1], and K is the number of uncorrelated input signals.
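Echo signals $y_i(n)$:
$y_{i,\mathrm{Comp}}(n) = \mathrm{IFFT}\left\{\sum_{k=0}^{K-1} X_k(e^{j\Omega}, n)\,\tilde{W}_{k,i}(e^{j\Omega}, n)\right\}$,
wherein
$y_i(n) = [y_{i,\mathrm{Comp}}(M-L+1), \ldots, y_{i,\mathrm{Comp}}(M)]^T$,
which is a vector that includes the final L elements of $y_{i,\mathrm{Comp}}(n)$, i=[0, . . . , I−1], and I is the number of microphones.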
Error signals ei(n):
0 is a zero column vector with length M/2, and em(n) is an error signal vector with length M/2.
Input signal energy pi(ejΩ, n):
pi(ejΩ,n),
pi(ejΩ
pi(ejΩ
pi(ejΩ
α is a smoothing coefficient for the input signal energy and pMin is a valid minimal value of the input signal energy.
Adaption step size μi(ejΩ,n) [part 1]:
Adaption:
$W_{k,i}(e^{j\Omega}, n) = \tilde{W}_{k,i}(e^{j\Omega}, n-1) + \mathrm{diag}\{\mu_i(e^{j\Omega}, n)\}\,\mathrm{diag}\{X_k^{*}(e^{j\Omega}, n)\}\,E_i(e^{j\Omega}, n)$,
wherein
$W_{k,i}(e^{j\Omega}, n)$ are the coefficients of the adaptive filter without constraint,
$\tilde{W}_{k,i}(e^{j\Omega}, n)$ are the coefficients of the adaptive filter with constraint,
diag{x} is the diagonal matrix of vector x, and
x* is the complex conjugate of the (complex) value x.
Constraint:
wherein
$\tilde{w}_{k,i}(n)$ is a vector with the first M/2 elements of $\mathrm{IFFT}\{W_{k,i}(e^{j\Omega}, n+1)\}$.
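One adaptation step followed by the constraint can be sketched in Python as follows. The update line follows the adaption equation given above, with the diagonal matrices written as element-wise products. The constraint step, in which the first M/2 time-domain coefficients are kept, the remainder is set to zero and the result is transformed back, is the usual form of such a constraint and is an assumption here; the function and variable names are hypothetical.

```python
import numpy as np


def constrained_update(w_tilde_prev, x_spec, e_spec, mu):
    """One frequency-domain adaptation step followed by a time-domain constraint.

    w_tilde_prev: constrained coefficients of the previous block (length M, complex).
    x_spec:       spectrum of the input block, X_k.
    e_spec:       spectrum of the error block, E_i.
    mu:           per-bin adaptation step size.
    """
    # Unconstrained update: W = W~(n-1) + diag{mu} diag{X*} E
    w_unconstrained = w_tilde_prev + mu * np.conj(x_spec) * e_spec

    # Constraint (assumed form): keep the first M/2 time-domain taps, zero the rest.
    m = w_unconstrained.size
    w_time = np.fft.ifft(w_unconstrained)
    w_time[m // 2:] = 0.0
    return np.fft.fft(w_time)


rng = np.random.default_rng(0)
M = 16
w0 = np.zeros(M)
w0[: M // 2] = rng.standard_normal(M // 2)       # already constrained start filter
W0 = np.fft.fft(w0)
X = np.fft.fft(rng.standard_normal(M))
E = np.fft.fft(rng.standard_normal(M))

W1 = constrained_update(W0, X, E, np.full(M, 0.1))
```

After the step, the inverse transform of `W1` is zero in its second half, which is the property the constraint is meant to enforce.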
System distance Gi(ejΩ, n):
Gi(ejΩ
Δi(ejΩ
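$G_i(e^{j\Omega}, n) = [G_0(e^{j\Omega}, n), G_1(e^{j\Omega}, n)] = [G_L(e^{j\Omega}, n), G_R(e^{j\Omega}, n)]$,
$\Delta_i(e^{j\Omega}, n) = [\Delta_0(e^{j\Omega}, n), \Delta_1(e^{j\Omega}, n)] = [\Delta_L(e^{j\Omega}, n), \Delta_R(e^{j\Omega}, n)]$,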
wherein
C is the constant which determines the sensitivity of the double-talk detection (DTD).
Adaption step size μi(ejΩ,n) [part 2]:
wherein
m=[0, . . . , M−1], Pi(ejΩ, n), μMax is the upper permissible limit and μMin is the lower permissible limit of μi (ejΩ
Adaptive post filter Pi (ejΩ
Pi(ejΩ
PFi(ejΩ
Pi(ejΩ
Pi(ejΩ
wherein
PMax(ejΩ,n)=(ejΩ
PMin(ejΩ,n)=(ejΩ
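$P_i(e^{j\Omega}, n) = [P_0(e^{j\Omega}, n), P_1(e^{j\Omega}, n)] = [P_L(e^{j\Omega}, n), P_R(e^{j\Omega}, n)]$, and
$PF_i(e^{j\Omega}, n) = [PF_0(e^{j\Omega}, n), PF_1(e^{j\Omega}, n)] = [PF_L(e^{j\Omega}, n), PF_R(e^{j\Omega}, n)]$.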
Thus, the output signals of the AEC module can be described as follows:
Echoes $\tilde{M}_L(e^{j\Omega}, n)$, $\tilde{M}_R(e^{j\Omega}, n)$ of the useful signals are calculated according to
{tilde over (M)}L(ejΩ,n)=XL(ejΩ,n)+
{tilde over (M)}R(ejΩ,n)=XL(ejΩ,n)
Calculating in the spectral domain the useful signal echoes contained in the microphone signals allows for determining what intensity and coloring the desired signals have at the locations where the microphones are disposed, which are the locations where the speech of the near-speaker should not be understood (e.g., by a person sitting at the driver position). This information is important for evaluating whether the present useful signal (e.g., music) at a discrete point in time n is sufficient to mask a possibly occurring signal from the near-speaker so that the speech signal cannot be heard at the listener's position (e.g., driver position). If this is true, no additional masking signal mn(n) needs to be generated and radiated to or at the driver position.
Error Signals EL(ejΩ, n), ER(ejΩ, n):
The error signals EL(ejΩ, n), ER(ejΩ, n) include, in addition to minor residual echoes, an almost pure background noise signal and the original signal from the near-speaker.
Output Signals PFL(ejΩ, n), PFR (ejΩ, n) of the Adaptive Post Filter:
In contrast to the error signals EL(ejΩ, n), ER(ejΩ, n), the output signals PFL(ejΩ, n), PFR(ejΩ, n) of the adaptive post filter contain no significant residual echoes due to the time-invariant, adaptive post filtering, which provides a kind of spectral level balancing. Post filtering has almost no negative influence on the speech signal components of the near-speaker contained in the output signals PFL(ejΩ, n), PFR(ejΩ, n) of the adaptive post filter, but it does affect the background noise that is also contained in them. Post filtering modifies the coloring of the background noise, at least when active useful signals are involved, so that the background noise level is ultimately reduced; because of this modification, the post-filtered background noise cannot serve as a basis for estimating the background noise. For this reason, the error signals EL(ejΩ, n), ER(ejΩ, n) may be used to estimate the background noise Ñ(ejΩ, n), which may form the basis for the evaluation of the masking effect provided by the (stereo) background noise.
The sole input signals of noise estimation module 500 are the error signals EL(n,k) and ER(n,k) from the two microphones coming from the AEC module. Why precisely these signals are being used for the estimation was explained further above. From
The power of each input signal, i.e., of the error signals EL(n,k) and ER(n,k), is determined by calculating (estimating) their power spectral densities |EL(n,k)|², |ER(n,k)|² and then forming their maximum value, the maximum power spectral density |E(n,k)|². Optionally, the maximum power spectral density |E(n,k)|² may be smoothed over time, in which case the smoothing depends on whether the maximum power spectral density |E(n,k)|² is rising or falling: if it is rising, the smoothing coefficient τTUp is applied; if it is falling, the smoothing coefficient τTDown is used. The maximum power spectral density |E(n,k)|², smoothed over time where this option is used, then serves as the input signal for the spectral smoothing module 604, where the signal undergoes spectral smoothing. In the spectral smoothing module 604 it is then decided whether the smoothing is to be carried out from low to high frequencies (τSUp active), from high to low frequencies (τSDown active), or in both directions. Spectral smoothing in both directions using the same smoothing coefficient (τSUp=τSDown) may be appropriate when a spectral bias should be prevented. As it may be desirable to estimate the background noise as authentically as possible, spectral distortions may be inadmissible, which in this case necessitates spectral smoothing in both directions.
Then, the spectrally smoothed maximum power spectral density Ê(n, k) is fed into the non-linear smoothing module 605. In the non-linear smoothing module 605, any abrupt disruptive noise still remaining in the spectrally smoothed maximum power spectral density Ê(n, k), such as conversation, the slamming of doors or tapping on the microphone, is suppressed.
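The formation of the maximum power spectral density and the temporal and spectral smoothing described above can be illustrated with a minimal sketch. The function names, the use of plain Python lists for spectra, and the convention that each smoothing coefficient weights the previous value (so a coefficient closer to 1 means stronger smoothing) are assumptions made for illustration only; the specification does not fix these details.

```python
def max_psd(e_left, e_right):
    """Per-bin maximum of the two power spectral densities |EL|^2 and |ER|^2."""
    return [max(abs(l) ** 2, abs(r) ** 2) for l, r in zip(e_left, e_right)]


def smooth_over_time(previous, current, tau_up, tau_down):
    """One step of first-order recursive smoothing per frequency bin.

    tau_up is used where the level is rising, tau_down where it is falling.
    Assumed convention: out = tau * previous + (1 - tau) * current.
    """
    out = []
    for p, c in zip(previous, current):
        tau = tau_up if c > p else tau_down
        out.append(tau * p + (1.0 - tau) * c)
    return out


def smooth_over_frequency(spectrum, tau_s_up, tau_s_down):
    """First-order smoothing along the frequency axis in both directions.

    A pass from low to high bins uses tau_s_up, followed by a pass from
    high to low bins using tau_s_down. Setting a coefficient to 0 disables
    the corresponding direction.
    """
    out = list(spectrum)
    for k in range(1, len(out)):
        out[k] = tau_s_up * out[k - 1] + (1.0 - tau_s_up) * out[k]
    for k in range(len(out) - 2, -1, -1):
        out[k] = tau_s_down * out[k + 1] + (1.0 - tau_s_down) * out[k]
    return out
```

With both spectral coefficients set to the same value, each bin is influenced by its lower and its upper neighbors, which is one way to avoid the one-sided spectral bias mentioned above.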
The non-linear smoothing module 605 in the arrangement shown in
Ñ(n,k)=max{Ñ(n,k), MinNoiseLevel}.
If the echoes of the useful signals, estimations of which may be taken directly from the AEC module, or the estimated background noise, as derived from the noise estimation module, do not provide adequate masking of the speech signal in the region in which the conversation should not be understood, then a masking signal mn(n) is calculated. For this, the speech signal component S̃(n, k) within the microphone signal is estimated, as this serves as the basis for the generation of the masking signal mn(n). One possible method for determining the speech signal component S̃(n, k) will be described below.
As may be deduced from
As the first part, a beamformer is used, which essentially amounts to a delay-and-sum beamformer, in order to take advantage of its spatial filter effect. This effect is known to bring about a reduction in ambient noise (depending on the distance dMic between the microphones), predominantly in the upper spectral range. Instead of compensating for the delay, as is typically done when a delay-and-sum beamformer is used, here a time-variant spectral phase correction is carried out with the aid of an all-pass filter A(n,k), calculated from the input signals according to the following equation:
Before performing the calculation, it should be ensured that both channels have the same phase in relation to the speech signal. Otherwise, a partially destructive overlapping of speech signal components will lead to the unwanted suppression of the speech signal, lowering the signal-to-noise ratio (SNR). The following signal is provided at the output of the all-pass filter:
PFL(n,k)A(n,k)=|PFL(n,k)|e^{j∠{PFR(n,k)}}
When employing the phase correction segment A(n,k), only the magnitude frequency response of the signal-supplying microphone (in this case the signal |PFL(n,k)|, originating from the left microphone) is provided at the output, whereas the phase response is taken from the other microphone (here ∠{PFR(n,k)}, from the right microphone). In this manner, coherent incident signal components, such as those of the speaker, remain untouched, whereas incoherent incident sound components, such as ambient noise, are reduced in the calculation. The maximum attenuation that can generally be reached using a delay-and-sum beamformer is 3 dB; at a microphone distance of dMic=0.2 m (roughly corresponding to the microphone spacing in a headrest) and a sound velocity of c=343 m/s (at θ=20° C.), this can only be achieved at or above a frequency of:
which illustrates the calculation of the cutoff frequency f, above which the noise-suppressing effect of the spatial filtering of a non-adaptive beamformer with two microphones positioned at the distance dMic becomes apparent. Because ambient noise in a motor vehicle is strongly colored toward the red end of the spectrum, meaning that its components are predominantly low-frequency sound (in the range of approximately f<1 kHz), the noise suppression of the beamformer, that is, its spatial filtering, which only affects high-frequency noise, can only suppress certain parts of the ambient noise, such as the sounds coming from the ventilator or an open window.
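The phase correction and the role of the microphone distance can be illustrated with a short sketch. It rests on several assumptions that are not taken from the specification: the function names are hypothetical, the beamformer output is formed as the average of the phase-corrected left channel and the right channel, and the cutoff frequency is taken as the frequency at which the microphone distance equals half a wavelength, f = c / (2·dMic), which is a common rule of thumb for a two-microphone array.

```python
import cmath


def phase_align(pf_left, pf_right):
    """Keep the magnitude of the left channel but adopt the phase of the right.

    Corresponds to |PFL(n,k)| * exp(j * angle(PFR(n,k))) in every bin k.
    """
    return [abs(l) * cmath.exp(1j * cmath.phase(r))
            for l, r in zip(pf_left, pf_right)]


def beamform(pf_left, pf_right):
    """Average the phase-corrected left channel with the right channel.

    Assumption: a plain two-channel average after phase correction.
    """
    aligned = phase_align(pf_left, pf_right)
    return [0.5 * (a + r) for a, r in zip(aligned, pf_right)]


def half_wavelength_frequency(d_mic_m, c_m_per_s=343.0):
    """Frequency at which d_mic equals half a wavelength (assumed cutoff)."""
    return c_m_per_s / (2.0 * d_mic_m)
```

Under these assumptions, a microphone distance of 0.2 m and a sound velocity of 343 m/s give a cutoff of about 857.5 Hz, which is consistent with the statement that the spatial filtering mainly affects noise above the dominant low-frequency range of vehicle noise.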
The second part of the noise suppression that takes place in the noise reduction module 800 is performed with the aid of an optimum filter, the Wiener Filter with a transfer function W(n,k), which carries out the greater portion of the noise reduction, in particular, as mentioned above, in motor vehicles. The transfer function W(n,k) of the Wiener Filter can be calculated as follows:
wherein
W(n, k)=max{WMin, W(n, k)},
W(n, k)=min{WMax, W(n, k)},
WMax=upper admissible limit of W(n, k),
WMin=lower admissible limit of W(n, k).
From the above equation it can be seen that the Wiener Filter's transfer function W(n,k) should also be restricted and that the limitation to the minimally admissible value is of particular importance. If the transfer function W(n,k) is not restricted to a lower limit of WMin≈−12 dB, . . . , −9 dB, so-called “musical tones” will form. These will not necessarily have an impact on the masking algorithm, but they become important when the extracted speech signal is to be provided elsewhere, for example, to a speakerphone algorithm. For this reason, and because it does not negatively affect the Sound Shower algorithm, the restriction is applied at this stage. The output signal S(n,k) of the noise reduction module 800 may be calculated according to the following equation:
Applying the scaling factor NoiseScale, with NoiseScale ≥ 1, for the weighting of the estimated ambient noise signal Ñ(n, k) has the following effect: the higher the chosen scaling factor NoiseScale, the lower the risk of ambient noise mistakenly being estimated as speech. The sensitivity of the speech detector, however, is reduced in the process, increasing the probability that speech components actually contained in the microphone signals will not be correctly detected. Speech signals at lower levels thereby run a greater risk of not triggering a masking noise.
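The interplay of the limits WMin and WMax with the scaling factor NoiseScale can be sketched as follows. The gain rule used here, W = 1 − NoiseScale·N/|X|², is a common textbook form of a Wiener-type gain and is an assumption, not the equation of the specification; the function name, the parameter names, and the conversion of the dB limits to linear amplitude factors are likewise chosen for illustration only.

```python
def limited_wiener_gain(signal_psd, noise_psd, noise_scale=1.0,
                        w_min_db=-12.0, w_max_db=0.0):
    """Per-bin Wiener-type gain, restricted to [WMin, WMax].

    signal_psd: power spectral density of the noisy input per bin
    noise_psd:  estimated noise power spectral density per bin
    noise_scale: weighting of the noise estimate (NoiseScale >= 1)
    """
    w_min = 10.0 ** (w_min_db / 20.0)  # assumed: limits given as amplitude dB
    w_max = 10.0 ** (w_max_db / 20.0)
    gains = []
    for s, n in zip(signal_psd, noise_psd):
        # Textbook gain (assumption); fall back to the floor for empty bins.
        w = 1.0 - noise_scale * n / s if s > 0.0 else w_min
        w = max(w_min, w)  # lower limit, avoids "musical tones"
        w = min(w_max, w)  # upper limit
        gains.append(w)
    return gains
```

Raising `noise_scale` lowers the gain in every bin, so fewer noise-dominated bins pass as speech, at the cost of attenuating weak speech components as well, which mirrors the trade-off described above.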
As already mentioned, the time-variant spectra of the maximum value N̂(n, k) and the estimated speech signal Ŝ(n, k) are passed on to the comparison module 1107, where a comparison is made between the spectral progression of the estimated speech signal Ŝ(n, k) and the spectrum of the estimated ambient noise N̂(n, k).
The estimated speech signal Ŝ(n, k) is only used as the output signal P̂(n, k), so that P̂(n, k)=Ŝ(n, k), when it is larger than the maximum value N̂(n, k), meaning larger than the maximum of the useful signal's echo M̂(n, k) and the background noise N̂(n, k). Otherwise, no output signal P̂(n, k) will be formed, i.e., P̂(n, k)=0 will be used as the output signal. In other words, only in those cases in which the ambient noise signal and/or the music signal (useful signal echo) is insufficient for a “natural” masking of the existing speech signal will an additional masking noise mn(n) be generated and its frequency response P(n,k) be determined. The output signal P̂(n, k) of the comparison module 1107 may not be applied directly, as at this point it is not yet known from which speaker the signal originates. Only if the signal originates from the near-speaker, sitting, for example, on the right back seat, may the masking signal mn(n) be generated. In other cases, e.g., when the signal originates from a passenger sitting on the right front seat, it should not be generated. This information is represented by the weighting signal I(n), with which the output signal P̂(n, k) is weighted in order to obtain the output signal of the Gain Calculation Block, i.e., the detected speech signal P(n,k). Ideally, the detected speech signal P(n,k) should only contain the power spectral density of the near-speaker's voice as perceived at the listener's ear positions, and this only when it is larger than the music or ambient noise signal present at the time at these very positions.
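The comparison and weighting described above amount to a per-bin gate. The following minimal sketch uses a hypothetical function name and plain lists of per-bin power values; it assumes that the maximum value is formed bin by bin from the useful signal echo and the background noise and that the weighting with I(n) is a simple multiplication.

```python
def detected_speech(speech_psd, echo_psd, noise_psd, index_signal):
    """Return P(n,k) for one block n.

    speech_psd:   estimated speech spectrum per bin
    echo_psd:     estimated useful signal echo per bin
    noise_psd:    estimated background noise per bin
    index_signal: I(n), 1 if the speech stems from the near-speaker, else 0
    """
    out = []
    for s, m, n in zip(speech_psd, echo_psd, noise_psd):
        natural_masker = max(m, n)  # music or noise, whichever is louder
        p_hat = s if s > natural_masker else 0.0  # keep only unmasked speech
        out.append(index_signal * p_hat)  # gate by talker position
    return out
```

For example, `detected_speech([5.0, 1.0, 3.0], [2.0, 2.0, 1.0], [1.0, 0.5, 4.0], 1)` keeps only the first bin, where the speech exceeds both the echo and the noise, and the same call with `index_signal` set to 0 yields zeros in all bins.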
As shown in
The thus spectrally limited microphone signals are then smoothed over time in temporal smoothing modules 1204 to provide P smoothed microphone signals m1(n), . . . , mP(n). Here a classic smoothing filter such as, for example, a first-order infinite impulse response (IIR) low-pass filter may be used in order to conserve energy. P index signals I1(n), . . . , IP(n) are then generated by a module 1205 from the P smoothed microphone signals m1(n), . . . , mP(n). The index signals are digital signals and can therefore only assume a value of 1 or 0; at the point in time n, only the signal possessing the highest level takes on the value of 1, representing the maximum microphone level over positions. As previously mentioned, the signal processing may be mainly carried out in the spectral range. This implicitly presupposes processing in blocks, the length of which is determined by a feeding rate. Subsequently, in a module 1206, a histogram is compiled out of the most recent L samples of index vectors Ip(n), with
Ip(n)=[Ip(n−L+1), . . . ,Ip(n)] and p=[1, . . . ,P],
meaning that the number of times at which the maximum speech signal level appeared at the position p is counted. These counts are then passed on to a maximum detector module 1207 in the form of the signals Î1(n), . . . , ÎP(n) at each time interval n. In the maximum detector module 1207 the signal with the highest count Ĩ1(n) at the time point n is identified and passed on to a comparison module 1208, where it is compared with the variable DesPosIdx, i.e., with the presupposed position of the near-speaker. If Ĩ1(n) and DesPosIdx correspond, this is confirmed with an output signal I(n)=1; if it is otherwise determined that the estimated speech signal Ŝ(n, k) does not originate at the position of the near-speaker, i.e., that Ĩ1(n)≠DesPosIdx, I(n) becomes 0.
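The histogram-based position decision described above can be sketched as follows. The class name, the zero-based position indices, and the tie-breaking rule (the lowest index wins) are assumptions for illustration; the specification does not state how ties are resolved.

```python
from collections import deque


class PositionDetector:
    """Decides whether the loudest talker sits at the desired position."""

    def __init__(self, num_positions, history_length, des_pos_idx):
        self.num_positions = num_positions          # P microphone positions
        self.des_pos_idx = des_pos_idx              # assumed zero-based
        self.history = deque(maxlen=history_length)  # most recent L indices

    def update(self, smoothed_levels):
        """Process one block of P smoothed levels and return I(n) (1 or 0)."""
        positions = range(self.num_positions)
        # Index signal: only the position with the highest level counts.
        loudest = max(positions, key=lambda p: smoothed_levels[p])
        self.history.append(loudest)
        # Histogram over the most recent L blocks.
        counts = [0] * self.num_positions
        for p in self.history:
            counts[p] += 1
        # Maximum detector and comparison with the desired position.
        most_frequent = max(positions, key=lambda p: counts[p])
        return 1 if most_frequent == self.des_pos_idx else 0
```

Because the decision is based on counts over the most recent L blocks rather than on a single block, a brief level peak at another position does not immediately switch I(n) to 0.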
As can be seen in
As can be seen in
In a module for renormalization of the spread spectrum estimate, the absolute masking threshold T(n,m) is renormalized. This is necessary because applying the spreading function S(m) in the spreading block introduces an error, namely an unwarranted increase of the signal's entire energy. Based on the spreading function S(m), the renormalization value Ce(n,m) is calculated in the module 1506 for renormalization of the spread spectrum estimate and is then used to correct the absolute masking threshold T(n,m) in a module 1507 for the renormalization of the masked threshold, finally producing the renormalized, absolute masking threshold Tn(n,m). In a transform to SPL module 1508, a reference sound pressure level (SPL) value SPLRef is applied to the renormalized, absolute masking threshold Tn(n,m) to transform it into the acoustic sound pressure signal TSPL(n,m), which is then fed into a Bark gain calculation module 1509, where its value is modified only by the variable GainOffset, which can be set externally. The effect of the parameter GainOffset can be summed up as follows: the larger the variable GainOffset is, the larger the amplitude of the resulting masking signal mn(n) will be. The sum of the signal TSPL(n,m) and the variable GainOffset may optionally be smoothed over time in a temporal smoothing module 1510, which may use a first-order IIR low-pass filter with the smoothing coefficient β. The output signal of the temporal smoothing module 1510, which is a signal BG(n,m), is then converted from the Bark scale into the linear spectral range, finally resulting in the frequency response of the masking noise G(n,k). The masking model module 1400 may be based on the known Johnston Masking Model, which calculates the masked threshold based on an audio signal in order to predict which components of the signal are inaudible.
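The last stages of this chain, i.e., adding GainOffset, smoothing with the coefficient β, and converting from the Bark scale to linear frequency bins, can be sketched as follows. Several details are assumptions that the specification does not state: the values are treated as levels in dB, β weights the previous output, the Bark band of a frequency is obtained with Traunmüller's approximation z = 26.81·f/(1960+f) − 0.53, and every frequency bin simply takes the gain of its Bark band. All function names are hypothetical.

```python
def bark_band(freq_hz):
    """Zero-based Bark band index (Traunmueller approximation, assumed)."""
    z = 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53
    return max(0, int(z))


def bark_gain_db(t_spl_db, previous_bg_db, gain_offset_db, beta):
    """BG(n,m): TSPL(n,m) + GainOffset, smoothed by a first-order IIR filter.

    Assumed convention: out = beta * previous + (1 - beta) * input.
    """
    return [beta * prev + (1.0 - beta) * (t + gain_offset_db)
            for t, prev in zip(t_spl_db, previous_bg_db)]


def bark_to_linear_bins(bg_db, bin_freqs_hz):
    """G(n,k): each bin takes the linear gain of the Bark band it falls into."""
    last_band = len(bg_db) - 1
    return [10.0 ** (bg_db[min(bark_band(f), last_band)] / 20.0)
            for f in bin_freqs_hz]
```

A larger `gain_offset_db` raises every band level by the same amount before smoothing and therefore increases the amplitude of the resulting masking noise in all bins.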
Referring back to
Referring now to
As may be seen in
Referring to
Referring to
It is understood that modules as used in the systems and methods described above may include hardware or software or a combination of hardware and software.
While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
15150040 | Jan 2015 | EP | regional |
Number | Date | Country | |
---|---|---|---|
20160196818 A1 | Jul 2016 | US |