The present invention generally relates to systems and methods for speech enhancement using acoustic sensor arrays.
Speech enhancement using microphone arrays is a technique known in the art, in which the microphones are typically arranged in a line so that their delays can be synchronized according to the distance of each microphone from the speaker, such as shown in
The formula for a homogeneous linear array beam pattern is:
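For a homogeneous (uniform) linear array of N identical omnidirectional elements with inter-element spacing d, wavenumber k = 2π/λ, and arrival angle θ measured from broadside, the conventional textbook form of this beam pattern is:

$$B(\theta)=\frac{1}{N}\left|\sum_{n=0}^{N-1} e^{\,j n k d \sin\theta}\right|=\left|\frac{\sin\!\left(N k d \sin\theta / 2\right)}{N \sin\!\left(k d \sin\theta / 2\right)}\right|$$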
and the response function (attenuation in dB) is given in the graph shown in
Affes et al. (1997) teaches a signal subspace tracking algorithm for microphone array speech processing, for enhancing speech in adverse acoustic environments. That work proposes a method of adaptive microphone array beamforming using matched filters with signal subspace tracking, for enhancing near-field speech signals by reducing multipath and reverberation. The method is mainly targeted at reducing the reflections and reverberations of sound sources that do not propagate along direct paths, such as in the case of microphones of hand-held mobile devices. The setup used by Affes et al. (1997) is discussed in Sec. II.A thereof: twelve microphones were positioned on the screen of a computer workstation, with a spacing of 7 cm between each pair.
Jan et al. (1996) teaches microphone arrays and signal processing for high-quality sound capture in noisy reverberant enclosures, incorporating matched filtering of individual sensors and parallel processing to provide spatial volume selectivity that mitigates noise interference and multipath distortion. This technique uses randomly distributed transducers.
Capon (1969) teaches a high-resolution frequency-wavenumber spectrum analysis, which is referred to as the minimum variance distortionless response (MVDR) beamformer. This well-known algorithm is used to minimize the noise received by a sensor array, while preserving the desired source without distortion.
U.S. Pat. No. 7,809,145 teaches methods and apparatus for signal processing. A discrete time domain input signal x_m(t) is produced from an array of microphones M_0, …, M_M. A listening direction may be determined for the microphone array. The listening direction is used in a semi-blind source separation to select the finite impulse response filter coefficients b_0, b_1, …, b_N to separate out different sound sources from the input signal x_m(t). One or more fractional delays may optionally be applied to selected input signals x_m(t) other than an input signal x_0(t) from a reference microphone M_0.
U.S. Pat. No. 8,204,247 teaches an audio system that generates position-independent auditory scenes using harmonic expansions based on the audio signals generated by a microphone array. Audio sensors are mounted on the surface of a sphere. The number and locations of the audio sensors on the sphere are designed to enable the audio signals generated by those sensors to be decomposed into a set of eigenbeam outputs. Compensation data, corresponding to at least one of the estimated distance and the estimated orientation of the sound source relative to the array, are generated from the eigenbeam outputs and used to generate an auditory scene. Compensation based on estimated orientation involves steering a beam formed from the eigenbeam outputs in the estimated direction of the sound source to increase direction independence, while compensation based on estimated distance involves frequency compensation of the steered beam to increase distance independence.
U.S. Pat. No. 8,005,237 teaches a beamforming post-processor technique with enhanced noise suppression capability. The beamforming post-processor technique is a non-linear post-processing technique for sensor arrays (e.g., microphone arrays) that improves their directivity and signal separation capabilities. The technique works in so-called instantaneous direction-of-arrival space, estimates the probability of sound coming from a given incident angle or look-up direction, and applies a time-varying, gain-based spatio-temporal filter for suppressing sounds coming from directions other than the sound source direction, resulting in minimal artifacts and musical noise.
The present invention provides a system for enhancing acoustic performances of at least one acoustic source in an adverse acoustic environment. According to some embodiments of the invention, the system comprises: (i) an array of acoustic sensors, each sensor having a different directivity; and (ii) an analysis module configured for optimizing signal enhancement of at least one source by correlating the sensors according to the respective position of the at least one source with respect to the directivity of the acoustic sensors. The analysis is based on reflections from reverberating surfaces in the specific acoustic environment, allowing a clean, source-enhanced signal to be outputted, wherein the optimization and the sensors' directivity allow maintaining the sensor array in compact dimensions without affecting signal enhancement and separation.
According to some embodiments, the different directivity of each sensor is achieved by at least one of: (i) arranging the sensors in the array such that each is directed in a different direction; (ii) using sensors having different frequency sensitivities.
According to some embodiments, the analysis module computes a statistical estimate of the source signal using the cross-correlations and auto-correlations of the signals from the acoustic sensors, which contain both the desired source and a corrupting noise signal, together with the cross-correlations and auto-correlations of the interrupting noise signal alone, wherein the output estimate is given by a minimum variance distortionless response (MVDR) beamformer.
According to some embodiments, the system further comprises a learning module configured for adaptive learning of the acoustic characteristics of the environment in which the acoustic sensors array is placed, for separating source signals from noise signals.
According to some embodiments, the array of acoustic sensors comprises multiple omnidirectional microphones, non-omnidirectional microphones, sensors having different frequency sensitivities, or a combination thereof.
According to some embodiments, the system further comprises a multichannel analyzer for channeling signals from each of the acoustic sensors. For example, the multichannel analyzer may be a multiplexer.
According to some embodiments the system further comprises at least one holder for holding the multiple acoustic sensors of the array.
In some embodiments, the holder is configured to allow adjusting the direction of each sensor and/or the number of sensors in the array.
According to some embodiments, the holder comprises acoustic isolating and/or reflecting materials.
According to some embodiments, each sensor in the array is bundled with at least one loudspeaker, wherein the output of each loudspeaker is made such that interference correlated to the bundled sensor distorts the signals at the other microphones, thereby improving acoustic separation between the microphones in an active, synthetic manner.
According to some embodiments, the system further comprises at least one audio output means for outputting the clean, source-enhanced signal as audio.
According to some embodiments, at least one of the acoustic sensors in the array comprises at least one protective element and/or at least one directivity improving element.
According to some embodiments, the source signal is related to one of: human speech source, machine or device acoustic sound source, human sound source.
According to some embodiments, the system further comprises at least one additional remote acoustic sensor located remotely from the sensor array.
The present invention further provides a method for enhancing acoustic performances of at least one acoustic source in an adverse acoustic environment. The method, according to some embodiments thereof, includes at least the steps of: (a) receiving signals outputted by an array of acoustic sensors, each sensor having a different directivity; (b) analyzing the received signals for enhancement of acoustic signals from the at least one source, by correlating the received signals from the sensors according to the respective position of the at least one source with respect to the directivity of the acoustic sensors, the analysis being based on reflections from reverberating surfaces in the specific acoustic environment; and (c) outputting a clean, source-enhanced signal, wherein the analysis and the sensors' directivity allow maintaining the sensor array in compact dimensions without affecting source-signal enhancement and signal separation.
According to some embodiments, the analysis comprises computing a statistical estimate of the speech signal using the cross-correlations and auto-correlations of the signals from the acoustic sensors, which contain both the desired source and a corrupting noise signal, together with the cross-correlations and auto-correlations of the interrupting noise signal alone, wherein the output estimate is given by a minimum variance distortionless response (MVDR) beamformer.
According to some embodiments, the method further comprises the step of adaptively learning the acoustic characteristics of the environment in which the acoustic sensor array is placed, for improving the separation of source signals from noise signals.
According to some embodiments, the method further comprises the step of learning the timing performances of the acoustic sensors in the array.
According to some embodiments, the different directivity of each sensor is achieved by at least one of: (i) arranging the sensors in the array such that each is directed in a different direction; (ii) using sensors having different frequency sensitivities.
In the following detailed description of various embodiments, reference is made to the accompanying drawings that form a part thereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
The present invention, in some embodiments thereof, provides methods and systems for enhancing acoustic performances of one or more acoustic sources in an adverse acoustic environment and particularly for enhancing the source(s) signals.
According to some embodiments, the system comprises: an array of acoustic sensors compactly positionable with different directivities with respect to one another; and an analysis module configured for calculating and optimizing signal enhancement of the one or more sources, by correlating the sensors according to the respective position of the source(s) with respect to the directivity of the acoustic sensors, based on reverberations from reverberating surfaces in the specific acoustic environment, wherein the optimization and the sensors' directivity allow maintaining the sensor array in compact dimensions without affecting speech enhancement and speaker separation.
The term “directivity” refers to the ability of the sensors, and of the analysis of their output data, to distinguish between acoustic signals arriving from different locations, such as from the sound sources and/or from reflective surfaces. These reflected signals can originate from the sound source which the system aims to enhance, such as one or more speakers' speech signals, and from noise sources in the environment in which the system is located. This can be achieved, for example, by directing the sensors to the known or expected locations of noise and/or sound sources and/or to the reflective surfaces in the room. Another additional or alternative way to achieve directivity is by using sensors that have different frequency responsivity or sensitivity, i.e., sensors that respond better to one or more ranges of frequencies.
Additionally or alternatively, the directivity of the sensors can be improved by adding directing elements to the sensor array, or to the holder thereof, for enhancing reflected sound into the sensors of the array. This can be done, for instance: (i) by adding sound-reflecting materials to the holder of the sensors, arranged so as to direct acoustic signals reflected from the reflective surfaces in the room into the sensors of the array; and/or (ii) by adding directing means, such as fins, to the sensors themselves.
Reference is now made to
According to some embodiments, the analysis module is configured to receive output signals from all the microphones 111a-111d and identify speech-related signals of a speaker 10 from all microphones, using reverberation information therefrom to enhance the speech signal data, outputting “speech data” that is indicative of the speaker's speech. The analysis module 120 can also be adapted to reduce noise from the signals by operating one or more noise reduction algorithms. The speech data produced by the analysis module 120 can be translated into audio output by the output module 130, using one or more audio output devices, such as speaker 40, to output the acoustic signals corresponding to the speech data.
For example, the analysis module 120 computes a statistical estimate of the speech signal using the cross-correlations and auto-correlations of the signals from the four microphones 111a-111d, which contain both the desired speech and a corrupting noise signal, and the cross-correlations and auto-correlations of the interrupting noise signal alone. The output estimate for this simple case is then simply given by the known MVDR beamformer.
According to some embodiments, as illustrated in
According to some embodiments, the learning process may also include learning the timing performances of the noise and/or of the sound sources that should be enhanced. For example, static noise can be learned in terms of its frequencies and amplitudes, voice pitches, and the like, for improved enhancement and noise reduction. The system may also be configured for timing (synchronizing) the sensors' activation or performances according to the learned sound source and/or noise timing data.
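As a non-limiting sketch of such learning, a stationary noise spectrum can be accumulated by averaging short-time FFT magnitudes over frames classified as speech-idle; the frame length, hop size, and the crude energy-percentile idle detector below are illustrative assumptions only:

```python
import numpy as np

def learn_noise_spectrum(x, fs, frame_len=1024, hop=512, idle_percentile=20):
    """Estimate the magnitude spectrum of stationary background noise by
    averaging the FFTs of low-energy (presumed speech-idle) frames."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len, hop)]
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    energies = (spectra ** 2).sum(axis=1)
    # Crude idle detector: the lowest-energy frames are taken as noise-only.
    idle = energies <= np.percentile(energies, idle_percentile)
    noise_mag = spectra[idle].mean(axis=0)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    return freqs, noise_mag
```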
The performance of linear arrays with omnidirectional microphones is severely affected by the reduction of the total array size as in
According to some embodiments, inevitable differences between the directivities of the omnidirectional microphones of the array 110 may be used. A system comprising microphones that are generally regarded as “omnidirectional” is also within the scope of this invention.
The system can be designed according to the environment/space in which it is to be installed. For instance, if the system is to be used in a car, the microphones can be arranged according to the positioning (direction) of the driver (assumed to be the main speaker), the person seated next to the driver, and the reflecting surfaces in the vehicle. If the array is to be placed on a table, the microphones may cover the upward-facing half-sphere. The microphone array can be arranged to collect as much of the desired sources' sound as possible, considering the possible location(s) of the speaker(s) and the reverberating surfaces of the environment.
According to some embodiments, the signal data from the microphones 111a-111d can be channeled to the processor 150 through a multichannel analyzer device, such as a multiplexer device or any other device known in the art that can channel signals from multiple sensors or detectors to a processing means, either by combining the signals into a single signal or simply by channeling each sensor's data separately. One example of such a device is the STEVAL-MK1126Vx demonstration board by STMicroelectronics.
In the equations shown in
“t” indicates the time frame index; the frequency index is omitted for brevity. The remaining symbols are:
z(t)=[z_1(t), z_2(t), …, z_J(t)]^T — the J-channel input signal in time frame t
v(t)=[v_1(t), v_2(t), …, v_J(t)]^T — the noise signal
s(t) — the clean speech signal
ŝ(t) — the single-channel output signal
h=[h_1, h_2, …, h_J]^T — the acoustic transfer function
G — the J×J noise covariance matrix
H_active — the speech-active hypothesis
H_idle — the speech-inactive hypothesis
The statistical model is z(t)=h·s(t)+v(t), where s(t) is the desired speech signal, h is the acoustic system between the desired source and each of the acoustic sensors, and v(t) is the noise signal as picked up by the sensors. The algorithm is designed to estimate s(t) from the noisy measurements. The covariance matrix of v(t) is G.
The Processing Steps:
In the first step, a new measurement z(t) is received by the processing system for each frequency band. For each frequency band of each measurement:
(i) the source signal is estimated as the inner product between the input signal and the multi-channel filter, referred to hereinafter as the “Capon filter” (see the filter suggested by Capon, 1969), i.e.:
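In the notation defined above, the well-known form of this estimate is:

$$\hat{s}(t)=w^{H}z(t), \qquad w=\frac{G^{-1}h}{h^{H}G^{-1}h},$$

which minimizes the output noise power w^H G w subject to the distortionless constraint w^H h = 1.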
The Capon (1969) filter is designed to minimize the noise while preserving the desired signal (the speech signal in this case) without distortion.
(ii) identification of speech-related components in z(t): to estimate the acoustic system h and the covariance matrix G, it must be determined whether the speech signal s(t) is active or whether there is no speech activity within the respective time-frequency frame being analyzed. Accordingly, the acoustic system h and the matrix G are estimated using the active or idle hypothesis, respectively.
The above steps (i) and (ii) are repeated for each time frame and frequency band.
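A minimal per-band sketch of these two steps follows; the recursive-averaging updates, the smoothing factor alpha, and the reference-channel normalization used here to illustrate the estimation of h are illustrative assumptions and are not prescribed by the description above:

```python
import numpy as np

def capon_weights(G, h):
    """Capon/MVDR filter: minimize the noise power w^H G w subject to
    the distortionless constraint w^H h = 1."""
    G_inv_h = np.linalg.solve(G, h)
    return G_inv_h / np.real(np.conj(h) @ G_inv_h)

def process_band(z, G, h, speech_active, alpha=0.95):
    """One time frame of one frequency band.

    z: (J,) complex STFT values of the J input channels
    G: (J, J) running noise covariance estimate
    h: (J,) running acoustic transfer function estimate
    speech_active: decision between H_active and H_idle
    Returns (s_hat, G, h) with the updated statistics.
    """
    if speech_active:
        # H_active: refine h (illustrative choice: recursive average of
        # the input normalized by a reference channel).
        h = alpha * h + (1.0 - alpha) * z / (z[0] + 1e-12)
    else:
        # H_idle: only noise is present, so update the noise covariance G.
        G = alpha * G + (1.0 - alpha) * np.outer(z, np.conj(z))
    # Step (i): inner product of the Capon filter with the input signal.
    w = capon_weights(G, h)
    s_hat = np.conj(w) @ z
    return s_hat, G, h
```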
The output of the process illustrated in
In some embodiments of the invention, the system also uses one or more remote acoustic sensors such as remote microphones located remotely from the sensor array for improving system performances. For example, the one or more remote microphones can be located in proximity to one or more respective noise sources in the room.
The physical locations of the microphones, or of any other combination of sensors in the array, and optionally of one or more remote sensors if such are used, should capture as much information as possible indicative of the noise or signal sources. For example, it is possible to locate a single microphone, or any other type of sound-responsive sensor (e.g., an optical microphone, a MEMS (micro-electro-mechanical system) accelerometer, or another vibration sensor), such that one or more of the noise sources or signal sources reach it with high direct sound arrival. Direct arrival of sound that did not undergo reflection can yield better SNR. The sensors can therefore be arranged in a way that they face outwardly, for example on a sphere, a cube, or any other arbitrary shape of the holder thereof.
The spacing between the sensors in the array, determined by the dimensions and shape of the holder thereof, can be even or uneven and can vary depending on system requirements, which may depend, for instance, on the room size, the locations of the reverberating surfaces and of the one or more sources, and the like.
The holder may also be designed to allow changing the distances between the sensors in the array, for adjusting the array to the requirements of the system depending, for instance, on the location and number of reflecting surfaces in the room, noise source locations, speaker locations, etc.
In the case of one or more human speakers, each speaker can be either a man or a woman, and the noise sources can be either stationary or non-stationary, for example other speakers and/or constant stationary machine noise such as air-conditioning noise. In several cases, the proposed sensor array with four microphones could separate the desired speakers with a low level of residual noise. However, if eight microphones are used, the quality of voice separation between human speakers and the noise reduction of the interfering noise will improve considerably, to a level at which human listeners will be able to easily hold a conversation or operate voice recognition devices.
Although it is a generalization to say that more microphones are better, in a well-controlled environment in which the number of noise sources is known, it may suffice to have one or more microphones more than the number of noise/speech sources. So, for example, assuming a very well controlled environment with four signal sources, five microphones would achieve the best performance with the least number of microphones, with another microphone added for relaxing constraints and for optimization.
The sensor array can be held by one or more holders or holding devices allowing easy arrangement of the sensors and easy directivity adjustment. The holder may also improve directivity of the sensors array and/or sound separation by having acoustic isolating, acoustically reflecting and/or separating materials located between adjacent sensors such as sound reflecting and/or absorbing materials.
Reference is now made to
An additional or alternative way of achieving sensor separation is by using active noise cancelling. For example, consider an array of two microphones, each associated with a nearby loudspeaker, where each loudspeaker operates at opposite phase to its respective associated microphone. By destructive interference, the microphones will not “hear” the same sound.
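A toy numerical sketch of this idea is given below; the equal coupling gains and the perfect phase inversion are idealized assumptions, and a practical system would require adaptive gain and phase matching:

```python
import numpy as np

rng = np.random.default_rng(0)
local_1 = rng.standard_normal(1000)   # sound local to microphone 1
local_2 = rng.standard_normal(1000)   # sound local to microphone 2

# The loudspeaker bundled with microphone 1 emits its signal in anti-phase.
speaker_1 = -local_1

# Idealized, equal coupling gain from zone 1 to microphone 2 for both
# the local sound and its bundled loudspeaker.
leak = 0.3
mic_2 = local_2 + leak * (local_1 + speaker_1)  # cross-talk cancels out
print(np.allclose(mic_2, local_2))              # True: mic 2 does not "hear" zone 1
```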
Removing Ambient Direct Pressure Such as Wind Noise Direct Hit:
Wind noise can directly hit the microphone diaphragm and cause an overload of the circuits that cannot be removed digitally. Therefore, it may be beneficial to add a protective element, such as fur or a metal mesh, to break up the direct hit of wind on the sensors without affecting the desired sound. For example, it is also possible to design each sensor in the array in a way that the sensor is covered externally by a protective element. This will remove direct sound arrival and will therefore come at the expense of performance, but will improve the robustness of the sensor outdoors. Another option is acoustic pipes: these can physically protect the microphone openings, but at the expense of performance at higher frequencies, due to the dispersive nature of acoustic waveguides.
According to some embodiments, each microphone opening may have a shaped entrance. The shaped entrance may distort the frequency response of the input audio signal in a predicted or desired manner. For example, a cone-shaped entrance with a large enough diameter compared to the size of the microphone membrane will have a negligible effect, while a small-diameter entrance canal will introduce some distortion due to resonance at higher frequencies. While the diameter of the canal determines the magnitude of the effect, the resonance frequency is mainly determined by the length of the canal; for example, the first resonance peak frequency is given by f=c/4L.
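As a worked example of this quarter-wavelength relation, taking the speed of sound as c ≈ 343 m/s (an assumed nominal value), an entrance canal of length L = 10 mm has its first resonance at

$$f=\frac{c}{4L}=\frac{343\ \mathrm{m/s}}{4\times 0.01\ \mathrm{m}}\approx 8.6\ \mathrm{kHz},$$

well within the audio band, whereas a 2 mm canal resonates near 43 kHz, above it.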
According to some embodiments of the invention, the system may include and/or use one or more devices or algorithms for sampling the sensors of the sensor array and for synchronizing these sensors. This may be used for compensating and/or calibrating the sensors' operation. A single clock line may be used for all microphones, in such a way that the clock signal reaches all the microphones at the same time. Another possibility is to perform a preliminary calibration process in which the time delays between the sensors are measured, and the measurements are then used for compensation in the analysis stage.
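One possible sketch of such a preliminary calibration, assuming a shared test signal is played and recorded by the sensors, estimates each inter-sensor delay from the peak of a cross-correlation:

```python
import numpy as np

def estimate_delay(ref, other, fs):
    """Estimate the delay (in seconds) of `other` relative to `ref`
    from the peak of their full cross-correlation."""
    corr = np.correlate(other, ref, mode="full")
    lag = np.argmax(corr) - (len(ref) - 1)  # lag in samples
    return lag / fs

# Example: a test chirp recorded by two sensors, one delayed by 5 samples.
fs = 48_000
t = np.arange(2048) / fs
chirp = np.sin(2 * np.pi * (500 + 2e5 * t) * t)
delayed = np.concatenate([np.zeros(5), chirp])[:len(chirp)]
print(estimate_delay(chirp, delayed, fs) * fs)  # ~5 samples
```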
Using Buried Microphones:
The microphones are typically positioned in a way that they face outwardly towards the room. However, it is possible to cover the microphones in a material that causes multiple reflections, such that the reflections produce different responses for different directions of arrival from the room. The material (or mesh) creates a mix of sound impinging from a larger portion of space than the sensor would normally sample, so the benefit is that instead of the microphones sampling a few points in space, they sample a larger volume of space. The mesh can be made from heavy and/or high-impedance materials. The small parts of the mesh can be larger than the acoustic wavelength, and in some embodiments smaller than the acoustic wavelength.
Reference is now made to
Many alterations and modifications may be made by those having ordinary skill in the art without departing from the spirit and scope of the invention. Therefore, it must be understood that the illustrated embodiment has been set forth only for the purposes of example and that it should not be taken as limiting the invention as defined by the following claims and their various embodiments. For example, notwithstanding the fact that the elements of a claim are set forth below in a certain combination, it must be expressly understood that the invention includes other combinations of fewer, more or different elements, which are disclosed above even when not initially claimed in such combinations. A teaching that two elements are combined in a claimed combination is further to be understood as also allowing for a claimed combination in which the two elements are not combined with each other, but may be used alone or combined in other combinations. The excision of any disclosed element of the invention is explicitly contemplated as within the scope of the invention.
The words used in this specification to describe the invention and its various embodiments are to be understood not only in the sense of their commonly defined meanings, but to include by special definition in this specification structure, material or acts beyond the scope of the commonly defined meanings. Thus if an element can be understood in the context of this specification as including more than one meaning, then its use in a claim must be understood as being generic to all possible meanings supported by the specification and by the word itself.
The definitions of the words or elements of the following claims are, therefore, defined in this specification to include not only the combination of elements which are literally set forth, but all equivalent structure, material or acts for performing substantially the same function in substantially the same way to obtain substantially the same result. In this sense it is therefore contemplated that an equivalent substitution of two or more elements may be made for any one of the elements in the claims below or that a single element may be substituted for two or more elements in a claim. Although elements may be described above as acting in certain combinations and even initially claimed as such, it is to be expressly understood that one or more elements from a claimed combination can in some cases be excised from the combination and that the claimed combination may be directed to a sub-combination or variation of a sub-combination.
Insubstantial changes from the claimed subject matter as viewed by a person with ordinary skill in the art, now known or later devised, are expressly contemplated as being equivalently within the scope of the claims. Therefore, obvious substitutions now or later known to one with ordinary skill in the art are defined to be within the scope of the defined elements.
The claims are thus to be understood to include what is specifically illustrated and described above, what is conceptually equivalent, what can be obviously substituted and also what essentially incorporates the essential idea of the invention.
Although the invention has been described in detail, nevertheless changes and modifications, which do not depart from the teachings of the present invention, will be evident to those skilled in the art. Such changes and modifications are deemed to come within the purview of the present invention and the appended claims.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/IL16/50475 | 5/5/2016 | WO | 00

Number | Date | Country
---|---|---
62157608 | May 2015 | US