This disclosure relates to a beamformer and, in particular, to a beamformer that includes multiple layers of mini-length minimum variance distortionless response (MVDR) beamformers used to estimate a sound source and reduce noise contained in signals received by a sensor array.
Each sensor in a sensor array may receive a copy of a signal emitted from a source. The sensor can be a suitable type of sensor such as, for example, a microphone sensor to capture sound. For example, each microphone sensor in a microphone array may receive a respective version of a sound signal emitted from a sound source at a distance from the microphone array. The microphone array may include a number of geometrically arranged microphone sensors for receiving the sound signals (e.g., speech signals) and converting the sound signals into electronic signals. The electronic signals may be converted using analog-to-digital converters (ADCs) into digital signals which may be further processed by a processing device (e.g., a digital signal processor). Compared with a single microphone, the sound signals received at microphone arrays include redundancy that may be exploited to calculate an estimate of the sound source to achieve noise reduction/speech enhancement, sound source separation, de-reverberation, spatial sound recording, and source localization and tracking. The processed digital signals may be packaged for transmission over communication channels or converted back to analog signals using a digital-to-analog converter (DAC).
The microphone array can be coupled to a beamformer, or directional sound signal receptor, which is configured to calculate the estimate of the sound source. The sound signal received at any microphone of the microphone array may include a noise component and a delayed component with respect to the sound signal received at a reference microphone sensor (e.g., a first microphone sensor in a microphone array). A beamformer is a spatial filter that uses the multiple copies of the sound signal received at the microphone array to identify the sound source according to certain optimization rules.
A minimum variance distortionless response (MVDR) beamformer is a type of beamformer that is obtained by minimizing the variance (or power) of the noise at the beamformer output while ensuring a distortionless response of the beamformer towards the direction of the desired source. The MVDR beamformer is commonly used in the context of noise reduction and speech enhancement using microphone arrays.
The present disclosure is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
An MVDR beamformer may receive a number of input signals and calculate, based on the input signals, an estimate of either the source signal or the source signal as received at a reference microphone. The number of inputs is referred to as the length of the MVDR beamformer. Thus, when the number of inputs for the MVDR beamformer is large, the MVDR beamformer is long.
Implementations of long MVDR beamformers commonly require inversion of a large, ill-conditioned noise correlation matrix. Because the noise correlation matrix is ill-conditioned, this inversion causes long MVDR beamformers to amplify white noise, particularly at low frequencies. Further, inverting a large matrix is computationally expensive. This is especially true when the beamformer operates over a wide frequency spectrum, because the matrix inversion needs to be performed for each of the multiple frequency sub-bands. Therefore, there is a need for a beamformer that can achieve results similar to those of a conventional MVDR beamformer, but is less sensitive to white noise amplification and requires less computation than the conventional MVDR beamformer.
Implementations of the present disclosure relate to a multistage MVDR beamformer including multiple layers of mini-length MVDR beamformers. The lengths of the mini-length MVDR beamformers are smaller or substantially smaller than the total number of inputs for the multistage MVDR beamformer (or correspondingly, the total number of microphone sensors in a microphone array). Each layer of the multistage MVDR beamformer includes one or more mini-length (e.g., length-2 or length-3) MVDR beamformers, and each mini-length MVDR beamformer is configured to calculate an MVDR estimate output for a subset of the input signals of the layer. The calculation of the multistage MVDR beamformer is carried out in cascaded layers progressively from a first layer to a last layer, wherein the first layer may receive the input signals from microphone sensors of the microphone array and produce a set of MVDR estimates as input signals to a second layer. Because each mini-length MVDR beamformer produces one MVDR estimate for a subset of input signals, the number of input signals to the second layer is smaller than that of the first layer. Thus, the second layer includes fewer MVDR beamformers than the first layer. The second layer may similarly produce a further set of MVDR estimates of its input signals to be used as input signals to a subsequent third layer. Likewise, the third layer includes fewer MVDR beamformers than the second layer. This multistage MVDR beamforming propagates through these layers of mini-length MVDR beamformers until the last layer, which includes one MVDR beamformer that produces the MVDR estimate for the multistage MVDR beamformer.
In one implementation, a microphone array may include M microphones (M>3) that provide M input signals to a multistage MVDR beamformer including multiple layers of length-2 MVDR beamformers. The first layer of the multistage MVDR beamformer may include M/2 length-2 MVDR beamformers. Each of the length-2 MVDR beamformers may be configured to receive two input signals captured at two microphones and calculate an MVDR estimate for the two input signals. The MVDR estimates from the M/2 length-2 MVDR beamformers are provided as M/2 input signals to a second layer.
The second layer may similarly include M/4 length-2 MVDR beamformers. Each of the length-2 MVDR beamformers of the second layer may receive two input signals from the first layer and calculate an MVDR estimate for the two input signals. The second layer may thus generate M/4 MVDR estimates which may be provided to a next layer of length-2 MVDR beamformers.
This process of length-2 MVDR beamforming may be performed repeatedly in stages through layers of length-2 MVDR beamformers until the calculation of the multistage MVDR beamformer reaches an Nth layer that includes only one length-2 MVDR beamformer receiving two input signals from the (N-1)th layer and calculating an MVDR estimate for the two input signals received from the (N-1)th layer. In one implementation, the length-2 MVDR estimate of the Nth layer is the result of the multistage MVDR beamformer. Because the multistage MVDR beamformer only needs to perform the calculation of length-2 MVDR beamformers, including the inversion of two-by-two noise correlation matrices, the need to invert large, ill-conditioned noise correlation matrices is eliminated, thereby mitigating the white noise amplification problem associated with a single-stage long MVDR beamformer for a large microphone array. Further, the multistage MVDR beamformer is computationally more efficient than a single-stage MVDR beamformer with a large number (M) of microphone sensors (e.g., when M is greater than or equal to eight). Further, because of the lower computational requirement, the multistage MVDR beamformers may be implemented on less sophisticated (or cheaper) hardware processing devices than single-stage long MVDR beamformers while achieving similar noise reduction performance.
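By way of illustration only, the layer structure described above for length-2 MVDR beamformers may be sketched as follows. The function name `layer_sizes` is hypothetical, and the sketch assumes that M is a power of two so that every layer pairs its inputs exactly; the disclosure itself only requires M>3.

```python
def layer_sizes(num_mics):
    """Return the number of length-2 MVDR beamformers in each layer.

    Assumption of this sketch: num_mics is a power of two and at least 4,
    so each layer can pair up all of its inputs.
    """
    if num_mics < 4 or num_mics & (num_mics - 1):
        raise ValueError("this sketch expects a power-of-two M >= 4")
    sizes = []
    inputs = num_mics
    while inputs > 1:
        inputs //= 2          # each length-2 beamformer turns 2 inputs into 1 estimate
        sizes.append(inputs)
    return sizes


sizes = layer_sizes(8)
print(sizes)       # beamformers per layer, first layer to last: [4, 2, 1]
print(sum(sizes))  # total number of two-by-two inversions per sub-band: 7
```

Under this assumption the cascade uses M-1 length-2 beamformers in total, each requiring only a two-by-two inversion, in place of a single M-by-M inversion.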
Implementations of the present disclosure may relate to a method including receiving, by a processing device, a plurality of sound signals captured at a plurality of microphone sensors, wherein the plurality of sound signals are from a sound source, and wherein a number (M) of the plurality of microphone sensors is greater than three, determining a number (K) of layers for a multistage minimum variance distortionless response (MVDR) beamformer based on the number (M) of the plurality of microphone sensors, wherein the number (K) of layers is greater than one, and wherein each layer of the multistage MVDR beamformer comprises one or more mini-length MVDR beamformers, and executing the multistage MVDR beamformer to the plurality of sound signals to calculate an estimate of the sound source.
Implementations of the present disclosure may include a system including a memory and a processing device, operatively coupled to the memory, the processing device to receive a plurality of sound signals captured at a plurality of microphone sensors, wherein the plurality of sound signals are from a sound source, and wherein a number (M) of the plurality of microphone sensors is greater than three, determine a number (K) of layers for a multistage minimum variance distortionless response (MVDR) beamformer based on the number (M) of the plurality of microphone sensors, wherein the number (K) of layers is greater than one, and wherein each layer of the multistage MVDR beamformer comprises one or more mini-length MVDR beamformers, and execute the multistage MVDR beamformer to the plurality of sound signals to calculate an estimate of the sound source.
The microphone sensors on the microphone array 102 may convert a1(t), a2(t), . . . , aM(t) into electronic signals ea1(t), ea2(t), . . . , eaM(t) that may be fed into the ADC 104. In one implementation, the ADC 104 may be configured to convert the electronic signals into digital signals y1(t), y2(t), . . . , yM(t) by quantization.
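By way of illustration only, the quantization performed by the ADC 104 may be sketched as a uniform quantizer. The word length (`bits`), the full-scale value, and the function names are assumptions of this sketch and are not specified in the disclosure.

```python
def quantize(sample, bits=16, full_scale=1.0):
    """Map one analog sample in [-full_scale, full_scale) to a signed integer code."""
    levels = 2 ** (bits - 1)
    step = full_scale / levels                   # quantization step size
    code = round(sample / step)
    return max(-levels, min(levels - 1, code))   # clip to the representable range


def dequantize(code, bits=16, full_scale=1.0):
    """Map an integer code back to the amplitude it represents."""
    return code * full_scale / 2 ** (bits - 1)


# Hypothetical samples of one electronic signal ea1(t) and its digital version y1(t).
ea1 = [0.0, 0.25, -0.5, 0.3]
y1 = [quantize(v) for v in ea1]
```

For in-range samples the reconstruction error of such a quantizer is at most half of one step.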
In one implementation, the processing device 106 may include an input interface (not shown) to receive the digital signals, and as shown in
In one implementation, the pre-processing module 108 may be configured to perform a short-time Fourier transform (STFT) on the input y1(t), y2(t), . . . , yM(t) and generate the frequency domain representations Y1(ω), Y2(ω), . . . , YM(ω), wherein ω (ω=2πf) represents the angular frequency. In one implementation, the multistage MVDR beamformer 110 may be configured to receive the frequency representations Y1(ω), Y2(ω), . . . , YM(ω) of the input signals and calculate an estimate Z(ω) in the frequency domain of the sound source (s(t)) based on the received Y1(ω), Y2(ω), . . . , YM(ω). In one implementation, the frequency domain may be divided into a number (L) of frequency sub-bands, and the multistage MVDR beamformer 110 may calculate the estimate Z(ω) for each of the frequency sub-bands.
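By way of illustration only, the STFT performed by the pre-processing module may be sketched as follows for one input signal. The frame length, hop size, and Hann window are assumptions of this sketch, and the direct DFT sum is used only to keep the example self-contained; a practical implementation would typically use an FFT routine.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Return a list of frames, each a list of frame_len // 2 + 1 complex bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]            # periodic Hann window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
                for k in range(frame_len // 2 + 1)]  # non-negative frequencies only
        frames.append(bins)
    return frames


# Hypothetical digital signal y1(t): a cosine centered on frequency bin 2.
y1 = [math.cos(2 * math.pi * 2 * n / 8) for n in range(32)]
Y1 = stft(y1)
```

Here each of the `frame_len // 2 + 1` bins plays the role of one frequency sub-band, and the beamformer would be evaluated independently in each of them.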
The processing device 106 may be configured with a post-processing module 112 that may convert the estimate Z(ω) for each of the frequency sub-bands back into the time domain to provide the estimated sound source (X1(t)). The estimated sound source (X1(t)) may be determined with respect to the source signal received at a reference microphone sensor (e.g., m1).
Instead of using a single-stage long MVDR beamformer to estimate the sound signal (s(t)), implementations of the present disclosure provide a multistage MVDR beamformer that includes one or more layers of mini-length MVDR beamformers that together may provide an estimate having a distortionless character substantially similar to that of the single-stage MVDR beamformer, but with less noise amplification and more efficient computation.
Referring to
Instead of performing a length-4 MVDR beamformer for all input ym(t), m=1, . . . , 4, the multistage MVDR beamformer 200 as shown in
The mini-length MVDR beamformer may be any suitable type of MVDR beamformer. In one implementation, the mini-length MVDR beamformer may apply complex weights Hi*(f), i=1, . . . , M′, to the input signals received by the mini-length MVDR beamformer and calculate a weighted sum, wherein f=ω/2π is the frequency, the superscript * is the complex conjugate operator, M′ is the length of the mini-length MVDR beamformer, and Σ is the sum operator. For the length-2 MVDR beamformers 202A, 202B, 206 as shown in
wherein $\mathbf{h}_{\mathrm{MVDR}}$ is a vector including elements of the MVDR weights and is defined as $\mathbf{h}_{\mathrm{MVDR}} = [H_1(f), H_2(f), \ldots, H_{M'}(f)]^T$, $\Phi_v(f) = E[\mathbf{v}(f)\mathbf{v}^H(f)]$ and $\Phi_y(f) = E[\mathbf{y}(f)\mathbf{y}^H(f)]$ are the correlation matrices of the noise vector $\mathbf{v}(f) = [V_1(f), V_2(f), \ldots, V_{M'}(f)]^T$ and the noisy signal vector $\mathbf{y}(f) = [Y_1(f), Y_2(f), \ldots, Y_{M'}(f)]^T$, $\mathbf{d}(f) = [1, e^{-j2\pi f\tau_0\cos(\theta_d)}, \ldots, e^{-j2\pi(M'-1)f\tau_0\cos(\theta_d)}]^T$ is the steering vector where $\tau_0$ is the delay between two adjacent microphone sensors at an incident angle $\theta_d = 0°$, superscript T represents the transpose operator, superscript H represents the conjugate-transpose operator, tr[.] represents the trace operator, $\mathbf{I}_{M'}$ is the identity matrix of size M′, and $\mathbf{i}_{M'}$ is the first column of $\mathbf{I}_{M'}$. As shown in Equation (1), the calculation of MVDR includes inversion of a noise correlation matrix $\Phi_v$ of size M′. Because the mini length (M′) is smaller than the total number (M) of microphone sensors, the inversion is easier to calculate and the noise amplification may be mitigated in the multistage MVDR beamformers. The noise correlation matrix $\Phi_v(f)$ may be calculated using a noise estimator during a training process when the sound source is absent. Alternatively, the noise correlation matrix may be calculated online. For example, when the sound source is a human speaker, the noise correlation matrix may be calculated when the speaker pauses. The steering vector $\mathbf{d}(f)$ may be derived from a given incident angle (or look direction) of the sound source with respect to the microphone array. Alternatively, the steering vector may be calculated using a direction-of-arrival estimator to estimate the delays.
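By way of illustration only, a length-2 MVDR beamformer may be sketched as follows. Because Equation (1) is not reproduced in this text, the sketch assumes the widely used closed form h = Φv⁻¹d / (dᴴΦv⁻¹d), which is consistent with the symbols defined above; the function names and the numerical values are hypothetical.

```python
import cmath
import math


def steering_vector(f, tau0, theta_d, m_prime=2):
    """d(f) = [1, exp(-j 2 pi f tau0 cos(theta_d)), ...]^T for m_prime sensors."""
    return [cmath.exp(-2j * math.pi * f * k * tau0 * math.cos(theta_d))
            for k in range(m_prime)]


def mvdr2_weights(phi_v, d):
    """Length-2 MVDR weights h = inv(phi_v) d / (d^H inv(phi_v) d)."""
    (a, b), (c, e) = phi_v
    det = a * e - b * c                       # determinant of the 2x2 matrix
    u0 = (e * d[0] - b * d[1]) / det          # first entry of inv(phi_v) d
    u1 = (-c * d[0] + a * d[1]) / det         # second entry of inv(phi_v) d
    denom = d[0].conjugate() * u0 + d[1].conjugate() * u1
    return [u0 / denom, u1 / denom]


def beamform(h, y):
    """Weighted sum Z(f) = sum_i H_i^*(f) Y_i(f)."""
    return sum(hi.conjugate() * yi for hi, yi in zip(h, y))


# Hypothetical Hermitian noise correlation matrix and look direction.
phi_v = [[1.0, 0.3 + 0.1j], [0.3 - 0.1j, 2.0]]
d = steering_vector(f=1000.0, tau0=1e-4, theta_d=0.0)
h = mvdr2_weights(phi_v, d)
```

With these weights a noise-free signal arriving from the look direction passes through unchanged, which is the distortionless property; only the two-by-two matrix is ever inverted.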
The multistage MVDR beamformer may include multiple cascaded layers of different combinations of mini-length MVDR (M-MVDR).
In some implementations, a multistage MVDR beamformer may include layers of mini-length MVDR beamformers of different lengths. For example, the multistage MVDR beamformer may include one or more layers of 2-MVDR beamformers and one or more layers of 3-MVDR beamformers. In one implementation, the layers of longer mini-length MVDR beamformers may be used in earlier stages to rapidly reduce the number of inputs to a smaller number of estimates, and the layers of shorter mini-length MVDR beamformers may be used in later stages to generate finer estimates.
In some implementations, a multistage MVDR beamformer may include one or more layers that each include mini-length MVDR beamformers of different lengths. For example, the multistage MVDR beamformer may process a first section of the noisy input (e.g., from microphone sensors on the edge sections of the microphone array) using 2-MVDR beamformers and a second section of the noisy input (e.g., from microphone sensors in the middle section of the microphone array) using 3-MVDR beamformers. This type of multistage beamformer may provide different treatments for different sections of microphone sensors based on the locations of the microphone sensors in a microphone array rather than using a uniform treatment for all microphone sensors.
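By way of illustration only, the layer combinations described above may be written as a plan in which each layer lists the lengths (M′) of its mini-length MVDR beamformers. The function `plan_outputs` and the example plans are hypothetical; the sketch only checks that a plan is consistent and reports how many estimates each layer produces.

```python
def plan_outputs(num_inputs, layers):
    """Check a layer plan and return the number of estimates after each layer.

    `layers` is a list of layers; each layer is a list of mini-length
    beamformer lengths (M') applied to consecutive groups of its inputs.
    """
    counts = []
    for lengths in layers:
        if sum(lengths) != num_inputs:
            raise ValueError("group lengths must cover all inputs of the layer")
        num_inputs = len(lengths)   # one MVDR estimate per mini-length beamformer
        counts.append(num_inputs)
    if num_inputs != 1:
        raise ValueError("the last layer must contain exactly one beamformer")
    return counts


# 12 microphones: one layer of 3-MVDR beamformers, then two layers of 2-MVDR.
print(plan_outputs(12, [[3, 3, 3, 3], [2, 2], [2]]))   # [4, 2, 1]
# 10 microphones: 2-MVDR on the edges and 3-MVDR in the middle of the first layer.
print(plan_outputs(10, [[2, 3, 3, 2], [2, 2], [2]]))   # [4, 2, 1]
```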
For simplicity of explanation, methods are depicted and described as a series of acts. However, acts in accordance with this disclosure can occur in various orders and/or concurrently, and with other acts not presented and described herein. Furthermore, not all illustrated acts may be required to implement the methods in accordance with the disclosed subject matter. In addition, the methods could alternatively be represented as a series of interrelated states via a state diagram or events. Additionally, it should be appreciated that the methods disclosed in this specification are capable of being stored on an article of manufacture to facilitate transporting and transferring such methods to computing devices. The term article of manufacture, as used herein, is intended to encompass a computer program accessible from any computer-readable device or storage media. In one implementation, the methods may be performed by the multistage MVDR beamformer 110 executed on the processing device 106 as shown in
Referring to
At 404, the processing device may receive the sound signals from the microphone sensors. In one implementation, a microphone array may include M (M>3) microphone sensors, and the processing device is configured to receive the M sound signals from the microphone sensors.
At 406, the processing device may determine a number of layers of a multistage MVDR beamformer that is to be used to estimate the sound source. The multistage MVDR beamformer may be used for noise reduction and produce a cleaned version of the sound source (e.g., speech). In one implementation, the multistage MVDR beamformer may be constructed to include K (K>1) layers, and each layer may include one or more mini-length MVDR beamformers, wherein the lengths (M′) of the mini-length MVDR beamformers are smaller than the number (M) of microphone sensors.
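By way of illustration only, one way to determine the number (K) of layers from the number (M) of microphone sensors is sketched below for a uniform mini length M′. The function name is hypothetical, and the treatment of leftover inputs (grouping them into a shorter final group via ceiling division) is an assumption of this sketch rather than something the disclosure specifies.

```python
def number_of_layers(num_mics, mini_length=2):
    """Count the layers needed to reduce num_mics inputs to a single estimate."""
    if num_mics <= 3:
        raise ValueError("the disclosure assumes more than three microphone sensors")
    if mini_length < 2 or mini_length >= num_mics:
        raise ValueError("mini length must be at least 2 and smaller than M")
    layers = 0
    inputs = num_mics
    while inputs > 1:
        # Ceiling division: leftover inputs form one shorter group (assumption).
        inputs = -(-inputs // mini_length)
        layers += 1
    return layers


print(number_of_layers(8))       # 3 layers of length-2 beamformers
print(number_of_layers(9, 3))    # 2 layers of length-3 beamformers
```

Because the mini length is smaller than M, the count returned is always greater than one, matching the K>1 condition stated above.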
At 408, the processing device may execute the multistage MVDR beamformer to calculate an estimate for the sound source. In one implementation, the K layers of the multistage MVDR beamformer may be cascaded from a first layer to the Kth layer with progressively decreasing numbers of mini-length MVDR beamformers from the first layer to the Kth layer. In one implementation, the first layer may include M/2 length-2 MVDR beamformers, and each of the length-2 MVDR beamformers of the first layer may be configured to receive two sound signals and calculate a length-2 MVDR estimate for the two sound signals. Thus, the first layer may produce M/2 estimates which may be fed into a second layer. Similarly, the second layer may include M/4 length-2 MVDR beamformers that generate M/4 estimates. This estimation process may be repeated until it reaches the Kth layer, which may include one length-2 MVDR beamformer that calculates an estimate of the sound source for the M sound signals received from the M microphone sensors.
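By way of illustration only, the cascaded execution at 408 may be sketched for a single frequency sub-band as follows. The sketch makes several assumptions that are not taken from the disclosure: the channels are already time-aligned toward the source so that every length-2 steering vector is [1, 1], M is a power of two, and the noise correlation matrices are sample estimates from noise-only frames that are passed through the same weights from layer to layer. All function names and the synthetic data are hypothetical.

```python
import random


def mvdr2(phi, d=(1.0, 1.0)):
    """Length-2 MVDR weights inv(phi) d / (d^H inv(phi) d)."""
    (a, b), (c, e) = phi
    det = a * e - b * c
    u0 = (e * d[0] - b * d[1]) / det
    u1 = (-c * d[0] + a * d[1]) / det
    denom = d[0].conjugate() * u0 + d[1].conjugate() * u1
    return u0 / denom, u1 / denom


def noise_correlation(v0, v1):
    """Sample estimate of the 2x2 matrix E[v v^H] from noise-only frames."""
    n = len(v0)

    def corr(p, q):
        return sum(x * y.conjugate() for x, y in zip(p, q)) / n

    return [[corr(v0, v0), corr(v0, v1)], [corr(v1, v0), corr(v1, v1)]]


def multistage_mvdr(noisy, noise_only):
    """noisy[m] and noise_only[m] are lists of complex frames for channel m."""
    if len(noisy) & (len(noisy) - 1):
        raise ValueError("this sketch expects a power-of-two channel count")
    while len(noisy) > 1:
        next_noisy, next_noise = [], []
        for i in range(0, len(noisy), 2):
            h0, h1 = mvdr2(noise_correlation(noise_only[i], noise_only[i + 1]))
            w0, w1 = h0.conjugate(), h1.conjugate()
            # Apply the same weights to the noisy frames and the noise-only frames.
            next_noisy.append([w0 * x + w1 * y
                               for x, y in zip(noisy[i], noisy[i + 1])])
            next_noise.append([w0 * x + w1 * y
                               for x, y in zip(noise_only[i], noise_only[i + 1])])
        noisy, noise_only = next_noisy, next_noise
    return noisy[0]


# Synthetic example: M = 4 channels, each carrying the source plus independent noise.
random.seed(0)
M, num_frames = 4, 200
source = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(num_frames)]
noise = [[complex(random.gauss(0, 0.5), random.gauss(0, 0.5))
          for _ in range(num_frames)] for _ in range(M)]
noisy = [[s + v for s, v in zip(source, channel)] for channel in noise]
estimate = multistage_mvdr(noisy, noise)
```

In this toy setting the residual noise in `estimate` is no stronger than the noise in any single input channel, and a noise-free input is returned unchanged; a fuller implementation would also supply per-layer steering vectors and noise estimates obtained as described elsewhere in this disclosure.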
The exemplary computer system 500 includes a processing device (processor) 502, a main memory 504 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 506 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 518, which communicate with each other via a bus 508.
Processor 502 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processor 502 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processor 502 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processor 502 is configured to execute instructions 526 for performing the operations and steps discussed herein.
The computer system 500 may further include a network interface device 522. The computer system 500 also may include a video display unit 510 (e.g., a liquid crystal display (LCD), a cathode ray tube (CRT), or a touch screen), an alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse), and a signal generation device 520 (e.g., a speaker).
The data storage device 518 may include a computer-readable storage medium 524 on which is stored one or more sets of instructions 526 (e.g., software) embodying any one or more of the methodologies or functions described herein (e.g., processing device 106). The instructions 526 may also reside, completely or at least partially, within the main memory 504 and/or within the processor 502 during execution thereof by the computer system 500, the main memory 504 and the processor 502 also constituting computer-readable storage media. The instructions 526 may further be transmitted or received over a network 574 via the network interface device 522.
While the computer-readable storage medium 524 is shown in an exemplary implementation to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
In the foregoing description, numerous details are set forth. It will be apparent, however, to one of ordinary skill in the art having the benefit of this disclosure, that the present disclosure may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present disclosure.
Some portions of the detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “segmenting”, “analyzing”, “determining”, “enabling”, “identifying,” “modifying” or the like, refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions.
The words “example” or “exemplary” are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Moreover, use of the term “an embodiment” or “one embodiment” or “an implementation” or “one implementation” throughout is not intended to mean the same embodiment or implementation unless described as such.
Reference throughout this specification to “one implementation” or “an implementation” means that a particular feature, structure, or characteristic described in connection with the implementation is included in at least one implementation. Thus, the appearances of the phrase “in one implementation” or “in an implementation” in various places throughout this specification are not necessarily all referring to the same implementation.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementations will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
This application claims the benefit of U.S. provisional application Ser. No. 62/136,037 filed on Mar. 20, 2015, the content of which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
62136037 | Mar 2015 | US