The present disclosure relates to audio processing of soundfields and sub-soundfields.
A “near-end” video conference endpoint captures video of and audio from participants in a room during a conference, for example, and then transmits the captured video and audio to “far-end” video conference endpoints. During the conference, reproduced voice conversations should sound natural and clear to the participants, as if the far-end and near-end participants were in the same room. Participants usually occupy random positions in the room, and it is common practice to place/distribute a number of microphones on a table, on walls, and/or in a ceiling of the room. Typically, a conference sound mixer is used to mix microphone channels from the microphones with the highest sound levels, the highest signal-to-noise ratio (SNR), or the highest direct-sound-to-reverberation ratio (DRR), in an attempt to detect participant voices with good sound quality. Use of such distributed microphones has drawbacks. For example, from an aesthetic perspective, the distributed microphones add room clutter. Also, installing, configuring, and maintaining the distributed microphones (and mixers) can be time-consuming and expensive. In addition, the audio signals captured at the spatially distributed microphones may be highly coherent with different and random phase delays such that, when mixed together, the resultant signal may be distorted due to a comb filtering effect.
Overview
At a microphone array in a conference endpoint, a soundfield is detected to produce a set of microphone signals each from a corresponding microphone of the microphone array. The set of microphone signals represents the soundfield. The detected soundfield is decomposed into a set of sub-soundfield signals based on the set of microphone signals. Each sub-soundfield signal is processed, including being dereverberated to remove reverberation therefrom, to produce a set of processed sub-soundfield signals. The set of processed sub-soundfield signals is mixed into a mixed output signal.
Example Embodiments
Embodiments presented herein integrate a microphone array into a video conference endpoint as a replacement for a conventional collection of table, wall, and ceiling microphones. While the integrated microphone array simplifies the physical microphone arrangement, a soundfield detected by the microphone array is susceptible to undesired interference, including room noise, reflections, and reverberation, which can result in a distorted, reverberant, and hollow sound quality. Accordingly, at a high-level, the embodiments employ microphone array-based soundfield decomposition to decompose the detected soundfield into multiple sub-soundfields, multi-channel dereverberation to separately reduce reverberation of each sub-soundfield, and associated audio mixing of the dereverberated sub-soundfields into a mixed audio signal. These operations effectively extend an audio pickup range of the microphone array, capture desired speech signals more distinctly, and filter noise, room reflections, and reverberation, with reduced comb-filtering effects. One reason for these improvements is that, after the soundfield decomposition and dereverberation, levels of interference and reverberation in any given sub-soundfield are lower than those of the entire detected soundfield and may be reduced on a per sub-soundfield basis. Another reason is that the phase/group delays between different sub-soundfields are known, approximately fixed, and may be pre-compensated.
With reference to
Endpoint 104 may include a video camera (VC) 112, a video display 114, a loudspeaker (LDSPKR) 116, and a microphone array (MA) 118, which may include a two-dimensional array of microphones as depicted in
According to embodiments presented herein, at a high-level, a soundfield in room 105 may include desired sound, such as speech from participant 106. The soundfield may also include undesired sound, such as reverberation, echo, and other audio noise. Microphone array 118 detects the soundfield to produce a set of microphone signals (also referred to as “sound signals”). Endpoint 104 converts the set of microphone signals representative of the detected soundfield into a set of sub-soundfields. Endpoint 104 processes each sub-soundfield separately/individually to suppress reverberation, suppress echo, and reduce noise therein, to produce a set of processed sub-soundfields each corresponding to a respective one of the sub-soundfields. Endpoint 104 audio mixes the set of processed sub-soundfields into a mixed audio signal, which may be encoded and transmitted over a network.
Reference is now made to
Processor 244 may include a collection of microcontrollers and/or microprocessors, for example, each configured to execute respective software instructions stored in the memory 248. The collection of microcontrollers may include, for example: a video controller to receive, send, and process video signals related to display 114 and video camera 112; an audio processor to receive, send, and process audio signals related to loudspeaker 116 and MA 118; and a high-level controller to provide overall control. Portions of memory 248 (and the instructions therein) may be integrated with processor 244. In a transmit direction, processor 244 processes audio/video captured by MA 118/VC 112, encodes the captured audio/video into data packets, and causes the encoded data packets to be transmitted to communication network 110. In a receive direction, processor 244 decodes audio/video from data packets received from communication network 110 and causes the audio/video to be presented to local participant 106 via loudspeaker 116/display 114. As used herein, the terms “audio” and “sound” are synonymous and used interchangeably.
The memory 248 may comprise read only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physical/tangible (e.g., non-transitory) memory storage devices. Thus, in general, the memory 248 may comprise one or more computer readable storage media (e.g., a memory device) encoded with software comprising computer executable instructions that, when executed (by the processor 244), are operable to perform the operations described herein. For example, the memory 248 stores or is encoded with instructions for control logic 250 to perform operations described herein.
Control logic 250 may include a soundfield processor 252 to convert a detected soundfield into sub-soundfields, a sub-soundfield processor 254 to process each of the sub-soundfields separately to produce processed sub-soundfields, and an audio mixer 256 to audio mix/combine the processed sub-soundfields into a mixed audio output. In an embodiment, audio mixer 256 (also referred to simply as “mixer” 256) is an auto-mixer, but the mixer need not be an auto-mixer in other embodiments. In addition, memory 248 stores data 280 used and generated by modules 250-256.
With reference to
Microphones 302(1)-302(M) of microphone array 118 concurrently detect a soundfield in room 105, to produce a parallel (i.e., concurrent) set of microphone signals 304(1)-304(M) (i.e., sound signals 304(1)-304(M)) each from a corresponding one of the microphones in the microphone array. The set of microphone signals 304(1)-304(M) represents the detected soundfield. The detected soundfield represents sound, with all of its acoustical characteristics, propagating in room 105 and impinging on microphone array 118.
Soundfield processor 252 decomposes or transforms the set of microphone signals 304(1)-304(M) representative of the detected soundfield into a parallel set of sub-soundfield signals 306(1)-306(N), where N may be equal to or different from M. The terms “sub-soundfield” and “sub-soundfield signal” are synonymous and used interchangeably. In a frequency domain embodiment of soundfield decomposition, soundfield processor 252 transforms each of microphone signals 304(1)-304(M) from the time domain into the frequency domain using a Fourier transform. Thus, given M microphone signals, soundfield processor 252 computes M Fourier transforms, each having F frequency bins. In the frequency domain, for a given frequency f (i.e., frequency bin) and time frame k, a vector X(f,k) represents the entire detected soundfield at the given frequency f, where X(f,k)={x1(f,k), x2(f,k), . . . xM(f,k)}.
The vector X(f,k) is of size 1×M because it has one element per microphone: each element xi of the vector X(f,k) is a frequency domain representation of microphone signal 304(i) at frequency f (in frequency bin f). In other words, element x1 is the amplitude in frequency bin f from the Fourier transform of microphone signal 304(1), element x2 is the amplitude in frequency bin f from the Fourier transform of microphone signal 304(2), . . . , element xM is the amplitude in frequency bin f from the Fourier transform of microphone signal 304(M).
Given the vector X(f,k), a sub-soundfield signal vector Y(f,k) (of size 1×N), where Y(f,k)={y1(f,k), y2(f,k), . . . yN(f,k)}, may be calculated using a matrix transformation as follows: Y(f,k)=X(f,k)H(f).
H(f) is referred to as a frequency domain soundfield decomposition matrix of size M×N.
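As a non-limiting illustration, the per-bin transformation can be sketched in Python as a row vector multiplied by a matrix, which is the only product consistent with the stated sizes (X(f,k) of size 1×M, H(f) of size M×N, Y(f,k) of size 1×N). The helper names `dft_bin` and `decompose_bin`, the toy frames, and the entries of `H` are assumptions chosen for illustration; they are not values from the embodiments.

```python
import cmath

def dft_bin(samples, f):
    """Complex DFT coefficient of one time frame at frequency bin index f."""
    n = len(samples)
    return sum(s * cmath.exp(-2j * cmath.pi * f * t / n)
               for t, s in enumerate(samples))

def decompose_bin(X, H):
    """Compute Y(f,k) = X(f,k) H(f) for one bin.

    X is a length-M row vector; H has M rows and N columns.
    """
    M, N = len(H), len(H[0])
    assert len(X) == M
    return [sum(X[m] * H[m][n] for m in range(M)) for n in range(N)]

# Toy frame k: M = 3 microphone signals, 8 samples each (illustrative data).
mic_frames = [
    [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0],   # sine at bin 2
    [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0],   # cosine at bin 2
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],     # constant (bin 0 only)
]
f = 2
X = [dft_bin(frame, f) for frame in mic_frames]    # X(f,k), size 1 x M

# Hypothetical H(f) for N = 2 sub-soundfields: the first output sums
# microphones 1 and 2, the second passes microphone 3 through.
H = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0]]
Y = decompose_bin(X, H)                            # Y(f,k), size 1 x N
```

In practice the product would be repeated for each of the F frequency bins and each time frame k, with a separate matrix H(f) per bin.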
In a time domain embodiment of soundfield decomposition, soundfield processor 252 may decompose the detected soundfield into a set of N sub-soundfield signals in the time domain using a time domain decomposition matrix H(t) having elements hij(t) (i=1, . . . , N; j=1, . . . , M) that are time domain filters, which operate directly on microphone signals 304(1)-304(M). That is, the time domain decomposition matrix is a matrix of time domain filters.
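As a non-limiting illustration, a matrix of time domain filters can be sketched in Python as follows, assuming each element hij(t) is a short causal FIR filter and that sub-soundfield signal i is the sum over j of microphone signal j filtered by hij(t). The FIR assumption, the function names, and the tap values are hypothetical.

```python
def fir(signal, taps):
    """Causal FIR filter; the output has the same length as the input."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def decompose_time_domain(mics, h):
    """h[i][j] is the tap list that filters microphone j into sub-soundfield i."""
    T = len(mics[0])
    outputs = []
    for row in h:                       # one row of filters per sub-soundfield
        acc = [0.0] * T
        for mic, taps in zip(mics, row):
            acc = [a + b for a, b in zip(acc, fir(mic, taps))]
        outputs.append(acc)
    return outputs

# M = 2 microphone signals, N = 2 sub-soundfields (illustrative values).
mics = [[1.0, 2.0, 3.0, 4.0],
        [0.0, 1.0, 0.0, 1.0]]
h = [
    [[1.0], [0.0]],        # sub-soundfield 1: microphone 1 unchanged
    [[0.5], [0.0, 0.5]],   # sub-soundfield 2: 0.5*mic 1 + 0.5*mic 2 delayed by one sample
]
subs = decompose_time_domain(mics, h)
```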
In a beamforming embodiment of soundfield decomposition, a microphone array beamforming technique may be used to generate several audio beams from microphone signals 304(1)-304(M), and to point the audio beams at different angles or toward different spatial sections in order to divide the detected soundfield into sub-soundfields or a so-called “beamspace.”
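As a non-limiting illustration, one simple way to form such beams is delay-and-sum beamforming at a single frequency. The Python sketch below assumes a uniform linear array, far-field plane waves, and a speed of sound of 343 m/s; these assumptions, the beam angles, and all function names are hypothetical, and the embodiments are not limited to this beamformer.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def arrival_delays(mic_x, angle_rad):
    """Relative arrival time at each microphone for a far-field plane wave."""
    return [x * math.sin(angle_rad) / SPEED_OF_SOUND for x in mic_x]

def plane_wave(mic_x, angle_rad, freq_hz):
    """Frequency domain microphone vector for a unit plane wave (test input)."""
    return [cmath.exp(-2j * math.pi * freq_hz * d)
            for d in arrival_delays(mic_x, angle_rad)]

def beam_weights(mic_x, angle_rad, freq_hz):
    """Delay-and-sum weights that phase-align a wave from angle_rad."""
    M = len(mic_x)
    return [cmath.exp(2j * math.pi * freq_hz * d) / M
            for d in arrival_delays(mic_x, angle_rad)]

def beamspace(X, mic_x, angles_rad, freq_hz):
    """One output per beam; each beam acts as one sub-soundfield."""
    return [sum(x * w for x, w in zip(X, beam_weights(mic_x, a, freq_hz)))
            for a in angles_rad]

mic_x = [0.00, 0.04, 0.08, 0.12]                       # microphone positions in meters
angles = [math.radians(a) for a in (-40.0, 0.0, 40.0)]  # three beam directions
X = plane_wave(mic_x, angles[2], 2000.0)               # talker at +40 degrees
Y = beamspace(X, mic_x, angles, 2000.0)
beam_levels = [abs(y) for y in Y]
```

In this toy case the beam pointed at the talker passes the wave at unit gain, while the other two beams attenuate it.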
Sub-soundfield processor 254 processes each sub-soundfield signal 306(1)-306(N) separately/individually and in parallel with the other sub-soundfield signals to suppress echo, suppress reverberation (i.e., dereverberate), and reduce noise in the sub-soundfield signal, to produce a parallel set of processed sub-soundfield signals 308(1)-308(N) corresponding to sub-soundfield signals 306(1)-306(N), respectively. For example, sub-soundfield processor 254 applies acoustic echo control, dereverberation, and noise reduction processing to sub-soundfield signal vector Y, to obtain a processed sub-soundfield signal vector.
Mixer 256 mixes or combines the set of processed sub-soundfield signals 308(1)-308(N) into a mixed/combined audio signal 320 that is substantially free of undesired echo, reverberation, and other noise artifacts as a result of the sub-soundfield processing performed by sub-soundfield processor 254. Mixer 256 may receive one of microphone signals 304(1)-304(M), e.g., microphone signal 304(1), and use the received microphone signal in the mix process.
With reference to
Acoustic echo cancelers 402(1)-402(N) operate in parallel to separately cancel acoustic echo from respective ones of sub-soundfield signals 306(1)-306(N) based on loudspeaker signal 310, to produce parallel echo-canceled sub-soundfield signals 410(1)-410(N), respectively.
Multi-channel dereverberator 404 separately cancels/suppresses reverberation in each of echo-canceled sub-soundfield signals 410(1)-410(N) to produce echo-canceled, dereverberated sub-soundfield signals 412(1)-412(N), each corresponding to a respective one of sub-soundfield signals 306(1)-306(N). Thus, in the example of
Noise reducers 406(1)-406(N) operate in parallel to separately suppress residual echo and other noise artifacts in echo-canceled, dereverberated sub-soundfield signals 412(1)-412(N), respectively, to produce processed sub-soundfield signals 308(1)-308(N) as echo-canceled, dereverberated, and noise reduced processed sub-soundfield signals. Thus, in the example of
The order of cancelers 402(1)-402(N), multi-channel dereverberator 404, and noise reducers 406(1)-406(N) depicted in
With reference to
Dereverberator channel 500 dereverberates sub-soundfield signal 306(1) indirectly via echo-canceled sub-soundfield signal 410(1). That is, dereverberator channel 500 operates on echo-canceled sub-soundfield signal 410(1) to suppress reverberation in sub-soundfield signal 306(1). In dereverberator channel 500, echo-canceled sub-soundfield signal 410(1) represents a main capture channel, i.e., the signal from which reverberation is to be removed. Dereverberator channel 500 includes a summing node 501 to receive at a first input thereof echo-canceled sub-soundfield signal 410(1) from which reverberation is to be removed, and time delay units 502(1)-502(N−1) to receive echo-canceled sub-soundfield signals 410(2)-410(N) (i.e., all of the echo-canceled sub-soundfield signals, except for the echo-canceled sub-soundfield signal from which the reverberation is to be canceled). Time delay units 502(1)-502(N−1) introduce predetermined time delays (i.e., “delays”) into echo-canceled sub-soundfield signals 410(2)-410(N), respectively, relative to main capture channel 410(1). Time delay values used by time delay units 502(1)-502(N−1) may all be equal or may differ. The time delay values represent typical sound reverberation times expected in room 105. The larger the room, the larger the values. Example time delay values may range from 20 to 30 ms, although other values may be used depending on a size of room 105.
Time delay units 502(1)-502(N−1) output time-delayed versions of echo-canceled sub-soundfield signals 410(2)-410(N), respectively, to a reverberation estimator 504. Reverberation estimator 504 estimates reverberation in main capture channel 410(1) based on the time-delayed versions of echo-canceled sub-soundfield signals 410(2)-410(N), and outputs a reverberation estimate 506 to a second input of summing node 501. In an example, reverberation estimator 504 includes an adaptive filter to adaptively filter the delayed versions mentioned above, to produce reverberation estimate 506. The adaptive filter may use any known or hereafter developed adaptive filtering technique, including, for example, normalized least mean squares (NLMS), recursive least squares (RLS), or an affine projection algorithm (APA).
Summing node 501 subtracts reverberation estimate 506 only from main capture channel 410(1), to produce echo-canceled, dereverberated signal 412(1).
Thus, generally, for each sub-soundfield signal 306(i) to be dereverberated, multi-channel dereverberator 404 delays all of sub-soundfield signals 306(1)-306(N), except for the sub-soundfield signal 306(i), estimates reverberation in the sub-soundfield signal 306(i) based on the delayed sub-soundfield signals, and subtracts the estimated reverberation from sub-soundfield signal 306(i), to produce the corresponding dereverberated sub-soundfield signal.
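As a non-limiting illustration, one dereverberator channel can be sketched in Python using NLMS, which is one of the adaptive filtering techniques named above. The delay is expressed in samples, and the tap count, step size `mu`, function names, and synthetic test signals are assumptions made only for this sketch.

```python
import random

def delay_signal(x, d):
    """Delay a signal by d samples, padding the start with zeros."""
    return [0.0] * d + x[:len(x) - d]

def dereverberate_channel(main, others, delay, taps=8, mu=0.5, eps=1e-8):
    """Subtract an NLMS reverberation estimate from the main capture channel.

    main   : signal from which reverberation is to be removed
    others : all remaining channels, used as delayed references
    """
    refs = [delay_signal(ch, delay) for ch in others]
    w = [[0.0] * taps for _ in refs]          # one adaptive FIR per reference
    out = []
    for n in range(len(main)):
        # Most recent `taps` delayed samples of each reference (zeros before start).
        u = [[ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
             for ref in refs]
        estimate = sum(wc[k] * uc[k] for wc, uc in zip(w, u) for k in range(taps))
        error = main[n] - estimate            # summing node: main minus estimate
        norm = sum(v * v for uc in u for v in uc) + eps
        for wc, uc in zip(w, u):              # NLMS coefficient update
            for k in range(taps):
                wc[k] += mu * error * uc[k] / norm
        out.append(error)
    return out

def power(x):
    return sum(v * v for v in x) / len(x)

# Synthetic toy data: the main channel holds a component unrelated to the
# reference plus a late, scaled copy of the reference (the "reverberation").
rng = random.Random(0)
T, delay = 4000, 5
ref = [rng.gauss(0.0, 1.0) for _ in range(T)]
direct = [0.1 * rng.gauss(0.0, 1.0) for _ in range(T)]
tail = [0.6 * r for r in delay_signal(ref, delay + 2)]
main = [d + t for d, t in zip(direct, tail)]
out = dereverberate_channel(main, [ref], delay)
```

Because only delayed references feed the adaptive filter, components of the main channel that arrive earlier than the delay cannot be predicted from the references and therefore pass through the summing node largely unchanged.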
With reference to
Time-delay units 602(1)-602(N) introduce predetermined delays into respective ones of processed sub-soundfield signals 308(1)-308(N), to produce delayed versions of the processed sub-soundfield signals.
Weight calculator 606 receives one of microphone signals 304(1)-304(M), e.g., 304(1), and computes signal weights w(1)-w(N) based on the delayed versions of the processed sub-soundfield signals
Multipliers 604(1)-604(N) weight the delayed versions
Combiner 608 combines all of the weighted signals into a combined or mixed audio signal
The pre-delaying, weighting, and combining operations performed by Mixer 256 are collectively represented in the following equation:
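As a non-limiting illustration, the pre-delaying, weighting, and combining operations can be sketched in Python as a weighted sum of delayed signals. The function names, the delays in samples, and the weight values are assumptions; in the embodiments the weights w(1)-w(N) come from weight calculator 606.

```python
def delay_signal(x, d):
    """Delay a signal by d samples, padding the start with zeros."""
    return [0.0] * d + x[:len(x) - d]

def mix(processed, delays, weights):
    """Pre-delay each processed signal, weight it, and sum into one output."""
    T = len(processed[0])
    out = [0.0] * T
    for sig, d, w in zip(processed, delays, weights):
        delayed = delay_signal(sig, d)                    # time-delay unit
        out = [o + w * s for o, s in zip(out, delayed)]   # multiplier and combiner
    return out

# Two processed sub-soundfield signals (illustrative values).
processed = [[1.0, 2.0, 3.0, 4.0],
             [10.0, 20.0, 30.0, 40.0]]
mixed = mix(processed, delays=[0, 1], weights=[1.0, 0.5])
# mixed == [1.0, 7.0, 13.0, 19.0]
```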
With reference to
At 704, weight calculator 606 computes (i) microphone signal power power_mic1 of the one of the microphone signals (e.g., microphone signal 304(1)) received at the weight calculator, and (ii) a respective signal power power_subsfi (where i=1, . . . , N) of each processed sub-soundfield signal 308(i). Weight calculator 606 may compute each signal power based on either the corresponding processed sub-soundfield signal or its pre-delayed version because their signal powers are the same.
At 706, weight calculator 606 determines a minimum signal power channel_subsf_min and a maximum signal power channel_subsf_max among the respective signal powers of processed sub-soundfield signals 308(1)-308(N). For the previous frame, the maximum signal power channel_subsf_max_last has already been determined and stored.
At 708, weight calculator 606 performs multiple soundfield/sub-soundfield tests (also referred to simply as “soundfield tests” or just “tests”) based on the microphone signal power and the minimum and maximum signal powers. The multiple soundfield tests may include the following tests:
At 710, weight calculator 606 determines whether all of the multiple soundfield/sub-soundfield tests pass (i.e., evaluate to true).
At 712, if not all of the multiple soundfield/sub-soundfield tests pass, weight calculator 606 maintains weights w(1)-w(N) from the previous frame. That is, for the current frame, weight calculator 606 outputs the same weights used in the previous frame.
At 714, if all of the multiple soundfield/sub-soundfield tests pass, weight calculator 606:
In an example of operation 714, weight calculator 606 computes/assigns the weights as follows:
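The Python sketch below illustrates only the control flow of operations 704-714: compute the frame powers, find the minimum and maximum, evaluate the tests, and either hold the previous weights or assign new ones. It does not restate the tests of operation 708 or the weight assignment of operation 714. The test `mic_is_active`, its threshold, and the power-proportional weights are hypothetical placeholders.

```python
def frame_power(frame):
    """Mean-square power of one frame of samples."""
    return sum(v * v for v in frame) / len(frame)

def update_weights(mic_frame, processed_frames, prev_weights, tests):
    """Return weights for the current frame, holding the previous ones on failure."""
    power_mic1 = frame_power(mic_frame)                        # operation 704
    power_subsf = [frame_power(f) for f in processed_frames]
    stats = {
        "power_mic1": power_mic1,
        "power_subsf": power_subsf,
        "channel_subsf_min": min(power_subsf),                 # operation 706
        "channel_subsf_max": max(power_subsf),
    }
    if not all(test(stats) for test in tests):                 # operations 708-712
        return list(prev_weights)
    total = sum(power_subsf)
    if total <= 0.0:
        return list(prev_weights)
    # Hypothetical assignment for operation 714: weights proportional to power.
    return [p / total for p in power_subsf]

def mic_is_active(stats):
    """Hypothetical test: the microphone frame is above an assumed threshold."""
    return stats["power_mic1"] > 1e-4

prev = [0.5, 0.5]
held = update_weights([0.0] * 4, [[0.0] * 4, [0.0] * 4], prev, [mic_is_active])
updated = update_weights([1.0] * 4, [[1.0] * 4, [3.0] * 4], prev, [mic_is_active])
```

Here a silent frame leaves the weights unchanged, and an active frame favors the processed sub-soundfield signal with the higher power.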
Embodiments presented herein simplify an audio configuration used for audio/visual conferencing and reduce microphone clutter by eliminating the conventional collection of microphones used for video/audio conferencing. The embodiments also mitigate comb-filtering effects usually present in audio mixing. The embodiments process sub-soundfield signals separately from each other in corresponding ones of sub-soundfield signal processing channels that each include per channel/individualized echo-canceling, dereverberating, noise reducing, pre-delaying, and weighting, leading to combining of the channels in a last audio mixing operation, which may be an auto-mixing operation. Such individualized sub-soundfield signal processing advantageously leads to improved dereverberation in the mixed audio signal.
In summary, in one form, a method is provided comprising: at a microphone array, detecting a soundfield to produce a set of microphone signals each from a corresponding microphone of the microphone array, the set of microphone signals representative of the soundfield; decomposing the detected soundfield into a set of sub-soundfield signals based on the set of microphone signals; processing each sub-soundfield signal, including dereverberating each sub-soundfield signal to remove reverberation therefrom, to produce a set of processed sub-soundfield signals; and mixing the set of processed sub-soundfield signals into a mixed audio output signal.
In summary, in another form, an apparatus is provided comprising: a microphone array configured to detect a soundfield to produce a set of microphone signals each from a corresponding microphone in the microphone array, the set of microphone signals representative of the soundfield; and a processor coupled to the microphone array and configured to: decompose the detected soundfield into a set of sub-soundfield signals based on the set of microphone signals; process each sub-soundfield signal, including dereverberating each sub-soundfield signal to remove reverberation therefrom, to produce a set of processed sub-soundfield signals; and mix the set of processed sub-soundfield signals into a mixed output signal.
In summary, in yet another form, a non-transitory processor readable medium is provided to store instructions that, when executed by a processor, cause the processor to perform the methods described above. Stated otherwise, one or more non-transitory computer-readable storage media are encoded with software comprising computer executable instructions that, when executed, are operable to: receive, from a microphone array configured to detect a soundfield, a set of microphone signals each from a corresponding microphone of the microphone array, the set of microphone signals representative of the detected soundfield; decompose the detected soundfield into a set of sub-soundfield signals based on the set of microphone signals; process each sub-soundfield signal, including dereverberating each sub-soundfield signal to remove reverberation therefrom, to produce a set of processed sub-soundfield signals; and mix the set of processed sub-soundfield signals into a mixed output signal.
The above description is intended by way of example only. Various modifications and structural changes may be made therein without departing from the scope of the concepts described herein and within the scope and range of equivalents of the claims.