This application relates to the technical field of audio processing, and in particular, to a multi-channel echo cancellation technology.
In many audio processing scenarios, such as video conference systems and hands-free telephones, multi-channel audio signals emitted by multiple people usually occur at the same time. To clearly hear the multi-channel audio signals emitted by the multiple people at the same time, a voice communication device needs to perform echo cancellation on the obtained multi-channel audio signals. For example, assume that an A-terminal device and a B-terminal device emit audio signals at the same time, where the A terminal includes a microphone and a loudspeaker, and the B terminal also includes a microphone and a loudspeaker. A sound emitted by the loudspeaker of the B terminal may be picked up by the microphone of the B terminal and transmitted back to the A terminal, resulting in an unnecessary echo that needs to be cancelled.
Current echo cancellation methods usually have a large delay due to multiple echo paths, especially long echo paths. To reduce the delay, the order of a filter has to be increased, so that the calculation complexity of multi-channel echo cancellation becomes very high and multi-channel echo cancellation cannot be practically applied in production.
In accordance with the disclosure, there is provided a multi-channel echo cancellation method including obtaining far-end audio signals outputted by channels, obtaining a filter coefficient matrix corresponding to a kth frame of microphone signal outputted by a target microphone and including frequency domain filter coefficients of filter sub-blocks corresponding to the channels, performing frame-partitioning and block-partitioning processing on the far-end audio signals to determine a far-end frequency domain signal matrix corresponding to the kth frame of microphone signal and including far-end frequency domain signals of the filter sub-blocks, performing filtering processing according to the filter coefficient matrix and the far-end frequency domain signal matrix to obtain an echo signal in the kth frame of microphone signal, and performing echo cancellation according to a frequency domain signal of the kth frame of microphone signal and the echo signal in the kth frame of microphone signal to obtain a near-end audio signal outputted by the target microphone.
Also in accordance with the disclosure, there is provided a computer device including a memory storing program codes and a processor configured to execute the program codes to obtain far-end audio signals outputted by channels, obtain a filter coefficient matrix corresponding to a kth frame of microphone signal outputted by a target microphone and including frequency domain filter coefficients of filter sub-blocks corresponding to the channels, perform frame-partitioning and block-partitioning processing on the far-end audio signals to determine a far-end frequency domain signal matrix corresponding to the kth frame of microphone signal and including far-end frequency domain signals of the filter sub-blocks, perform filtering processing according to the filter coefficient matrix and the far-end frequency domain signal matrix to obtain an echo signal in the kth frame of microphone signal, and perform echo cancellation according to a frequency domain signal of the kth frame of microphone signal and the echo signal in the kth frame of microphone signal to obtain a near-end audio signal outputted by the target microphone.
Also in accordance with the disclosure, there is provided a non-transitory computer-readable storage medium storing program codes that, when executed by a processor, cause the processor to obtain far-end audio signals outputted by channels, obtain a filter coefficient matrix corresponding to a kth frame of microphone signal outputted by a target microphone and including frequency domain filter coefficients of filter sub-blocks corresponding to the channels, perform frame-partitioning and block-partitioning processing on the far-end audio signals to determine a far-end frequency domain signal matrix corresponding to the kth frame of microphone signal and including far-end frequency domain signals of the filter sub-blocks, perform filtering processing according to the filter coefficient matrix and the far-end frequency domain signal matrix to obtain an echo signal in the kth frame of microphone signal, and perform echo cancellation according to a frequency domain signal of the kth frame of microphone signal and the echo signal in the kth frame of microphone signal to obtain a near-end audio signal outputted by the target microphone.
The embodiments of this application will be described with reference to the accompanying drawings.
Referring to
To enable the far-end user to hear the near-end user clearly (that is, the near-end audio signal), acoustic echo cancellation (AEC) is required, and will be referred to as echo cancellation for convenience of description. In the related art, a large delay may be caused by an echo path and the like, and to reduce the delay, the order of the filter has to be increased, which may make the calculation complexity very high.
To solve the above technical problems, the embodiments of this application provide a multi-channel echo cancellation method. The method does not need to increase the order of the filter, but transforms the calculation into a frequency domain and combines the calculation with frame-partitioning and block-partitioning processing, thereby reducing the delay caused by the echo path and the like, greatly reducing the calculation amount and calculation complexity of multi-channel echo cancellation, and achieving better convergence performance.
The method provided by the embodiments of this application may be applied to a related application of voice communication scenarios or a related voice communication device, in particular to various scenarios of multi-channel voice communication requiring echo cancellation, such as an audio and video conference application, an online classroom application, a telemedicine application, and a voice communication device capable of performing hands-free calls. These are not limited by the embodiments of this application.
The method provided by the embodiments of this application may relate to the field of cloud technologies, such as cloud computing, cloud application, cloud education, and cloud conference.
For ease of understanding, a system architecture for implementing the multi-channel echo cancellation method provided by the embodiments of this application is described with reference to
The terminal 202 may include a loudspeaker 2021 and a microphone 2022. Because the microphone 2012 collects the far-end audio signal played by the multiple loudspeakers 2011 while collecting the near-end audio signal, to prevent a user corresponding to the terminal 202 from hearing his/her own echo, the terminal 202 may perform the multi-channel echo cancellation method provided by the embodiments of this application. This embodiment does not limit the numbers of loudspeakers 2021 and microphones 2022 included in the terminal 202; the number of loudspeakers 2021 may be one or more, and the number of microphones 2022 may also be one or more.
Each of the terminal 201 and the terminal 202 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart loudspeaker, a smart watch, a vehicle-mounted terminal, a smart television, a dedicated audio and video conference device, and the like, but is not limited thereto.
A server may support the terminal 201 and the terminal 202 in a background to provide a service (such as the audio and video conference) for the user. The server may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing a cloud computing service. The terminal 201, the terminal 202 and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
In the embodiments of this application, the terminal 202 may obtain multiple far-end audio signals, where the multiple far-end audio signals are far-end audio signals respectively outputted by multiple channels. The multiple channels may be channels formed by the multiple loudspeakers 2011 in
The terminal 202 may perform echo cancellation through frame partitioning and block partitioning. Therefore, in a case that a target microphone outputs a kth frame of microphone signal, the terminal 202 may obtain a first filter coefficient matrix corresponding to the kth frame of microphone signal, where the first filter coefficient matrix includes frequency domain filter coefficients of filter sub-blocks corresponding to the multiple channels, the filter sub-blocks being obtained by performing block partitioning on the filter.
Then, the terminal 202 performs frame-partitioning and block-partitioning processing according to the multiple far-end audio signals, and determines a far-end frequency domain signal matrix corresponding to the kth frame of microphone signal (a frame of microphone signal is also referred to as a "microphone signal frame"), where the far-end frequency domain signal matrix includes the far-end frequency domain signals of the filter sub-blocks corresponding to the multiple channels. In this way, during the filtering processing according to the first filter coefficient matrix and the far-end frequency domain signal matrix to obtain the echo signal in the kth frame of microphone signal, since the calculation is transformed into the frequency domain, and the fast Fourier transform is combined with the frame-partitioning and block-partitioning processing, it is unnecessary to wait for the entire far-end audio signal to be received before processing, so that the delay caused by the echo path and the like is reduced, and the calculation amount and calculation complexity are greatly reduced.
Thereafter, the terminal 202 may quickly realize echo cancellation according to the frequency domain signal of the kth frame of microphone signal and the echo signal in the kth frame of microphone signal, and obtain the near-end audio signal outputted by the target microphone.
In
It may be understood that the multi-channel echo cancellation method provided by the embodiments of this application may be integrated into an echo canceller, and the echo canceller is installed in the related application of the voice communication scenario or the related voice communication device, so as to cancel the echo of other users collected by the near-end voice communication device, retain only the voice spoken by local users, and improve voice communication experience.
The multi-channel echo cancellation method performed by the far-end terminal is described with reference to the accompanying drawings. Referring to
S301: Obtain multiple far-end audio signals, where the multiple far-end audio signals are the audio signals outputted by the multiple channels, respectively.
This embodiment takes the scenario of the audio and video conference as an example, where the far-end terminal and the near-end terminal may be any of the aforementioned devices capable of conducting an audio and video conference, for example, a dedicated audio and video conference device. The dedicated audio and video conference device supports multi-channel recording and playing, thereby greatly improving the call experience. Referring to
After being picked up again by the microphone, the far-end audio signals transmitted by the multiple loudspeakers will be transmitted back to the far-end terminal to form the echo signal. For example, in a room where the audio and video conference is held, the far-end audio signals played by the loudspeakers are reflected by obstacles such as walls, floors and ceilings, and the reflected voices and direct voices (that is, unreflected far-end audio signals) are picked up by microphones to form echo signals, so the multi-channel echo cancellation is required. In this scenario, the echo canceller may be installed in the dedicated audio and video conference device.
Taking the audio and video conference application as an example, in the audio and video conference application, user A enters an online conference room, and user A turns on the microphone and starts to speak, as shown in a user interface in
Only one specific use method is shown here, and other methods, for example, changing icons, changing prompt text content or text position on the user interface are also covered in this application. In addition, the example is the user interface corresponding to the scenario where many people conduct the audio and video conference, and other scenarios, such as the online classroom application and the telemedicine application, are presented in a similar way to the above, which are not elaborated here.
In a multi-channel scenario, that is, where the near-end terminal includes multiple loudspeakers, the multiple far-end audio signals are the audio signals outputted by the multiple loudspeakers included in the near-end terminal, and the far-end terminal may obtain the multiple far-end audio signals. The embodiments of this application provide multiple exemplary methods to obtain the multiple far-end audio signals. One method may be that the far-end terminal directly determines the multiple far-end audio signals according to the voice emitted by a corresponding user, and another method may be that the near-end terminal determines the multiple far-end audio signals outputted by the loudspeakers, so that the far-end terminal may obtain the multiple far-end audio signals from the near-end terminal.
S302: Obtain the first filter coefficient matrix corresponding to the kth frame of microphone signal outputted by the target microphone, where the first filter coefficient matrix includes the frequency domain filter coefficients of the filter sub-blocks corresponding to the multiple channels.
During echo cancellation, a filter is usually used for simulating the echo path, so that the echo signal obtained by the far-end audio signal passing through the echo path can be simulated by a processing result of the filter on the far-end audio signal (the processing result may be obtained through an operation of the filter coefficient of the filter and the far-end audio signal, such as a product operation). To reduce the delay of the echo cancellation, the far-end terminal may perform echo cancellation through frame partitioning and block partitioning. Therefore, during the echo cancellation performed for the kth frame of microphone signal, the filter may be subjected to block partitioning.
To perform block-partitioning processing on the filter is to partition the filter with a certain length into a plurality of parts, each part may be referred to as a filter sub-block, and each filter sub-block has a same length. For example, assuming that the length of the filter is N, and the filter is partitioned into P filter sub-blocks, the length of each filter sub-block is L=N/P. By performing block-partitioning processing on the filter, the original processing of an input far-end audio signal by one filter may be transformed into a parallel processing of the far-end audio signal by P parallel filter sub-blocks.
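For illustration only, the block partitioning described above can be sketched in NumPy as reshaping a length-N filter into P sub-blocks of length L = N/P (the variable names and values are illustrative assumptions, not part of the described method):

```python
import numpy as np

N, P = 8, 4      # example filter length and number of sub-blocks
L = N // P       # length of each filter sub-block, L = N/P

w = np.arange(N, dtype=float)   # stand-in for the length-N filter coefficients
sub_blocks = w.reshape(P, L)    # row p holds sub-block p: w[pL], ..., w[pL+L-1]

print(sub_blocks.shape)  # (4, 2)
```

Each row of `sub_blocks` can then process its part of the far-end audio signal in parallel.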
The filtering function of the filter is embodied by filter coefficients. In a case that the filter is partitioned into multiple filter sub-blocks, each filter sub-block may filter a corresponding far-end audio signal in parallel. The filtering function of the filter sub-block also needs to be embodied by corresponding filter coefficients obtained after the block-partitioning processing, so that each filter sub-block has the corresponding filter coefficient. Therefore, for each filter sub-block, the filter coefficient is used for operating with the far-end audio signal on the filter sub-block, thereby realizing the parallel processing of the far-end audio signals by the P parallel filter sub-blocks.
In addition, because the Fourier transform is fast and can be combined with the frame-partitioning and block-partitioning processing, the delay caused by the echo path and the like may be better reduced, and the calculation amount and calculation complexity are greatly reduced. Therefore, the embodiments of this application may transform the filter coefficient of each filter sub-block to the frequency domain through the Fourier transform, thereby obtaining the frequency domain filter coefficient of each filter sub-block. The frequency domain filter coefficients of the filter sub-blocks corresponding to the multiple channels may form the filter coefficient matrix. In this way, each frame of microphone signal has a corresponding filter coefficient matrix, which is used for performing operations with the corresponding far-end audio signal.
Based on this, when the kth frame of microphone signal outputted by the target microphone arrives, the far-end terminal may obtain the first filter coefficient matrix corresponding to the kth frame of microphone signal, where the first filter coefficient matrix includes the frequency domain filter coefficients of the filter sub-blocks corresponding to the multiple channels. The target microphone here refers to the microphone on the near-end terminal. The kth frame of microphone signal outputted by the target microphone is the kth frame of microphone signal collected by the target microphone, including the near-end audio signal and the echo signal (that is, the echo signal generated based on the multiple far-end audio signals), where k is an integer greater than or equal to 1.
In the embodiments of this application, the first filter coefficient matrix corresponding to the kth frame of microphone signal may be acquired by obtaining a second filter coefficient matrix corresponding to a (k−1)th frame of microphone signal. The second filter coefficient matrix includes the frequency domain filter coefficients of the filter sub-blocks corresponding to each channel when the target microphone outputs the (k−1)th frame of microphone signal, where k is an integer greater than 1. Further, the second filter coefficient matrix is iteratively updated to obtain the first filter coefficient matrix. That is, when a current frame of microphone signal (for example, the kth frame of microphone signal) arrives, the first filter coefficient matrix used for the multi-channel echo cancellation of the current frame of microphone signal may be iteratively updated according to the second filter coefficient matrix corresponding to a previous frame of microphone signal (for example, the (k−1)th frame of microphone signal), so that the filter coefficient matrix is continuously optimized and quickly converges.
The filter may be a Kalman filter, and the filter sub-blocks may be obtained by performing block-partitioning processing on a frequency domain Kalman filter, where the frequency domain Kalman filter includes at least two filter sub-blocks. Block-partitioned frequency-domain Kalman filtering is performed through a block-partitioned frequency-domain Kalman filter without performing nonlinear preprocessing on the far-end audio signal and without performing double-talk detection, thereby avoiding correlation interference in the multi-channel echo cancellation, reducing the calculation complexity, and improving the convergence efficiency.
To implement the steps shown in S302-S305 and obtain the first filter coefficient matrix by iterative updating, a frequency domain observation signal model and a frequency domain state signal model may be constructed first. The principle of constructing the observation signal model and the state signal model is described below with reference to the block diagram of a multi-channel recording and playing system shown in
The microphone signal y(n) at a discrete sampling time n is expressed as:

$$y(n)=\sum_{i=0}^{H-1}\mathbf{x}_i^T(n)\mathbf{w}_i(n)+v(n)\quad(1)$$

where the superscript T represents transposition, and x_i(n)=[x_i(n), ..., x_i(n-N+1)]^T represents an input signal vector of the ith channel with a length of N, that is, a vector representation of the far-end audio signal, referring to X0, . . . , XH in
Then, the observation signal model of the frequency domain is constructed based on a formula shown in (1). Frequency domain signal processing is based on frame processing. In a case that k represents the frame number, the echo path wi(n) is divided into P sub-blocks with equal length, and each sub-block may be referred to as the filter sub-block. In this scenario, when the target microphone outputs the kth frame of microphone signal, the filter coefficient of a pth filter sub-block corresponding to the ith channel is expressed as:
$$\mathbf{w}_{i,p}(k)=[w_{i,pL}(k),\ldots,w_{i,pL+L-1}(k)]^T\quad(2)$$
where L represents the length of each filter sub-block, and the length of each filter sub-block is L=N/P. w_{i,p}(k) is transformed to the frequency domain to obtain the following formula:

$$\mathbf{W}_{i,p}(k)=\mathbf{F}\left[\mathbf{w}_{i,p}^T(k),\,\mathbf{0}_{L\times 1}^T\right]^T\quad(3)$$

where F is a Fourier transform matrix of M×M (M=2L), and 0_{L×1} represents an all-zero column vector with the number of dimensions being L×1.
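For illustration only, the frequency domain transform of one sub-block's coefficients can be sketched in NumPy (variable names and values are illustrative assumptions): zero-pad the length-L sub-block to M = 2L samples, then apply the DFT.

```python
import numpy as np

L = 4                          # sub-block length (example value)
M = 2 * L                      # Fourier transform size, M = 2L
rng = np.random.default_rng(0)
w_p = rng.standard_normal(L)   # stand-in time-domain coefficients of one sub-block

# Formula (3): append an all-zero column vector of length L, then apply the DFT
W_p = np.fft.fft(np.concatenate([w_p, np.zeros(L)]))

print(W_p.shape)  # (8,)
```

Applying the inverse DFT recovers the L coefficients followed by L zeros, which is how the zero padding is verified.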
Further, based on xi(n)=[xi(n), . . . , xi(n−N+1)]T, frame-partitioning and block-partitioning processing is performed on the far-end audio signal of the pth filter sub-block of the ith channel, and the far-end audio signal is transformed to the frequency domain:
$$\mathbf{X}_{i,p}(k)=\mathrm{diag}\left\{\mathbf{F}[x_i(kL-pL-L),\ldots,x_i(kL-pL+L-1)]^T\right\}\quad(4)$$

where diag{·} represents the operation of transforming a vector into a diagonal matrix, and F[·] represents the Fourier transform.
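For illustration only, the frame-partitioning and block-partitioning of the far-end signal in formula (4) can be sketched as follows (variable names and values are illustrative assumptions; the diagonal matrix is kept as a vector, which is sufficient for element-wise operations):

```python
import numpy as np

L = 4
M = 2 * L
rng = np.random.default_rng(1)
x = rng.standard_normal(64)    # stand-in far-end signal of one channel

def far_end_block(x, k, p, L):
    """Formula (4): the 2L far-end samples seen by filter sub-block p at frame k,
    x(kL-pL-L), ..., x(kL-pL+L-1), transformed by the DFT. The result is the
    diagonal of X_{i,p}(k)."""
    start = k * L - p * L - L
    seg = np.zeros(2 * L)
    lo = max(start, 0)                      # zero history before the signal begins
    seg[lo - start:] = x[lo:start + 2 * L]
    return np.fft.fft(seg)

X_kp = far_end_block(x, k=5, p=1, L=L)
print(X_kp.shape)  # (8,)
```

Note that larger p reaches further into the past of the far-end signal, which is how each sub-block models a later portion of the echo path.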
Based on the formula shown in (1), the kth frame of microphone signal is transformed into the frequency domain signal to obtain the following formula:
$$\mathbf{Y}(k)=\sum_{i=0}^{H-1}\sum_{p=0}^{P-1}\mathbf{G}^{01}\mathbf{X}_{i,p}(k)\mathbf{W}_{i,p}(k)+\mathbf{V}(k)\quad(5)$$

where Y(k)=F[0_{1×L}, y(kL), ..., y(kL+L-1)]^T and V(k)=F[0_{1×L}, v(kL), ..., v(kL+L-1)]^T are respectively the frequency domain signal of the kth frame of microphone signal and the frequency domain signal of the near-end audio signal, G^{01} is a windowing matrix that ensures a result of cyclic convolution is consistent with that of linear convolution, and 0_{1×L} is an all-zero matrix with the number of dimensions of 1×L.
G^{01} may be expressed as:

$$\mathbf{G}^{01}=\mathbf{F}\begin{bmatrix}\mathbf{0}_L&\mathbf{0}_L\\\mathbf{0}_L&\mathbf{I}_L\end{bmatrix}\mathbf{F}^{-1}\quad(6)$$

where 0_L represents the all-zero matrix with the number of dimensions of L×L, I_L represents an identity matrix with the number of dimensions of L×L, and F represents the Fourier transform matrix. Further, the formula shown in (5) is rewritten into a more compact matrix-vector product form:
$$\mathbf{Y}(k)=\mathbf{X}(k)\mathbf{W}(k)+\mathbf{V}(k)\quad(7)$$
where X(k)=G^{01}[X_{1,0}(k), ..., X_{1,P-1}(k), ..., X_{H,0}(k), ..., X_{H,P-1}(k)] is a matrix composed of the frequency domain signals of the far-end audio signals of the H channels, and may be referred to as the far-end frequency domain signal matrix; W(k)=[W_{1,0}^T(k), ..., W_{1,P-1}^T(k), ..., W_{H,0}^T(k), ..., W_{H,P-1}^T(k)]^T is the first filter coefficient matrix corresponding to the kth frame of microphone signal, composed of all the filter sub-blocks of the H channels. So far, the frequency domain observation signal model under the framework of the multi-channel echo cancellation is constructed.
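For illustration only, the observation model (5)/(7) can be checked numerically in a single-channel sketch (H = 1; variable names and values are illustrative assumptions). In the time domain, applying G^{01} amounts to keeping the last L samples of each inverse DFT, and the partitioned-block result then matches a direct linear convolution:

```python
import numpy as np

rng = np.random.default_rng(2)
L, P = 8, 4
N, M = L * P, 2 * L
w = rng.standard_normal(N)        # stand-in echo path of a single channel (H = 1)
x = rng.standard_normal(40 * L)   # stand-in far-end signal

# Frequency domain sub-block coefficients, formula (3)
W = np.array([np.fft.fft(np.r_[w[p*L:(p+1)*L], np.zeros(L)]) for p in range(P)])

x_pad = np.r_[np.zeros(N + L), x]  # zero history so the earliest frames are defined

def echo_frame(k):
    """Formulas (5)/(7): echo estimate for output samples kL, ..., kL+L-1.
    Applying G01 amounts to keeping the last L samples of the inverse DFT."""
    acc = np.zeros(M, dtype=complex)
    for p in range(P):
        s = k * L - p * L + N                      # shifted index into x_pad
        acc += np.fft.fft(x_pad[s:s + M]) * W[p]   # X_{i,p}(k) W_{i,p}(k)
    return np.fft.ifft(acc).real[L:]

y_block = np.concatenate([echo_frame(k) for k in range(40)])
y_direct = np.convolve(x, w)[:40 * L]              # reference linear convolution
print(np.max(np.abs(y_block - y_direct)) < 1e-9)   # True
```

This agreement is exactly what the windowing matrix G^{01} guarantees: the cyclic convolution computed through the DFT reproduces the linear convolution of the echo path.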
Then, a frequency domain state signal model is constructed. In a real acoustic environment, the change of the echo path with time is very complex, and it is almost impossible to describe this change accurately with a model. Therefore, the embodiments of this application use a first-order Markov model to model the echo path, that is, the frequency domain state signal model:
$$\mathbf{W}(k)=A\,\mathbf{W}(k-1)+\Delta\mathbf{W}(k)\quad(8)$$

where A is a transition parameter that does not change with time, and W(k-1) is the second filter coefficient matrix corresponding to the (k-1)th frame of microphone signal; ΔW(k)=[ΔW_{1,0}^T(k), ..., ΔW_{1,P-1}^T(k), ..., ΔW_{H,0}^T(k), ..., ΔW_{H,P-1}^T(k)]^T represents a process noise vector with the number of dimensions being HLP×1, which has a zero mean value and is a random signal independent of W(k).
The covariance matrix of ΔW(k) is:
$$\psi_{\Delta}(k)=E\left[\Delta\mathbf{W}(k)\,\Delta\mathbf{W}^{\Phi}(k)\right]\quad(9)$$
where Φ represents a conjugate transposition, E represents computational expectation, and the covariance matrix of ΔW includes (HP)2 submatrices with the number of dimensions being N×N. Further, assuming that the process noises between different channels are independent of each other, ψΔ(k) may be approximated as the diagonal matrix:
$$\psi_{\Delta}(k)\approx(1-A^2)\,\mathrm{diag}\left\{\mathbf{W}(k)\odot\mathbf{W}^{\Phi}(k)\right\}\quad(10)$$
where ⊙ represents the dot product operation, and diag{·} represents the operation of transforming the vector into the diagonal matrix. In essence, the above formula describes the change of the echo path with time by using the transition parameter A and the energy of a real echo path. In a case that a noise signal covariance matrix (observation covariance matrix) can be accurately estimated, the process noise covariance matrix estimation method provided by the formula (10) may better cope with larger echo path changes, even with a larger parameter A.
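For illustration only, the diagonal approximation in formula (10) reduces to an element-wise squared magnitude in a vectorized sketch (variable names and values are illustrative assumptions):

```python
import numpy as np

A = 0.999                      # transition parameter (example value)
M = 8
rng = np.random.default_rng(3)
W = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # stand-in coefficients

# Formula (10): diagonal approximation of the process noise covariance; the
# diagonal of W(k) ⊙ W^Φ(k) is the element-wise squared magnitude |W|^2
psi_delta = (1 - A**2) * np.abs(W) ** 2

print(psi_delta.shape)  # (8,)
```

The factor (1 - A^2) ties the injected process noise to how quickly the echo path is assumed to drift: A close to 1 models a nearly static path.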
Based on the frequency domain observation signal model and the frequency domain state signal model established by the above methods, an accurate partitioned-block frequency domain Kalman filtering algorithm may be derived. When the partitioned-block frequency domain Kalman filtering algorithm is applied to the multi-channel echo cancellation, the second filter coefficient matrix is updated iteratively, and the first filter coefficient matrix may be obtained as follows. First, the observation covariance matrix corresponding to the kth frame of microphone signal and the state covariance matrix corresponding to the (k-1)th frame of microphone signal are obtained, where the observation covariance matrix and the state covariance matrix are diagonal matrices, respectively representing the uncertainty of a residual signal prediction value estimation and a state estimation in the Kalman filtering. Then, a gain coefficient is calculated according to the observation covariance matrix corresponding to the kth frame of microphone signal and the state covariance matrix corresponding to the (k-1)th frame of microphone signal, where the gain coefficient represents the influence of the residual signal prediction value estimation on the state estimation. Finally, the first filter coefficient matrix is determined according to the second filter coefficient matrix, the gain coefficient, and the residual signal prediction value corresponding to the kth frame of microphone signal, so that in the iterative updating process, the accuracy of the state estimation (that is, a new filter coefficient matrix such as the first filter coefficient matrix) is improved by continuously modifying the gain coefficient and the residual signal prediction value corresponding to the kth frame of microphone signal. In this case, an iterative update calculation formula of the first filter coefficient matrix may be:
$$\hat{\mathbf{W}}_i(k)=A\left(\hat{\mathbf{W}}_i(k-1)+\mathbf{K}_i(k)\mathbf{E}(k)\right)\quad(11)$$

where Ŵ_i(k-1) represents the second filter coefficient matrix corresponding to the (k-1)th frame of microphone signal, K_i(k) represents the gain coefficient, E(k) represents a frequency domain representation of the residual signal prediction value corresponding to the kth frame of microphone signal, and A represents the transition parameter. In a possible method, the observation covariance matrix corresponding to the kth frame of microphone signal may be obtained by the following steps: perform filtering processing according to the second filter coefficient matrix and the far-end frequency domain signal matrix to obtain the residual signal prediction value corresponding to the kth frame of microphone signal; and calculate the observation covariance matrix corresponding to the kth frame of microphone signal according to the residual signal prediction value corresponding to the kth frame of microphone signal.
The filtering processing according to the second filter coefficient matrix and the far-end frequency domain signal matrix may be as follows: perform product summation on the second filter coefficient matrix and the far-end frequency domain signal matrix, and the residual signal prediction value corresponding to the kth frame of microphone signal may represent the echo signal possibly corresponding to a next frame of microphone signal predicted based on the second filter coefficient matrix. Specifically, according to the above established frequency domain observation signal model, the frequency domain of the residual signal prediction value corresponding to the kth frame of microphone signal may be determined as:
$$\mathbf{E}(k)=\mathbf{Y}(k)-\sum_{i=0}^{H-1}\sum_{p=0}^{P-1}\mathbf{G}^{01}\mathbf{X}_{i,p}(k)\hat{\mathbf{W}}_{i,p}(k-1)\quad(12)$$

where E(k) represents the frequency domain representation of the residual signal prediction value corresponding to the kth frame of microphone signal, Y(k) represents the frequency domain signal of the kth frame of microphone signal, and Ŵ_{i,p}(k-1) represents the frequency domain filter coefficient of the pth filter sub-block of the ith channel in the second filter coefficient matrix.
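For illustration only, the residual of formula (12) can equivalently be formed in the time domain and then transformed, since Y(k) prefixes the frame with L zeros (variable names and values are illustrative assumptions):

```python
import numpy as np

L = 4
rng = np.random.default_rng(6)
y_frame = rng.standard_normal(L)   # current frame of the microphone signal
y_hat = rng.standard_normal(L)     # echo predicted with the second matrix

# Formula (12) evaluated in the time domain: subtract the predicted echo and
# move the length-L residual to the frequency domain as E(k) = F[0_{1xL}, e]^T
e = y_frame - y_hat
E = np.fft.fft(np.r_[np.zeros(L), e])

print(E.shape)  # (8,)
```

The leading zeros mirror the windowing convention of Y(k) and V(k) in formula (5).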
When the observation covariance matrix corresponding to the kth frame of microphone signal is calculated, the calculation may be combined with the observation covariance matrix corresponding to the (k-1)th frame of microphone signal. When the filter has converged to a steady state, the residual signal prediction value is very close to a real noise vector, so the calculation formula of the observation covariance matrix corresponding to the kth frame of microphone signal is as follows:
$$\psi_S(k)=\alpha\,\psi_S(k-1)+(1-\alpha)\,\mathrm{diag}\left\{\mathbf{E}(k)\odot\mathbf{E}^{\Phi}(k)\right\}\quad(13)$$

where ψ_S(k) represents the observation covariance matrix corresponding to the kth frame of microphone signal, ψ_S(k-1) represents the observation covariance matrix corresponding to the (k-1)th frame of microphone signal, α is a smoothing factor set according to practical experience, E(k) is the frequency domain representation of the residual signal prediction value corresponding to the kth frame of microphone signal, ⊙ represents the dot product operation, diag{·} represents the operation of transforming the vector into the diagonal matrix, and Φ represents the conjugate transposition.
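For illustration only, the recursive smoothing of formula (13) is a one-line vectorized update on the diagonal (variable names and values are illustrative assumptions):

```python
import numpy as np

alpha = 0.9                    # smoothing factor (example value)
M = 8
rng = np.random.default_rng(4)
psi_prev = np.ones(M)          # observation covariance diagonal for frame k-1
E = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # residual spectrum E(k)

# Formula (13): recursive average of the residual power spectrum; the diagonal
# of E(k) ⊙ E^Φ(k) is the element-wise squared magnitude |E|^2
psi_s = alpha * psi_prev + (1 - alpha) * np.abs(E) ** 2

print(psi_s.shape)  # (8,)
```

A larger α gives a smoother, slower-reacting estimate; a smaller α tracks sudden residual changes faster.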
In this embodiment, the state covariance matrix corresponding to the (k−1)th frame of microphone signal may be obtained by calculating the state covariance matrix corresponding to the (k−1)th frame of microphone signal according to the second filter coefficient matrix.
Specifically, the calculation formula is as follows:
where P_{i,j}(k-1) represents the state covariance matrix corresponding to the (k-1)th frame of microphone signal, P_{i,j}(k-2) represents the state covariance matrix corresponding to a (k-2)th frame of microphone signal, K_i(k-1) represents the gain coefficient corresponding to the (k-1)th frame of microphone signal,
Some variables corresponding to the (k-1)th frame of microphone signal, such as the gain coefficient and the second filter coefficient matrix, may be calculated according to the variables corresponding to the previous frame of microphone signal, or may be set to initial values. Similarly, the state covariance matrix corresponding to the (k-1)th frame of microphone signal and the state covariance matrix corresponding to the (k-2)th frame of microphone signal may also be set to initial values.
According to the observation covariance matrix corresponding to the kth frame of microphone signal and the state covariance matrix corresponding to the (k−1)th frame of microphone signal, the gain coefficient may be calculated by first calculating a gain estimation intermediate variable:
where DX(k) is the gain estimation intermediate variable, R is the frame shift, M is the frame length,
The formula for calculating the gain coefficient may be:
where Ki(k) represents the gain coefficient, Pi,j(k−1) represents the state covariance matrix corresponding to the (k−1)th frame of microphone signal,
S303: Perform frame-partitioning and block-partitioning processing on the multiple far-end audio signals to determine the far-end frequency domain signal matrix corresponding to the kth frame of microphone signal, where the far-end frequency domain signal matrix includes the far-end frequency domain signals of the filter sub-blocks corresponding to the multiple channels.
S304: Perform filtering processing according to the first filter coefficient matrix and the far-end frequency domain signal matrix to obtain the echo signal in the kth frame of microphone signal.
The far-end audio signal may include multiple frames, and the embodiments of this application perform the echo cancellation frame by frame on the microphone signal. During the echo cancellation for the kth frame of microphone signal, the far-end audio signal corresponding to the kth frame of microphone signal needs to be selected from multiple frames of far-end audio signals to realize the echo cancellation in units of frames.
In addition, to reduce the delay, in S302 the filter is subjected to block partitioning so that the far-end audio signal is processed in parallel by the multiple filter sub-blocks obtained after block partitioning, that is, each filter sub-block processes a part of the far-end audio signal. Based on this, the far-end terminal performs frame-partitioning and block-partitioning processing on the multiple far-end audio signals to obtain the far-end audio signal corresponding to the kth frame of microphone signal, and the far-end audio signal is partitioned into the same number of parts as there are filter sub-blocks, where each part corresponds to one filter sub-block, and the parts corresponding to the multiple frames of far-end audio signals form the far-end audio signal matrix. Therefore, during the echo cancellation on the kth frame of microphone signal, the multiple filter sub-blocks process the far-end audio signal corresponding to the kth frame of microphone signal in parallel, that is, each filter sub-block processes its corresponding part of the far-end audio signal.
Because the Fourier transform is fast and is combined with the frame-partitioning and block-partitioning processing, it is unnecessary to wait for the complete far-end audio signals before processing the microphone signal outputted by the target microphone, thereby reducing the delay caused by the echo path and greatly reducing the calculation amount and calculation complexity. Therefore, the embodiments of this application may transform the far-end audio signal after frame-partitioning and block-partitioning processing to the frequency domain through the Fourier transform, thereby obtaining the frequency domain representation of the far-end audio signal, and correspondingly, the far-end audio signal matrix is transformed into the far-end frequency domain signal matrix.
The far-end frequency domain signal matrix includes the far-end frequency domain signals of the filter sub-blocks corresponding to the multiple channels. In this way, during the filtering processing according to the first filter coefficient matrix and the far-end frequency domain signal matrix to obtain the echo signal in the kth frame of microphone signal, the calculation is transformed into the frequency domain, thereby reducing the delay caused by the echo path and the like, and greatly reducing the calculation amount and calculation complexity.
In a possible implementation, the far-end frequency domain signal matrix may be determined by performing frame-partitioning and block-partitioning processing on the multiple far-end audio signals as follows: obtain the far-end frequency domain signal of each filter sub-block corresponding to the multiple channels according to a preset frame shift and a preset frame length by adopting an overlap-save algorithm, and the far-end frequency domain signals of the filter sub-blocks corresponding to the multiple channels form the far-end frequency domain signal matrix. The preset frame shift may be represented by R.
Based on the construction process of the frequency domain observation signal model, frame-partitioning and block-partitioning processing is performed on the far-end audio signals xh(n) of the H channels to obtain a vector xh,l(k), which represents the far-end audio signal of the lth filter sub-block corresponding to the hth channel. The length of each filter sub-block is 2N (equivalent to L in the construction process of the frequency domain observation signal model), and the frame shift is N (equivalent to the preset frame shift R), which is specifically expressed as:
xh,l(k)=[xh((k−l−1)N), . . . , xh((k−l+1)N−1)]T (17)
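Formula (17) can be illustrated with a minimal sketch (the function and variable names are hypothetical): each block takes 2N consecutive samples of one channel's far-end signal, and because the frame shift is only N, consecutive blocks overlap by half.

```python
import numpy as np

def partition_block(x, k, l, N):
    """Extract the far-end block x_{h,l}(k) of formula (17): 2N samples
    of one channel's far-end signal with frame shift N. The indices
    follow (17): samples (k-l-1)N through (k-l+1)N-1."""
    start = (k - l - 1) * N
    return x[start:start + 2 * N]
```

Note that the block index l shifts the window backwards in time, so the lth sub-block filters older far-end samples; valid indexing requires k ≥ l + 1.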
Frame-partitioning processing is performed on the target microphone signal, such as the microphone signal collected by a tth microphone, to obtain a vector yt(k), which represents the kth frame of the microphone signal outputted by the target microphone, specifically expressed as follows (in the description of the following steps, the microphone number t is omitted):
yt(k)=[yt(kN), . . . , yt(kN+N−1)]T (18)
where T represents a transpose operation.
Frame partitioning and zero filling are performed on the residual signal prediction value e(k) corresponding to the kth frame of microphone signal:
e(k)=[01×N,e(kN), . . . ,e(kN+N−1)]T (19)
where 01×N represents an all-zero row vector of dimension 1×N, and T represents the transpose operation.
The filter coefficient is determined:
wh(n)=[wh,0T(n), . . . , wh,L−1T(n)]T (20)
wh,l(n)=[wh,lN(n), . . . , wh,(l+1)N−1(n)]T (21)
where wh(n) is the time domain representation of the filter coefficient corresponding to the hth channel, wh,l(n) is the time domain representation of the filter coefficient of the lth filter sub-block corresponding to the hth channel, and n represents the discrete sampling time.
Fourier transform is performed respectively on the time domain vectors xh,l(k) and wh,l(n) in (17) and (21) to obtain the frequency domain representations:
Xh,l(k)=diag{Fxh,l(k)} (22)
Wh,l(k)=F[wh,lT(kN), 01×N]T (23)
where F represents the Fourier transform matrix, mod is a remainder operation, and L is the number of the filter sub-blocks (equivalent to P in the construction process of the frequency domain observation signal model).
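The frequency domain representations (22) and (23) can be sketched as follows, assuming an FFT of length 2N; the diagonal matrix of (22) is stored as a plain vector of its diagonal, and the function names are hypothetical.

```python
import numpy as np

def freq_block_signal(x_block):
    """X_{h,l}(k) per (22): FFT of a 2N-sample far-end block.
    The diag{.} matrix is represented by its diagonal vector."""
    return np.fft.fft(x_block)

def freq_filter_block(w_block):
    """W_{h,l}(k) per (23): FFT of the N filter taps zero-padded to
    length 2N, as required by the overlap-save scheme."""
    N = len(w_block)
    return np.fft.fft(np.concatenate([w_block, np.zeros(N)]))
```

Zero-padding the N filter taps to length 2N before transforming is what makes the later frequency-domain products equivalent to linear (rather than circular) convolution in the overlap-save scheme.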
According to the above representations, the filtering processing according to the first filter coefficient matrix and the far-end frequency domain signal matrix may be a product summation operation of the frequency domain filter coefficients and the far-end frequency domain signals of the filter sub-blocks corresponding to the multiple channels.
S305: Perform the echo cancellation according to the frequency domain signal of the kth frame of microphone signal and the echo signal in the kth frame of microphone signal to obtain the near-end audio signal outputted by the target microphone.
After the echo signal is obtained based on the above steps, the far-end terminal may subtract the echo signal from the frequency domain signal of the kth frame of microphone signal, thereby realizing the echo cancellation and obtaining the near-end audio signal outputted by the target microphone.
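The filtering of S304 and the subtraction of S305 can be sketched together for one frame, assuming the overlap-save layout above (sub-block length 2N, frame shift N); the function name and array layout are assumptions, not the patented implementation.

```python
import numpy as np

def cancel_echo(W, X, y_frame):
    """Sketch of S304/S305 for the kth frame.

    W, X    : (H*L, 2N) frequency domain filter coefficients and far-end
              frequency domain signals of all filter sub-blocks
    y_frame : (N,) kth frame of the microphone signal (time domain)
    """
    # S304 filtering: product-summation over all channels and sub-blocks,
    # i.e. elementwise frequency-domain products, summed.
    Y_echo = np.sum(W * X, axis=0)
    # Return to the time domain; overlap-save keeps only the last N samples.
    echo = np.real(np.fft.ifft(Y_echo))[-len(y_frame):]
    # S305 cancellation: subtract the echo estimate from the microphone frame.
    return y_frame - echo
```

Discarding the first N output samples of each inverse transform is the overlap-save step that removes the circular-convolution wrap-around.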
The target microphone is located on the voice communication device, where the voice communication device may include a microphone which is the target microphone. The obtained near-end audio signal outputted by the target microphone is used as the final signal to be played to the far-end user.
In some cases, the voice communication device may include multiple microphones, for example, T microphones, where T is an integer greater than 1. The target microphone is a tth microphone of the T microphones, where 0≤t≤T−1 and t is an integer. In this case, the obtained near-end audio signal outputted by the target microphone is the near-end audio signal outputted by each microphone. At this time, signal mixing may be performed on the near-end audio signals outputted by the T microphones, respectively, to obtain the target audio signal, thereby improving the quality of the target audio signal played to the far-end user through mixing.
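The mixing step for the T microphones can be sketched as follows; simple averaging is only one possible mixing rule, assumed here for illustration.

```python
import numpy as np

def mix_near_end(signals):
    """Sketch of the signal mixing step: combine the near-end audio
    signals respectively outputted by the T microphones into one
    target audio signal by averaging (an assumed mixing rule)."""
    return np.mean(np.stack(signals), axis=0)
```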
Referring to
In some cases, because the near-end audio signal may include the voice signal and the background noise, to obtain a clearer voice signal, the background noise included in the near-end audio signal may be cancelled. Because the T microphones may output T near-end audio signals, to avoid cancelling the background noise for each near-end audio signal separately, the background noise included in the target audio signal may be estimated after the target audio signal is obtained, so that the background noise may be cancelled from the target audio signal to obtain the near-end voice signal.
The background noise cancellation of each near-end audio signal is avoided by performing signal mixing first and then cancelling the background noise, thereby reducing the calculation amount and improving the calculation efficiency.
It can be seen from the technical solutions that in a scenario of the multi-channel echo cancellation, the multiple far-end audio signals outputted by the multiple channels may be obtained, and when the target microphone outputs the kth frame of microphone signal, the first filter coefficient matrix corresponding to the kth frame of microphone signal may be obtained, where the first filter coefficient matrix includes the frequency domain filter coefficients of the filter sub-blocks corresponding to the multiple channels. Then, frame-partitioning and block-partitioning processing is performed on the multiple far-end audio signals to determine a far-end frequency domain signal matrix, where the far-end frequency domain signal matrix includes the far-end frequency domain signals of the filter sub-blocks corresponding to the multiple channels. In this way, when filtering processing is performed according to the first filter coefficient matrix and the far-end frequency domain signal matrix to obtain the echo signal in the kth frame of microphone signal, the calculation is transformed into the frequency domain. Because the Fourier transform is fast and suitable for frequency domain calculation, and is combined with the frame-partitioning and block-partitioning processing, it is unnecessary to wait for the complete far-end audio signals before processing the microphone signal outputted by the target microphone, so the delay caused by an echo path and the like can be reduced, and the calculation amount and calculation complexity can be greatly reduced. Then, according to the frequency domain signal of the kth frame of microphone signal and the echo signal in the kth frame of microphone signal, the echo cancellation may be implemented quickly to obtain the near-end audio signal outputted by the target microphone.
According to this solution, it is unnecessary to increase the order of the filter, but the calculation is transformed into the frequency domain and is combined with the frame-partitioning and block-partitioning processing, thereby reducing the delay caused by the echo path and the like, greatly reducing the calculation amount and calculation complexity of multi-channel echo cancellation, and achieving better convergence performance.
Method 2 is to model each echo path independently, and finally copy the independently modeled coefficients to a new filter. In a case that the echo path is stable, this solution may estimate each echo path more accurately. However, it is in essence still a normalized least mean square (NLMS) method, which has the defects of low convergence speed, poor stability against changing paths, and the like. Furthermore, as the number of channels increases, the implementation complexity multiplies.
Compared with Method 1 and Method 2, the method provided by the embodiments of this application has a significant performance advantage. It is unnecessary to perform any nonlinear preprocessing on the far-end audio signal and to adopt a double-end intercom detection method, thereby avoiding the correlation interference in the multi-channel echo cancellation, reducing the calculation complexity, and improving the convergence efficiency.
Then, taking the case where the filter is a partitioned-block frequency domain Kalman filter performing block-partitioning frequency domain Kalman filtering as an example, in a case that the set transition parameter is A=0.9999, all the state covariance matrices are initialized to an identity matrix IN, and the performance of the multi-channel echo cancellation method (that is, the solution in
Referring to
As shown in
The implementations provided in the above aspects may be further combined to provide more implementations.
Based on the multi-channel echo cancellation method provided by the embodiment corresponding to
The acquisition unit 1201 is configured to obtain the multiple far-end audio signals, where the multiple far-end audio signals are audio signals respectively outputted by the multiple channels.
The acquisition unit 1201 is further configured to obtain the first filter coefficient matrix corresponding to the kth frame of microphone signal outputted by the target microphone, where the first filter coefficient matrix includes the frequency domain filter coefficients of the filter sub-blocks corresponding to the multiple channels, and k is an integer greater than or equal to 1.
The determining unit 1202 is configured to perform the frame-partitioning and block-partitioning processing on the multiple far-end audio signals to determine the far-end frequency domain signal matrix corresponding to the kth frame of microphone signal, where the far-end frequency domain signal matrix includes the far-end frequency domain signals of the filter sub-blocks corresponding to the multiple channels.
The filtering unit 1203 is configured to perform the filtering processing according to the first filter coefficient matrix and the far-end frequency domain signal matrix to obtain the echo signal in the kth frame of microphone signal.
The cancellation unit 1204 is configured to perform the echo cancellation according to the frequency domain signal of the kth frame of microphone signal and the echo signal in the kth frame of microphone signal to obtain the near-end audio signal outputted by the target microphone.
In a possible implementation, the acquisition unit 1201 is specifically configured to:
obtain a second filter coefficient matrix corresponding to the (k−1)th frame of microphone signal, where the second filter coefficient matrix includes the frequency domain filter coefficients of the filter sub-blocks corresponding to the multiple channels, and k is an integer greater than 1; and
update the second filter coefficient matrix iteratively to obtain the first filter coefficient matrix.
In a possible implementation, the acquisition unit 1201 is specifically configured to:
obtain the observation covariance matrix corresponding to the kth frame of microphone signal and the state covariance matrix corresponding to the (k−1)th frame of microphone signal, where the observation covariance matrix and the state covariance matrix are diagonal matrices;
calculate a gain coefficient according to the observation covariance matrix corresponding to the kth frame of microphone signal and the state covariance matrix corresponding to the (k−1)th frame of microphone signal; and
determine the first filter coefficient matrix according to the second filter coefficient matrix, the gain coefficient and the residual signal prediction value corresponding to the kth frame of microphone signal.
In a possible implementation, the acquisition unit 1201 is specifically configured to:
perform filtering processing according to the second filter coefficient matrix and the far-end frequency domain signal matrix to obtain the residual signal prediction value corresponding to the kth frame of microphone signal;
calculate the observation covariance matrix corresponding to the kth frame of microphone signal according to the residual signal prediction value corresponding to the kth frame of microphone signal; and
calculate the state covariance matrix corresponding to the (k−1)th frame of microphone signal according to the second filter coefficient matrix.
In a possible implementation, the determining unit 1202 is configured to:
obtain the far-end frequency domain signal of each filter sub-block corresponding to the multiple channels by adopting an overlap-save algorithm according to a preset frame shift and a preset frame length; and
use the far-end frequency domain signal of each filter sub-block corresponding to the multiple channels to form the far-end frequency domain signal matrix.
In a possible implementation, the target microphone is located on the voice communication device. The voice communication device includes T microphones, where T is an integer greater than 1, the target microphone is a tth microphone of the T microphones, 0≤t≤T−1 and t is an integer. The apparatus further includes an audio mixing unit.
The audio mixing unit is configured to perform signal mixing on the near-end audio signals respectively outputted by the T microphones to obtain a target audio signal.
In a possible implementation, the apparatus further includes an estimation unit.
The estimation unit is configured to estimate the background noise included in the target audio signal.
The cancellation unit 1204 is further configured to cancel the background noise from the target audio signal to obtain the near-end voice signal.
The embodiments of this application further provide a computer device. The computer device may be a voice communication device. For example, the voice communication device may be a terminal. Taking the case where the terminal is a smart phone as an example:
The memory 1320 may be configured to store software programs and modules, and the processor 1380 performs various functional applications and data processing of the smart phone by running the software programs and modules stored in the memory 1320. The memory 1320 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (for example, a sound playing function and an image playing function), or the like. The data storage area may store data (such as audio data and a phone book) created according to the use of the smart phone. In addition, the memory 1320 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
The processor 1380 is a control center of the smart phone, connects various parts of the whole smart phone by various interfaces and lines, and performs various functions and processes data of the smart phone by running or executing software programs and/or modules stored in the memory 1320 and recalling data stored in the memory 1320, thereby monitoring the whole smart phone. Optionally, the processor 1380 may include one or more processing units. In some embodiments, the processor 1380 may integrate an application processor and a modem processor, where the application processor mainly processes an operating system, a user interface, and an application program; and the modem processor mainly processes wireless communication. It may be understood that the modem processor described above may also not be integrated into the processor 1380.
In this embodiment, the processor 1380 in the smart phone may perform the multi-channel echo cancellation method provided by the embodiments of this application.
The embodiments of this application further provide a server. Referring to
The server 1400 may further include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input and output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
In this embodiment, the steps performed by the central processor 1422 in the server 1400 may be implemented based on the structure shown in
According to one aspect of this application, a computer-readable storage medium is provided. The computer-readable storage medium is configured to store program codes, and the program codes are used for performing the multi-channel echo cancellation method in the foregoing embodiments.
According to one aspect of this application, a computer program product or a computer program is provided. The computer program product or the computer program includes a computer instruction, and the computer instruction is stored in the computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction to cause the computer device to perform the methods provided in various optional implementations of the above embodiments.
The descriptions of the process or the structure corresponding to the accompanying drawings have different emphases; for parts not detailed in a certain process or structure, reference may be made to the related descriptions of other processes or structures.
Terms such as “first,” “second,” “third” and “fourth” (in a case that they are present) in the specification of this application and in the above accompanying drawings are intended to distinguish similar objects but do not necessarily describe a specific order or sequence. It is to be understood that the data used in such a way is interchangeable in proper circumstances, so that the embodiments of this application described herein, for example, can be implemented in a sequence other than the sequence illustrated or described herein. In addition, the terms “including” and “having” and any variations thereof are intended to cover non-exclusive inclusion, for example, the processes, methods, systems, products, or devices including a series of steps or units are not necessarily limited to those steps or units explicitly listed, but may include steps or units not explicitly listed, or the other steps or units inherent to the processes, methods, systems, products or devices.
In several embodiments of this application, it is to be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the embodiments of the apparatus described above are merely schematic; for example, division into the units is only logical function division, and there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into other systems, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separated, and parts shown as units may or may not be physical units, that is, may be located in one location, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solutions of the embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit. The foregoing integrated unit may be implemented either in the form of hardware or in the form of software functional units.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the essence, or the part which contributes to conventional technologies, or all or part of the technical solution of this application may be embodied in the form of a software product, and the computer software product is stored in a storage medium, including several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of this application. The storage medium includes: any medium that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
As described above, the above embodiments are only used to illustrate the technical solutions of this application, but not to limit them; although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art may understand that: they can still make modifications to the technical solutions described in the foregoing examples, or make equivalent replacement to some technical characteristics; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of this application.
| Number | Date | Country | Kind |
| --- | --- | --- | --- |
| 202111424702.9 | Nov 2021 | CN | national |
This application is a continuation of International Application No. PCT/CN2022/122387, filed on Sep. 29, 2022, which claims priority to Chinese Patent Application No. 202111424702.9, filed with the China National Intellectual Property Administration on Nov. 26, 2021 and entitled "MULTI-CHANNEL ECHO CANCELLATION METHOD AND RELATED APPARATUS," both of which are incorporated herein by reference in their entirety.
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | PCT/CN2022/122387 | Sep 2022 | US |
| Child | 18456054 | | US |