This application relates to the field of computer technologies, and in particular, to an audio signal enhancement method and apparatus, a computer device, a storage medium and a computer program product.
In the process of encoding and decoding audio signals, quantization noise often occurs, which distorts the speech synthesized by decoding. In the traditional solution, a pitch filter or neural-network-based post-processing technology is usually used to enhance audio signals, to reduce the influence of quantization noise on speech quality.
It is therefore important to improve signal processing speed, reduce latency, and improve quality of speech enhancement.
According to embodiments of this application, an audio signal enhancement method and apparatus, a computer device, a storage medium and a computer program product are provided.
One aspect of the present application provides an audio signal enhancement method, performed by a computer device. The method includes decoding received speech packets sequentially to obtain a residual signal, long term filtering parameters and linear filtering parameters; filtering the residual signal to obtain an audio signal; extracting feature parameters from the audio signal, when the audio signal is a feedforward error correction frame signal; converting the audio signal into a filter speech excitation signal based on the linear filtering parameters; performing speech enhancement on the filter speech excitation signal according to the feature parameters, the long term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and performing speech synthesis to obtain an enhanced speech signal based on the enhanced speech excitation signal and the linear filtering parameters.
A computer device, including a memory and a processor, the memory storing a computer program, the processor, when executing the computer program, implementing the following steps: decoding received speech packets sequentially to obtain a residual signal, long term filtering parameters and linear filtering parameters; filtering the residual signal to obtain an audio signal; extracting feature parameters from the audio signal, when the audio signal is a feedforward error correction frame signal; converting the audio signal into a filter speech excitation signal based on the linear filtering parameters; performing speech enhancement on the filter speech excitation signal according to the feature parameters, the long term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and performing speech synthesis to obtain an enhanced speech signal based on the enhanced speech excitation signal and the linear filtering parameters.
A non-transitory computer-readable storage medium is provided, storing a computer program, the computer program, when executed by a processor, implementing the following steps: decoding received speech packets sequentially to obtain a residual signal, long term filtering parameters and linear filtering parameters; filtering the residual signal to obtain an audio signal; extracting feature parameters from the audio signal, when the audio signal is a feedforward error correction frame signal; converting the audio signal into a filter speech excitation signal based on the linear filtering parameters; performing speech enhancement on the filter speech excitation signal according to the feature parameters, the long term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and performing speech synthesis to obtain an enhanced speech signal based on the enhanced speech excitation signal and the linear filtering parameters.
Details of one or more embodiments of this application are provided in the accompanying drawings and descriptions below. Other features and advantages of this application are illustrated in the specification, the accompanying drawings, and the claims.
The accompanying drawings described herein are used to provide a further understanding of this application, and form a part of this application. Exemplary embodiments of this application and descriptions thereof are used to explain this application, and do not constitute any inappropriate limitation to this application. In the appended drawings:
To make objectives, technical solutions, and advantages of this application clearer and more understandable, this application is further described in detail below with reference to the accompanying drawings and the embodiments. It is to be understood that the specific embodiments described herein are only used for explaining this application, and are not used for limiting this application.
Before describing an audio signal enhancement method provided in this application, a speech generation model will be described first. Referring to a speech generation model based on excitation signals shown in
(1) At the trachea, a noise-like impact signal with a certain energy is generated, which corresponds to the excitation signal in the speech generation model based on excitation signals.
(2) The impact signal impacts the vocal cords of humans to make the vocal cords produce quasi-periodic opening and closing, which is amplified by the oral cavity to produce sound. This sound corresponds to filters in the speech generation model based on excitation signals.
In this process, considering the characteristics of sound, the filters in the speech generation model based on excitation signals are divided into long term prediction (LTP) filters and linear predictive coding (LPC) filters. The LTP filter enhances the audio signal based on long term correlations of speech, and the LPC filter enhances the audio signal based on short term correlations. Specifically, for quasi-periodic signals such as voiced sound, in the speech generation model based on excitation signals, the excitation signals respectively impact the LTP filter and the LPC filter. For aperiodic signals such as unvoiced sound, the excitation signal will only impact the LPC filter.
The solutions provided in the embodiments of this application relate to technologies such as machine learning (ML) in artificial intelligence (AI), and are specifically described by using the following embodiments. The audio signal enhancement method provided by this application is performed by a computer device, and can be specifically applied to an application environment shown in
The terminal 202 may be, but is not limited to, any of various personal computers, laptops, smartphones, tablets and portable wearable devices. The server 204 may be an independent physical server, or may be a server cluster including a plurality of physical servers or a distributed system, or may be a cloud server providing basic cloud computing services, such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a content delivery network (CDN), big data, and an artificial intelligence platform.
In an embodiment, as shown in
S302: Decode received speech packets sequentially to obtain a residual signal, long term filtering parameters and linear filtering parameters; and filter the residual signal to obtain an audio signal.
The received speech packets may be speech packets in an anti-packet loss scenario based on feedforward error correction (FEC).
Feedforward error correction is an error control technique. Before a signal is sent to the transmission channel, it is encoded in advance according to a certain algorithm to add redundant codes carrying characteristics of the signal; at the receiving end, the received signal is decoded according to the corresponding algorithm to find out and correct the error codes generated in the transmission process.
Redundant codes may also be called redundant information. In the embodiment of this application, with reference to
Specifically, when receiving the speech packet, the terminal stores the received speech packet in a cache, fetches the speech packet corresponding to the speech frame to be played from the cache, and decodes and filters the speech packet to obtain the audio signal. When the speech packet is a packet adjacent to the historical speech packet decoded at the previous moment and the historical speech packet decoded at the previous moment has no anomalies, the obtained audio signal is directly outputted, or the audio signal is enhanced to obtain an enhanced speech signal and the enhanced speech signal is outputted. When the speech packet is not the packet adjacent to the historical speech packet decoded at the previous moment, or when the speech packet is the packet adjacent to the historical speech packet decoded at the previous moment but the historical speech packet decoded at the previous moment has anomalies, the audio signal is enhanced to obtain an enhanced speech signal and the enhanced speech signal is outputted. The enhanced speech signal carries the audio signal corresponding to the packet adjacent to the historical speech packet decoded at the previous moment.
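The dispatch logic above can be sketched as follows. This is a minimal illustration, not code from this application: `Packet`, its `seq` field, `decode_and_filter`, and `enhance` are hypothetical stand-ins.

```python
from collections import namedtuple

# Hypothetical packet carrying only a sequence number for this sketch.
Packet = namedtuple("Packet", "seq")

def handle_packet(packet, prev_seq, prev_ok, decode_and_filter, enhance):
    """Output the decoded audio directly only when this packet immediately
    follows the previously decoded packet and that packet had no anomalies;
    otherwise take the enhancement (FEC) path before output."""
    audio = decode_and_filter(packet)
    adjacent = (packet.seq == prev_seq + 1)
    if adjacent and prev_ok:
        return audio          # normal path: play as-is (or optionally enhance)
    return enhance(audio)     # FEC path: enhance to cover the lost/damaged frame
```

In a real decoder the previous-packet state would come from the jitter buffer/cache mentioned above; here it is passed in explicitly to keep the sketch self-contained.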
The decoding may specifically be entropy decoding, which is a decoding solution corresponding to entropy encoding. Specifically, when the sending end encodes the audio signal, the audio signal may be encoded by the entropy encoding solution to obtain a speech packet. Thereby, when the receiving end receives the speech packet, the speech packet may be decoded by the entropy encoding solution.
In one embodiment, when receiving the speech packet, the terminal decodes the received speech packet to obtain a residual signal and filter parameters, and performs signal synthesis filtering on the residual signal based on the filter parameters to obtain the audio signal. The filter parameters include long term filtering parameters and linear filtering parameters.
Specifically, when encoding the current frame audio signal, the sending end analyzes the previous frame audio signal to obtain filter parameters, configures parameters of the filters based on the obtained filter parameters, performs analysis filtering on the current frame audio signal through the configured filters to obtain a residual signal of the current frame audio signal, encodes the audio signal by using the residual signal and the filter parameters obtained by analysis to obtain a speech packet, and sends the speech packet to the receiving end. Thereby, after receiving the speech packet, the receiving end decodes the received speech packet to obtain the residual signal and the filter parameters, and performs signal synthesis filtering on the residual signal based on the filter parameters to obtain the audio signal.
In one embodiment, the filter parameters include a linear filtering parameter and a long term filtering parameter. When encoding the current frame audio signal, the sending end analyzes the previous frame audio signal to obtain linear filtering parameters and long term filtering parameters, performs linear analysis filtering on the current frame audio signal based on the linear filtering parameters to obtain a linear filtering excitation signal, then performs long term analysis filtering on the linear filtering excitation signal based on the long term filtering parameters to obtain the residual signal corresponding to the current frame audio signal, encodes the current frame audio signal based on the residual signal and the linear filtering parameters and long term filtering parameters obtained by analysis to obtain a speech packet, and sends the speech packet to the receiving end.
Specifically, the performing the linear analysis filtering on the current frame audio signal based on the linear filtering parameters specifically includes: configuring parameters of linear predictive coding filters based on the linear filtering parameters, and performing linear analysis filtering on the audio signal by the parameter-configured linear predictive coding filters to obtain a linear filtering excitation signal. The linear filtering parameters include a linear filtering coefficient and an energy gain value. The linear filtering coefficient may be denoted as LPC AR, and the energy gain value may be denoted as LPC gain. The formula of the linear predictive coding filter is as follows:
e(n) = s(n) − Σ_{i=1}^{p} a_i · s_adj(n−i)  (1)

In the formula above, e(n) is the linear filtering excitation signal corresponding to the current frame audio signal, s(n) is the current frame audio signal, p is the number of sampling points included in each frame audio signal, a_i is the linear filtering coefficient obtained by analyzing the previous frame audio signal, and s_adj(n−i) is the energy-adjusted state of the previous frame audio signal s(n−i) of the current frame audio signal s(n). s_adj(n−i) may be obtained by the following formula:
s_adj(n−i) = gain_adj · s(n−i)  (2)
In the formula above, s(n−i) is the previous frame audio signal of the current frame audio signal s(n), and gain_adj is the energy adjustment parameter of the previous frame audio signal s(n−i). gain_adj may be obtained by the following formula:

gain_adj = gain(n)/gain(n−i)  (3)

In the formula above, gain(n) is the energy gain value corresponding to the current frame audio signal, and gain(n−i) is the energy gain value corresponding to the previous frame audio signal.
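The linear analysis filtering with energy adjustment described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the application's implementation: it assumes the energy adjustment applies only to the previous-frame memory, and all names are hypothetical.

```python
import numpy as np

def lpc_analysis_filter(s, s_prev, a, gain_cur, gain_prev):
    """e(n) = s(n) - sum_{i=1..p} a_i * s_adj(n-i), where the previous frame
    is rescaled by gain_adj = gain(n)/gain(n-i) before use as filter memory."""
    gain_adj = gain_cur / gain_prev              # energy adjustment parameter
    history = np.concatenate([gain_adj * s_prev, s])
    off = len(s_prev)
    p = len(a)
    e = np.empty_like(s, dtype=float)
    for n in range(len(s)):
        pred = sum(a[i - 1] * history[off + n - i] for i in range(1, p + 1))
        e[n] = s[n] - pred                       # residual after short-term prediction
    return e
```

For look-backs that stay inside the current frame, the unadjusted samples s(n−i) of the current frame are used; only samples reaching into the previous frame are energy-adjusted.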
The performing the long term analysis filtering on the linear filtering excitation signal based on the long term filtering parameters specifically includes: configuring parameters of the long term prediction filter based on the long term filtering parameters, and performing long term analysis filtering on the linear filtering excitation signal by the parameter-configured long term prediction filter to obtain the residual signal corresponding to the current frame audio signal. The long term filtering parameters include a pitch period and a corresponding magnitude gain value. The pitch period may be denoted as LTP pitch, and the corresponding magnitude gain value may be denoted as LTP gain. The long term prediction filter is expressed in the frequency domain (the Z domain) as follows:
p(z) = 1 − γz^(−T)  (4)
In the formula above, p(z) is the transfer function of the long term prediction filter, z is the complex variable of the Z-transform, γ is the magnitude gain value LTP gain, and T is the pitch period LTP pitch.
The time domain of the long term prediction filter is expressed as follows:
δ(n)=e(n)−γe(n−T) (5)
δ(n) is the residual signal corresponding to the current frame audio signal, e(n) is the linear filtering excitation signal corresponding to the current frame audio signal, γ is the magnitude gain value LTP gain, T is the pitch period LTP pitch, and e(n−T) is the linear filtering excitation signal corresponding to the audio signal of the previous pitch period of the current frame audio signal.
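Formula (5) can be sketched directly in numpy. This is an illustrative sketch with hypothetical names; `e_hist` is a buffer assumed to hold at least T past excitation samples (one pitch period of memory).

```python
import numpy as np

def ltp_analysis_filter(e, e_hist, gamma, T):
    """Formula (5): delta(n) = e(n) - gamma * e(n - T), where look-backs
    beyond the frame start fall into the history buffer e_hist."""
    buf = np.concatenate([e_hist, e])
    off = len(e_hist)
    return np.array([buf[off + n] - gamma * buf[off + n - T]
                     for n in range(len(e))])
```

For an aperiodic (unvoiced) frame the decoder would skip this step, consistent with the excitation signal only impacting the LPC filter in that case.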
In one embodiment, the filter parameters decoded by the terminal include long term filtering parameters and linear filtering parameters, and the signal synthesis filtering includes long term synthesis filtering based on the long term filtering parameters and linear synthesis filtering based on the linear filtering parameters. After decoding the speech packet to obtain the residual signal, the long term filtering parameters and the linear filtering parameters, the terminal performs long term synthesis filtering on the residual signal based on the long term filtering parameters to obtain the long term filtering excitation signal, and then performs linear synthesis filtering on the long term filtering excitation signal based on the linear filtering parameters to obtain the audio signal.
In one embodiment, after obtaining the residual signal, the terminal splits the obtained residual signal into a plurality of subframes to obtain a plurality of sub-residual signals, performs long term synthesis filtering respectively on each sub-residual signal based on the corresponding long term filtering parameters to obtain a long term filtering excitation signal corresponding to each subframe, and then combines the long term filtering excitation signals corresponding to the subframes in a chronological order of the subframes to obtain the corresponding long term filtering excitation signal.
For example, when a speech packet corresponds to a 20 ms audio signal, that is, the obtained residual signal has a frame length of 20 ms, the residual signal may be split into 4 subframes to obtain four 5 ms sub-residual signals, long term synthesis filtering may be performed on each 5 ms sub-residual signal respectively based on the corresponding long term filtering parameters to obtain four 5 ms long term filtering excitation signals, and the four 5 ms long term filtering excitation signals may be combined in a chronological order of the subframes to obtain one 20 ms long term filtering excitation signal.
In one embodiment, after obtaining the long term filtering excitation signal, the terminal splits the obtained long term filtering excitation signal into a plurality of subframes to obtain a plurality of sub-long term filtering excitation signals, performs linear synthesis filtering respectively on each sub-long term filtering excitation signal based on the corresponding linear filtering parameters to obtain a sub-audio signal corresponding to each subframe, and then combines the sub-audio signals corresponding to the subframes in a chronological order of the subframes to obtain the corresponding audio signal.
For example, when a speech packet corresponds to a 20 ms audio signal, that is, the obtained long term filtering excitation signal has a frame length of 20 ms, the long term filtering excitation signal may be split into two subframes to obtain two 10 ms sub-long term filtering excitation signals, linear synthesis filtering may be performed on each 10 ms sub-long term filtering excitation signal respectively based on the corresponding linear filtering parameters to obtain two 10 ms sub-audio signals, and then the two 10 ms sub-audio signals may be combined in a chronological order of the subframes to obtain one 20 ms audio signal.
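The split/filter/recombine pattern in the two examples above can be sketched generically. This is an illustrative sketch assuming a 16 kHz sampling rate (so 20 ms = 320 samples); `sub_filter` is a hypothetical stand-in for the per-subframe synthesis filter and its parameters.

```python
import numpy as np

def filter_per_subframe(frame, n_sub, sub_filter):
    """Split a frame into n_sub equal subframes, filter each one (a real
    decoder would use per-subframe parameters here), and re-join the
    results in chronological order."""
    subframes = np.split(frame, n_sub)
    return np.concatenate([sub_filter(sf) for sf in subframes])

# A 20 ms frame at 16 kHz is 320 samples; 4 subframes of 5 ms = 80 samples each.
frame = np.arange(320, dtype=float)
out = filter_per_subframe(frame, 4, lambda sf: sf)   # identity "filter"
```

The same helper covers both cases above: n_sub = 4 for the 5 ms long term synthesis subframes and n_sub = 2 for the 10 ms linear synthesis subframes.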
S304: Extract, when the audio signal is a feedforward error correction frame signal, feature parameters from the audio signal.
That the audio signal is a feedforward error correction frame signal means that the audio signal of the historically adjacent frame of the audio signal has anomalies. The audio signal of the historically adjacent frame having anomalies specifically includes: the speech packet corresponding to the audio signal of the historically adjacent frame is not received, or the received speech packet corresponding to the audio signal of the historically adjacent frame is not decoded normally. The feature parameters include a cepstrum feature parameter.
In one embodiment, after decoding and filtering the received speech packet to obtain the audio signal, the terminal determines whether a historical speech packet decoded before the speech packet is decoded has data anomalies, and determines, when the decoded historical speech packet has data anomalies, that the current audio signal obtained after the decoding and the filtering is the feedforward error correction frame signal.
Specifically, the terminal determines whether a historical audio signal corresponding to the historical speech packet decoded at the previous moment before the speech packet is decoded is a previous frame audio signal of the audio signal obtained by decoding the speech packet, and if so, determines that the historical speech packet has no data anomalies, and if not, determines that the historical speech packet has data anomalies.
In this embodiment, the terminal determines whether the current audio signal obtained by decoding and filtering is the feedforward error correction frame signal by determining whether the historical speech packet decoded before the current speech packet is decoded has data anomalies, and thereby can, if the audio signal is the feedforward error correction frame signal, enhance the audio signal to further improve the quality of the audio signal.
In one embodiment, when the audio signal obtained by decoding is the feedforward error correction frame signal, feature parameters are extracted from the audio signal obtained by decoding. The feature parameters extracted may specifically be a cepstrum feature parameter. This process specifically includes the following steps: performing Fourier transform on the audio signal to obtain a Fourier-transformed audio signal; performing logarithm processing on the Fourier-transformed audio signal to obtain a logarithm result; and performing inverse Fourier transform on the obtained logarithm result to obtain the cepstrum feature parameter. Specifically, the cepstrum feature parameter may be extracted from the audio signal according to the following formula:
C(n) = F^(−1)(log|S(F)|)  (6)

In the formula above, C(n) is the cepstrum feature parameter of the audio signal S(n) obtained by decoding and filtering, and S(F) is the Fourier-transformed audio signal obtained by performing Fourier transform on the audio signal S(n).
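The Fourier transform → logarithm → inverse Fourier transform chain described above can be sketched as follows. This is a minimal sketch: the small `eps` added before the logarithm (to guard against log(0)) is an implementation detail not mentioned in this application.

```python
import numpy as np

def cepstrum(s, eps=1e-9):
    """Cepstrum feature: IFFT of the log-magnitude spectrum. The real part
    is kept because the log-magnitude spectrum of a real signal is symmetric."""
    spectrum = np.fft.fft(s)                 # Fourier transform of the frame
    log_mag = np.log(np.abs(spectrum) + eps) # logarithm processing
    return np.real(np.fft.ifft(log_mag))     # inverse Fourier transform
```

A useful property for checking the implementation: scaling the input signal only shifts the zeroth cepstral coefficient, since log|cX| = log c + log|X|.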
In the above embodiment, the terminal can extract the cepstrum feature parameter from the audio signal, and thereby enhance the audio signal based on the extracted cepstrum feature parameter, and improve the quality of the audio signal.
In one embodiment, when the audio signal is not a feedforward error correction frame signal, that is, when the previous frame audio signal of the current audio signal obtained by decoding and filtering has no anomalies, the feature parameters may also be extracted from the current audio signal obtained by decoding and filtering, so that the current audio signal obtained by decoding and filtering can be enhanced.
S306: Convert the audio signal into a filter speech excitation signal based on the linear filtering parameters.
Specifically, after decoding and filtering the speech packet to obtain the audio signal, the terminal may further acquire the linear filtering parameters obtained when decoding the speech packet, and perform linear analysis filtering on the obtained audio signal based on the linear filtering parameters, thereby converting the audio signal into the filter speech excitation signal.
In an embodiment, S306 specifically includes the following steps: configuring parameters of linear predictive coding filters based on the linear filtering parameters, and performing linear decomposition filtering on the audio signal by the parameter-configured linear predictive coding filters to obtain the filter speech excitation signal.
The linear decomposition filtering is also called linear analysis filtering. In the embodiment of this application, in the process of performing linear analysis filtering on the audio signal, the linear analysis filtering is performed on the audio signal of the whole frame, and there is no need to split the audio signal of the whole frame into subframes.
Specifically, the terminal may perform linear decomposition filtering on the audio signal to obtain the filter speech excitation signal according to the following formula:
D(n) = S(n) − Σ_{i=1}^{p} A_i · S_adj(n−i)  (7)

In the formula above, D(n) is the filter speech excitation signal corresponding to the audio signal S(n) obtained after decoding and filtering the speech packet, S(n) is the audio signal obtained after decoding and filtering the speech packet, S_adj(n−i) is the energy-adjusted state of the previous frame audio signal S(n−i) of the obtained audio signal S(n), p is the number of sampling points included in each frame audio signal, and A_i is the linear filtering coefficient obtained by decoding the speech packet.
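The linear decomposition filtering above and its inverse (the linear synthesis filtering used later in S310 and S604) can be sketched together to show that they are exact inverses. Illustrative names only; the energy-adjusted previous frame is passed in as `mem`.

```python
import numpy as np

def lpc_decompose(S, mem, A):
    """Analysis direction: D(n) = S(n) - sum_{i=1..p} A_i * S_adj(n-i)."""
    hist = np.concatenate([mem, S])
    off, p = len(mem), len(A)
    return np.array([S[n] - sum(A[i-1] * hist[off+n-i] for i in range(1, p+1))
                     for n in range(len(S))])

def lpc_synthesize(D, mem, A):
    """Synthesis direction: S(n) = D(n) + sum_{i=1..p} A_i * S_adj(n-i),
    computed recursively so reconstructed samples feed later predictions."""
    off, p = len(mem), len(A)
    hist = np.concatenate([mem, np.zeros(len(D))])
    for n in range(len(D)):
        hist[off+n] = D[n] + sum(A[i-1] * hist[off+n-i] for i in range(1, p+1))
    return hist[off:]
```

Because the synthesis recursion mirrors the analysis sum term by term, a frame decomposed and then re-synthesized with the same memory and coefficients is recovered exactly.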
In the above embodiment, the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, and thereby can enhance the filter speech excitation signal to enhance the audio signal, and improve the quality of the audio signal.
S308: Perform speech enhancement on the filter speech excitation signal according to the feature parameters, the long term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal.
The long term filtering parameters include a pitch period and a magnitude gain value.
In one embodiment, S308 includes the following steps: performing speech enhancement on the filter speech excitation signal according to the pitch period, the magnitude gain value, the linear filtering parameters and the cepstrum feature parameter to obtain the enhanced speech excitation signal.
Specifically, the speech enhancement of the audio signal may specifically be realized by a pre-trained signal enhancement model. The signal enhancement model is a neural network (NN) model which may specifically adopt long short-term memory (LSTM) and convolutional neural network (CNN) structures.
In the above embodiment, the terminal performs speech enhancement on the filter speech excitation signal according to the pitch period, the magnitude gain value, the linear filtering parameters and the cepstrum feature parameter to obtain the enhanced speech excitation signal, and thereby can enhance the audio signal based on the enhanced speech excitation signal, and improve the quality of the audio signal.
In one embodiment, the terminal inputs the feature parameters, the long term filtering parameters, the linear filtering parameters and the filter speech excitation signal into the pre-trained signal enhancement model, so that the signal enhancement model performs speech enhancement on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal obtains the enhanced speech excitation signal by the pre-trained signal enhancement model, and thereby can enhance the audio signal based on the enhanced speech excitation signal, and improve the quality of the audio signal and the efficiency of audio signal enhancement.
In the embodiment of this application, in the process of performing speech enhancement on the filter speech excitation signal by the pre-trained signal enhancement model, the speech enhancement is performed on the filter speech excitation signal of the whole frame, and there is no need to split the filter speech excitation signal of the whole frame into subframes.
S310: Perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain an enhanced speech signal.
The speech synthesis may be linear synthesis filtering based on the linear filtering parameters.
In one embodiment, after obtaining the enhanced speech excitation signal, the terminal configures parameters of the linear predictive coding filters based on the linear filtering parameters, and performs linear synthesis filtering on the enhanced speech excitation signal by the parameter-configured linear predictive coding filters to obtain the enhanced speech signal.
The linear filtering parameters include a linear filtering coefficient and an energy gain value. The linear filtering coefficient may be denoted as LPC AR, and the energy gain value may be denoted as LPC gain. The linear synthesis filtering is an inverse process of the linear analysis filtering performed at the sending end when encoding the audio signal. Therefore, the linear predictive coding filter that performs the linear synthesis filtering is also called a linear inverse filter. The time domain of the linear predictive coding filter is expressed as follows:
S_enh(n) = D_enh(n) + Σ_{i=1}^{p} A_i · S_adj(n−i)  (8)

In the formula above, S_enh(n) is the enhanced speech signal, D_enh(n) is the enhanced speech excitation signal obtained after performing speech enhancement on the filter speech excitation signal D(n), S_adj(n−i) is the energy-adjusted state of the previous frame audio signal S(n−i) of the obtained audio signal S(n), p is the number of sampling points included in each frame audio signal, and A_i is the linear filtering coefficient obtained by decoding the speech packet.
The energy-adjusted state S_adj(n−i) of the previous frame audio signal S(n−i) of the obtained audio signal S(n) may be obtained by the following formula:

S_adj(n−i) = gain_adj · S(n−i)  (9)

In the formula above, S_adj(n−i) is the energy-adjusted state of the previous frame audio signal S(n−i), and gain_adj is the energy adjustment parameter of the previous frame audio signal S(n−i).
In this embodiment, the terminal may obtain the enhanced speech signal by performing linear synthesis filtering on the enhanced speech excitation signal to enhance the audio signal, thereby improving the quality of the audio signal.
In the embodiment of this application, in the process of speech synthesis, the speech synthesis is performed on the enhanced speech excitation signal of the whole frame, and there is no need to split the enhanced speech excitation signal of the whole frame into subframes.
According to the above audio signal enhancement method, when receiving the speech packet, the terminal sequentially decodes and filters the speech packets to obtain the audio signal; extracts, in the case that the audio signal is the feedforward error correction frame signal, the feature parameters from the audio signal; converts the audio signal into the filter speech excitation signal based on the linear filtering coefficient obtained by decoding the speech packet; performs the speech enhancement on the filter speech excitation signal according to the feature parameters and the long term filtering parameters obtained by decoding the speech packet to obtain the enhanced speech excitation signal; and performs the speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain the enhanced speech signal, to enhance the audio signal within a short time and achieve better signal enhancement effects, thereby improving the timeliness of audio signal enhancement.
In one embodiment, as shown in
S602: Configure parameters of a long term prediction filter based on the long term filtering parameters, and perform long term synthesis filtering on the residual signal by the parameter-configured long term prediction filter to obtain a long term filtering excitation signal.
The long term filtering parameters include a pitch period and a corresponding magnitude gain value. The pitch period may be denoted as LTP pitch, and the corresponding magnitude gain value may be denoted as LTP gain. The long term synthesis filtering is performed on the residual signal by the parameter-configured long term prediction filter. The long term synthesis filtering is an inverse process of the long term analysis filtering performed at the sending end when encoding the audio signal. Therefore, the long term prediction filter that performs the long term synthesis filtering is also called a long term inverse filter. That is, the long term inverse filter is used to process the residual signal. The frequency domain of the long term inverse filter corresponding to formula (4) is expressed as follows:

p^(−1)(z) = 1/(1 − γz^(−T))  (10)

In the formula above, p^(−1)(z) is the transfer function of the long term inverse filter, z is the complex variable of the Z-transform, γ is the magnitude gain value LTP gain, and T is the pitch period LTP pitch.
The time domain of the long term inverse filter corresponding to formula (10) is expressed as follows:
E(n)=γE(n−T)+δ(n) (11)
In the formula above, E(n) is the long term filtering excitation signal corresponding to the speech packet, δ(n) is the residual signal corresponding to the speech packet, γ is the magnitude gain value LTP gain, T is the pitch period LTP pitch, and E(n−T) is the long term filtering excitation signal corresponding to the audio signal of the previous pitch period of the speech packet. It can be understood that in this embodiment, the long term filtering excitation signal E(n) obtained at the receiving end by performing long term synthesis filtering on the residual signal by the long term inverse filter is the same as the linear filtering excitation signal e(n) obtained by performing linear analysis filtering on the audio signal by the linear filter during the encoding at the sending end.
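Formula (11) can be implemented as a short recursion. This is an illustrative sketch with hypothetical names; `E_hist` is assumed to hold at least T past excitation samples, so look-backs of one pitch period land either in the history buffer or in samples already reconstructed within the frame.

```python
import numpy as np

def ltp_synthesis_filter(delta, E_hist, gamma, T):
    """Formula (11): E(n) = gamma * E(n - T) + delta(n), computed
    recursively so reconstructed samples feed later look-backs."""
    off = len(E_hist)
    buf = np.concatenate([E_hist, np.zeros(len(delta))])
    for n in range(len(delta)):
        buf[off + n] = gamma * buf[off + n - T] + delta[n]
    return buf[off:]
```

Feeding the output of the analysis filter of formula (5) through this recursion with the same γ, T and history recovers the original excitation, which is the inverse relationship the text describes.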
S604: Configure parameters of linear predictive coding filters based on the linear filtering parameters, and perform linear synthesis filtering on the long term filtering excitation signal by the parameter-configured linear predictive coding filters to obtain the audio signal.
The linear filtering parameters include a linear filtering coefficient and an energy gain value. The linear filtering coefficient may be denoted as LPC AR, and the energy gain value may be denoted as LPC gain. The linear synthesis filtering is an inverse process of the linear analysis filtering performed at the sending end when encoding the audio signal. Therefore, the linear predictive coding filter that performs the linear synthesis filtering is also called a linear inverse filter. The time domain of the linear predictive coding filter is expressed as follows:
S(n)=E(n)+Σi=1p Ai·Sadj(n−i) (12)
In the formula above, S(n) is the audio signal corresponding to the speech packet, E(n) is the long term filtering excitation signal corresponding to the speech packet, Sadj(n−i) is the energy-adjusted state of the previous frame audio signal S(n−i) of the obtained audio signal S(n), p is the order of the linear prediction (the number of linear filtering coefficients), and Ai is the linear filtering coefficient obtained by decoding the speech packet.
The energy-adjusted state Sadj(n−i) of the previous frame audio signal S(n−i) of the obtained audio signal S(n) may be obtained by the following formula:
Sadj(n−i)=gainadj·S(n−i), where gainadj=gain(n)/gain(n−i) (13)
In the formula above, gainadj is the energy adjustment parameter of the previous frame audio signal S(n−i), gain(n) is the energy gain value obtained by decoding the speech packet, and gain(n−i) is the energy gain value corresponding to the previous frame audio signal.
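The linear synthesis step can be sketched in Python as a minimal illustration, assuming the energy adjustment parameter is the ratio gain(n)/gain(n−i) of the current and previous energy gain values; all names are hypothetical:

```python
import numpy as np

def linear_synthesis(excitation, lpc_coeffs, prev_frame, gain_cur, gain_prev):
    """Sketch of formulas (12)-(13):
    S(n) = E(n) + sum_{i=1..p} A_i * S_adj(n - i),
    where the previous-frame state is rescaled by gain_adj = gain(n)/gain(n-i).
    """
    p = len(lpc_coeffs)
    gain_adj = gain_cur / gain_prev                    # energy adjustment parameter
    state = list(np.asarray(prev_frame)[-p:] * gain_adj)  # energy-adjusted S_adj(n - i)
    out = []
    for e in excitation:
        # A_1 multiplies the most recent sample S(n - 1), A_p the oldest S(n - p)
        s = e + sum(a * state[-(i + 1)] for i, a in enumerate(lpc_coeffs))
        out.append(s)
        state.append(s)
    return np.array(out)
```

The previous frame must supply at least p samples of filter state; in the decoder those are the last samples of the previously reconstructed frame.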
In the above embodiment, the terminal performs the long term synthesis filtering on the residual signal based on the long term filtering parameters to obtain the long term filtering excitation signal, and performs the linear synthesis filtering on the long term filtering excitation signal based on the linear filtering parameters obtained by decoding to obtain the audio signal. The terminal can thereby directly output the audio signal when the audio signal is not the feedforward error correction frame signal, and enhance the audio signal and output the enhanced speech signal when the audio signal is the feedforward error correction frame signal, improving the timeliness of audio signal output (reducing latency).
In one embodiment, S604 specifically includes the following steps: splitting the long term filtering excitation signal into at least two subframes to obtain sub-long term filtering excitation signals; grouping the linear filtering parameters obtained by decoding to obtain at least two linear filtering parameter sets; configuring parameters of the at least two linear predictive coding filters respectively based on the linear filtering parameter sets; inputting the obtained sub-long term filtering excitation signals respectively into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform linear synthesis filtering on the sub-long term filtering excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes each; and combining the sub-audio signals in a chronological order of the subframes to obtain the audio signal.
There are two types of linear filtering parameter sets: a linear filtering coefficient set and an energy gain value set.
Specifically, when linear synthesis filtering is performed on the sub-long term filtering excitation signal corresponding to each subframe by the linear inverse filter corresponding to formula (12), in formula (12), S(n) is the sub-audio signal corresponding to any subframe, E(n) is the long term filtering excitation signal corresponding to the subframe, Sadj(n−i) is the energy-adjusted state of the previous subframe sub-audio signal S(n−i) of the obtained sub-audio signal S(n), p is the order of the linear prediction for each subframe, and Ai is the linear filtering coefficient set corresponding to the subframe. In formula (13), gainadj is the energy adjustment parameter of the previous subframe sub-audio signal of the sub-audio signal, gain(n) is the energy gain value of the sub-audio signal, and gain(n−i) is the energy gain value of the previous subframe sub-audio signal of the sub-audio signal.
In the above embodiment, the terminal splits the long term filtering excitation signal into the at least two subframes to obtain the sub-long term filtering excitation signals; groups the linear filtering parameters obtained by decoding to obtain the at least two linear filtering parameter sets; configures parameters of the at least two linear predictive coding filters respectively based on the linear filtering parameter sets; inputs the obtained sub-long term filtering excitation signals respectively into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform the linear synthesis filtering on the sub-long term filtering excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes each; and combines the sub-audio signals in the chronological order of the subframes to obtain the audio signal, thereby ensuring that the obtained audio signal is a good reproduction of the audio signal sent by the sending end and improving the quality of the reproduced audio signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value. S604 further includes the following steps: acquiring, for the sub-long term filtering excitation signal corresponding to a first subframe in the long term filtering excitation signal, the energy gain value of a historical sub-long term filtering excitation signal of the subframe in a historical long term filtering excitation signal adjacent to the sub-long term filtering excitation signal corresponding to the first subframe; determining an energy adjustment parameter corresponding to the sub-long term filtering excitation signal based on the energy gain value corresponding to the historical sub-long term filtering excitation signal and the energy gain value of the sub-long term filtering excitation signal corresponding to the first subframe; and performing energy adjustment on the historical sub-long term filtering excitation signal based on the energy adjustment parameter to obtain the energy-adjusted historical sub-long term filtering excitation signal.
The historical long term filtering excitation signal is the previous frame long term filtering excitation signal of the current frame long term filtering excitation signal, and the historical sub-long term filtering excitation signal of the subframe in the historical long term filtering excitation signal adjacent to the sub-long term filtering excitation signal corresponding to the first subframe is the sub-long term filtering excitation signal corresponding to the last subframe of the previous frame long term filtering excitation signal.
For example, when the current frame long term filtering excitation signal is split into two subframes to obtain a sub-long term filtering excitation signal corresponding to the first subframe and a sub-long term filtering excitation signal corresponding to the second subframe, the sub-long term filtering excitation signal corresponding to the second subframe of the previous frame long term filtering excitation signal and the sub-long term filtering excitation signal corresponding to the first subframe of the current frame are adjacent subframes.
In one embodiment, after obtaining the energy-adjusted historical sub-long term filtering excitation signal, the terminal inputs the obtained sub-long term filtering excitation signal and the energy-adjusted historical sub-long term filtering excitation signal into the parameter-configured linear predictive coding filter, so that the linear predictive coding filter performs linear synthesis filtering on the sub-long term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficient and the energy-adjusted historical sub-long term filtering excitation signal to obtain the sub-audio signal corresponding to the first subframe.
For example, when a speech packet corresponds to a 20 ms audio signal, that is, the obtained long term filtering excitation signal has a frame length of 20 ms, the AR coefficients obtained by decoding the speech packet are {A1, . . . Ap, Ap+1, . . . A2p−1, A2p} and the energy gain values obtained by decoding the speech packet are {gain1(n),gain2(n)}, the long term filtering excitation signal may be split into two subframes to obtain a first sub-filtering excitation signal E1(n) corresponding to the first 10 ms and a second sub-filtering excitation signal E2(n) corresponding to the last 10 ms. The AR coefficients are grouped to obtain an AR coefficient set 1 {A1, . . . Ap} and an AR coefficient set 2 {Ap+1, . . . A2p−1, A2p}. The energy gain values are grouped to obtain an energy gain value set 1 {gain1(n)} and an energy gain value set 2 {gain2(n)}. Then, the previous subframe sub-filtering excitation signal of the first sub-filtering excitation signal E1(n) is E2(n−i), the energy gain value set of the previous subframe of the first sub-filtering excitation signal E1(n) is {gain2(n−i)}, the previous subframe sub-filtering excitation signal of the second sub-filtering excitation signal E2(n) is E1(n), and the energy gain value set of the previous subframe of the second sub-filtering excitation signal E2(n) is {gain1(n)}. In this case, the sub-audio signal corresponding to the first sub-filtering excitation signal E1(n) may be calculated by substituting the corresponding parameters into formula (12) and formula (13), and the sub-audio signal corresponding to the second sub-filtering excitation signal E2(n) may be calculated by substituting the corresponding parameters into formula (12) and formula (13).
In the above embodiment, the terminal acquires, for the sub-long term filtering excitation signal corresponding to the first subframe in the long term filtering excitation signal, the energy gain value of the historical sub-long term filtering excitation signal of the subframe in the historical long term filtering excitation signal adjacent to the sub-long term filtering excitation signal corresponding to the first subframe; determines the energy adjustment parameter corresponding to the sub-long term filtering excitation signal based on the energy gain value corresponding to the historical sub-long term filtering excitation signal and the energy gain value of the sub-long term filtering excitation signal corresponding to the first subframe; performs the energy adjustment on the historical sub-long term filtering excitation signal based on the energy adjustment parameter; and inputs the obtained sub-long term filtering excitation signal and the energy-adjusted historical sub-long term filtering excitation signal into the parameter-configured linear predictive coding filter, so that the linear predictive coding filter performs the linear synthesis filtering on the sub-long term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficient and the energy-adjusted historical sub-long term filtering excitation signal to obtain the sub-audio signal corresponding to the first subframe, thereby ensuring that each obtained subframe audio signal is a good reproduction of the corresponding subframe audio signal sent by the sending end and improving the quality of the reproduced audio signal.
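The subframe splitting and parameter grouping described above can be sketched as follows; this is a minimal illustration for the two-subframe case, and the function name and argument layout are hypothetical:

```python
import numpy as np

def split_and_group(excitation, ar_coeffs, gains, num_sub=2):
    """Split one frame's long term filtering excitation signal into subframes
    and group the decoded AR coefficients and energy gain values per subframe.

    Returns a list of (sub_excitation, coefficient_set, gain_set) tuples,
    one per subframe, in chronological order.
    """
    subs = np.array_split(np.asarray(excitation), num_sub)  # e.g. 2 x 10 ms
    k = len(ar_coeffs) // num_sub                           # coefficients per set
    coeff_sets = [ar_coeffs[i * k:(i + 1) * k] for i in range(num_sub)]
    gain_sets = [gains[i] for i in range(num_sub)]
    return list(zip(subs, coeff_sets, gain_sets))
```

Each returned tuple can then be fed to a parameter-configured linear predictive coding filter, with the previous tuple's gain set serving as the historical energy gain value for the cross-subframe adjustment.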
In one embodiment, the feature parameters include a cepstrum feature parameter. S308 includes the following steps: vectorizing the cepstrum feature parameter, the long term filtering parameters and the linear filtering parameters, and concatenating the vectorization results to obtain a feature vector; inputting the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; performing feature extraction on the feature vector by the signal enhancement model to obtain a target feature vector; and enhancing the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
The signal enhancement model is a multi-level network structure, specifically including a first feature concatenation layer, a second feature concatenation layer, a first neural network layer and a second neural network layer. The target feature vector is an enhanced feature vector.
Specifically, the terminal vectorizes the cepstrum feature parameter, the long term filtering parameters and the linear filtering parameters by the first feature concatenation layer of the signal enhancement model, and concatenates the vectorization results to obtain the feature vector. The terminal then inputs the obtained feature vector into the first neural network layer of the signal enhancement model; performs feature extraction on the feature vector by the first neural network layer to obtain a primary feature vector; inputs the primary feature vector and envelope information obtained by performing Fourier transform on the linear filtering coefficient in the linear filtering parameters into the second feature concatenation layer of the signal enhancement model; inputs the concatenated primary feature vector into the second neural network layer of the signal enhancement model; performs feature extraction on the concatenated primary feature vector by the second neural network layer to obtain the target feature vector; and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
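The layer ordering described above (first concatenation, first neural network layer, second concatenation with the LPC envelope, second neural network layer) can be sketched as a forward pass. The specific layer types are not fixed by the description, so a plain dense layer with tanh activation stands in for each neural network layer, and the weight matrices w1/w2 are hypothetical:

```python
import numpy as np

def enhance_features(cepstrum, ltp_params, lpc_params, w1, w2):
    """Illustrative forward pass through the signal enhancement model."""
    # First feature concatenation layer: one combined feature vector
    feat = np.concatenate([cepstrum, ltp_params, lpc_params])
    # First neural network layer (a dense layer stands in here)
    primary = np.tanh(w1 @ feat)
    # Envelope information: magnitude of the Fourier transform of the LPC coefficients
    envelope = np.abs(np.fft.rfft(lpc_params))
    # Second feature concatenation layer
    concat = np.concatenate([primary, envelope])
    # Second neural network layer -> target (enhanced) feature vector
    return np.tanh(w2 @ concat)
```

The resulting target feature vector is what the later enhancement step applies to the filter speech excitation signal.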
In the above embodiment, the terminal vectorizes the cepstrum feature parameter, the long term filtering parameters and the linear filtering parameters, and concatenates the vectorization results to obtain the feature vector; inputs the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; performs the feature extraction on the feature vector by the signal enhancement model to obtain the target feature vector; and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal, and thereby can enhance the audio signal by the signal enhancement model, and improve the quality of the audio signal and the efficiency of audio signal enhancement.
In one embodiment, the terminal enhancing the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal includes: performing Fourier transform on the filter speech excitation signal to obtain a frequency domain speech excitation signal; enhancing the magnitude feature of the frequency domain speech excitation signal based on the target feature vector; and performing inverse Fourier transform on the frequency domain speech excitation signal with the enhanced magnitude feature to obtain the enhanced speech excitation signal.
Specifically, the terminal performs Fourier transform on the filter speech excitation signal to obtain the frequency domain speech excitation signal; enhances the magnitude feature of the frequency domain speech excitation signal based on the target feature vector; and performs, in combination with phase features of the non-enhanced frequency domain speech excitation signal, inverse Fourier transform on the frequency domain speech excitation signal with the enhanced magnitude feature to obtain the enhanced speech excitation signal.
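The magnitude-only enhancement with unchanged phase can be sketched as follows; the gain vector here stands in for whatever magnitude scaling the target feature vector encodes, and the names are illustrative:

```python
import numpy as np

def enhance_magnitude(excitation, mag_gain):
    """Enhance only the magnitude of the excitation spectrum, keep its phase."""
    spectrum = np.fft.rfft(excitation)
    magnitude = np.abs(spectrum) * mag_gain   # enhanced magnitude feature
    phase = np.angle(spectrum)                # phase of the non-enhanced signal
    # Recombine enhanced magnitude with the original phase and invert
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(excitation))
```

With a unit gain the round trip reproduces the input exactly, which confirms that the phase information is left untouched by this construction.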
As shown in
In the above embodiment, the terminal performs the Fourier transform on the filter speech excitation signal to obtain the frequency domain speech excitation signal; enhances the magnitude feature of the frequency domain speech excitation signal based on the target feature vector; and performs the inverse Fourier transform on the frequency domain speech excitation signal with the enhanced magnitude feature to obtain the enhanced speech excitation signal, and thereby can enhance the audio signal on the premise of keeping phase information of the audio signal unchanged, and improve the quality of the audio signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value. The configuring, by the terminal, of the parameters of the linear predictive coding filters based on the linear filtering parameters, and the performing of the linear synthesis filtering on the enhanced speech excitation signal by the parameter-configured linear predictive coding filters include: configuring parameters of the linear predictive coding filter based on the linear filtering coefficient; acquiring the energy gain value corresponding to a historical speech packet decoded prior to decoding the speech packet; determining the energy adjustment parameter based on the energy gain value corresponding to the historical speech packet and the energy gain value corresponding to the speech packet; performing energy adjustment on the historical long term filtering excitation signal corresponding to the historical speech packet based on the energy adjustment parameter to obtain the adjusted historical long term filtering excitation signal; and inputting the adjusted historical long term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long term filtering excitation signal.
The historical audio signal corresponding to the historical speech packet is the previous frame audio signal of the current frame audio signal corresponding to the current speech packet. The energy gain value corresponding to the historical speech packet may be the energy gain value corresponding to the whole frame audio signal of the historical speech packet, or the energy gain value corresponding to a subframe audio signal of the historical speech packet.
Specifically, when the audio signal is not a feedforward error correction frame signal, that is, when the previous frame audio signal of the current frame audio signal is obtained by the terminal normally decoding the historical speech packet, the energy gain value of the historical speech packet obtained when the terminal decodes the historical speech packet can be acquired, and the energy adjustment parameter can be determined based on the energy gain value of the historical speech packet. When the audio signal is a feedforward error correction frame signal, that is, when the previous frame audio signal of the current frame audio signal is not obtained by the terminal normally decoding the historical speech packet, a compensation energy gain value corresponding to the previous frame audio signal is determined based on a preset energy gain compensation mechanism, and the compensation energy gain value is taken as the energy gain value of the historical speech packet, so that the energy adjustment parameter is determined based on the energy gain value of the historical speech packet.
In one embodiment, when the audio signal is not the feedforward error correction frame signal, the energy adjustment parameter gainadj of the previous frame audio signal S(n−i) may be obtained by the following formula:
gainadj=gain(n)/gain(n−i) (14)
In the formula above, gainadj is the energy adjustment parameter of the previous frame audio signal S(n−i), gain(n−i) is the energy gain value of the previous frame audio signal S(n−i), and gain(n) is the energy gain value of the current frame audio signal. Formula (14) is used to calculate the energy adjustment parameter based on the energy gain value corresponding to the whole frame audio signal of the historical speech packet.
In one embodiment, when the audio signal is not the feedforward error correction frame signal, the energy adjustment parameter gainadj of the previous frame audio signal S(n−i) may also be obtained by the following formula:
gainadj=({gain1(n)+ . . . +gainm(n)}/m)/gainm(n−i) (15)
In the formula above, gainadj is the energy adjustment parameter of the previous frame audio signal S(n−i), gainm(n−i) is the energy gain value of the mth subframe of the previous frame audio signal S(n−i), gainm(n) is the energy gain value of the mth subframe of the current frame audio signal, m is the number of subframes corresponding to each frame audio signal, and {gain1(n)+ . . . +gainm(n)}/m is the energy gain value of the current frame audio signal. Formula (15) is used to calculate the energy adjustment parameter based on the energy gain value corresponding to the subframe audio signal of the historical speech packet.
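Both ways of computing the energy adjustment parameter can be sketched as small helper functions. The whole-frame variant is the plain ratio of current to previous energy gain values; the subframe variant shown here is one plausible reading of the description, averaging the current frame's subframe gains and dividing by the previous frame's last-subframe gain. Names are hypothetical:

```python
def gain_adj_whole_frame(gain_cur, gain_prev):
    """Whole-frame variant: ratio of the current frame's energy gain value
    to the previous frame's energy gain value."""
    return gain_cur / gain_prev

def gain_adj_subframe(subframe_gains_cur, gain_prev_last_subframe):
    """Subframe variant (one plausible reading): average the current frame's
    subframe energy gains, then divide by the previous frame's
    last-subframe energy gain."""
    avg = sum(subframe_gains_cur) / len(subframe_gains_cur)
    return avg / gain_prev_last_subframe
```

In either case the resulting gainadj rescales the previous frame's filter state before it seeds the linear synthesis filtering of the current frame.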
In the above embodiment, the terminal configures parameters of the linear predictive coding filter based on the linear filtering coefficient; acquires the energy gain value corresponding to the historical speech packet decoded before the speech packet is decoded; determines the energy adjustment parameter based on the energy gain value corresponding to the historical speech packet and the energy gain value corresponding to the speech packet; performs the energy adjustment on the historical long term filtering excitation signal corresponding to the historical speech packet based on the energy adjustment parameter to obtain the adjusted historical long term filtering excitation signal; and inputs the adjusted historical long term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform the linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long term filtering excitation signal, so that the audio signals of different frames can be smoothed, thereby improving the quality of the speech formed by the audio signals of different frames.
In an embodiment, as shown in
S902: Decode a speech packet to obtain a residual signal, long term filtering parameters and linear filtering parameters.
S904: Configure parameters of a long term prediction filter based on the long term filtering parameters, and perform long term synthesis filtering on the residual signal by the parameter-configured long term prediction filter to obtain a long term filtering excitation signal.
S906: Split the long term filtering excitation signal into at least two subframes to obtain sub-long term filtering excitation signals.
S908: Group the linear filtering parameters to obtain the at least two linear filtering parameter sets.
S910: Configure parameters of the at least two linear predictive coding filters respectively based on the linear filtering parameter sets.
S912: Input the obtained sub-long term filtering excitation signals respectively into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform linear synthesis filtering on the sub-long term filtering excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes each.
S914: Combine the sub-audio signals in a chronological order of the subframes to obtain the audio signal.
S916: Determine whether a historical speech packet decoded before the speech packet is decoded has data anomalies.
S918: Determine, when the historical speech packet has data anomalies, that the audio signal obtained after the decoding and the filtering is a feedforward error correction frame signal.
S920: Perform, when the audio signal is the feedforward error correction frame signal, Fourier transform on the audio signal to obtain a Fourier-transformed audio signal; perform logarithm processing on the Fourier-transformed audio signal to obtain a logarithm result; and perform inverse Fourier transform on the logarithm result to obtain the cepstrum feature parameter.
S922: Configure parameters of linear predictive coding filters based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal by the parameter-configured linear predictive coding filters to obtain a filter speech excitation signal.
S924: Input the feature parameters, the long term filtering parameters, the linear filtering parameters and the filter speech excitation signal into a pre-trained signal enhancement model such that the signal enhancement model performs speech enhancement on the filter speech excitation signal based on the feature parameters to obtain an enhanced speech excitation signal.
S926: Configure parameters of linear predictive coding filters based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal by the parameter-configured linear predictive coding filters to obtain an enhanced speech signal.
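The cepstrum extraction in S920 (Fourier transform, logarithm, inverse Fourier transform) can be sketched as follows; this is a minimal illustration assuming the logarithm is taken on the spectral magnitude, with a small constant guarding against log of zero, and the names are illustrative:

```python
import numpy as np

def cepstrum_features(frame, eps=1e-9):
    """Sketch of S920: FFT -> log magnitude -> inverse FFT (real cepstrum)."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + eps)  # eps guards log(0)
    return np.fft.irfft(log_mag, n=len(frame))
```

The returned vector is the cepstrum feature parameter fed, together with the long term and linear filtering parameters, into the signal enhancement model in S924.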
This application further provides an application scenario, and the above audio signal enhancement method is applied to the application scenario. Specifically, the audio signal enhancement method is applied to the application scenario as follows:
Taking an Fs=16000 Hz wideband signal as an example, it can be understood that this application is also applicable to scenarios with other sampling rates, such as Fs=8000 Hz, 32000 Hz or 48000 Hz. The frame length of the audio signal is set to 20 ms. For Fs=16000 Hz, this is equivalent to each frame containing 320 sample points. With reference to
It should be understood that steps in flowcharts of
In an embodiment, as shown in
The speech packet processing module 1102 is configured to decode received speech packets sequentially to obtain a residual signal, long term filtering parameters and linear filtering parameters; and filter the residual signal to obtain an audio signal.
The feature parameter extraction module 1104 is configured to extract, when the audio signal is a feedforward error correction frame signal, feature parameters from the audio signal.
The signal conversion module 1106 is configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters.
The speech enhancement module 1108 is configured to perform speech enhancement on the filter speech excitation signal according to the feature parameters, the long term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal.
The speech synthesis module 1110 is configured to perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain an enhanced speech signal.
In the above embodiment, the computer device sequentially decodes the received speech packets to obtain the residual signal, the long term filtering parameters and the linear filtering parameters; filters the residual signal to obtain the audio signal; extracts, in the case that the audio signal is the feedforward error correction frame signal, the feature parameters from the audio signal; converts the audio signal into the filter speech excitation signal based on the linear filtering coefficient obtained by decoding the speech packet; performs the speech enhancement on the filter speech excitation signal according to the feature parameters and the long term filtering parameters obtained by decoding the speech packet to obtain the enhanced speech excitation signal; and performs the speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain the enhanced speech signal, to enhance the audio signal within a short time and achieve better signal enhancement effects, thereby improving the timeliness of audio signal enhancement.
In one embodiment, the speech packet processing module 1102 is further configured to: configure parameters of a long term prediction filter based on the long term filtering parameters, and perform long term synthesis filtering on the residual signal by the parameter-configured long term prediction filter to obtain a long term filtering excitation signal; and configure parameters of linear predictive coding filters based on the linear filtering parameters, and perform linear synthesis filtering on the long term filtering excitation signal by the parameter-configured linear predictive coding filters to obtain the audio signal.
In the above embodiment, the terminal performs the long term synthesis filtering on the residual signal based on the long term filtering parameters to obtain the long term filtering excitation signal; and performs the linear synthesis filtering on the long term filtering excitation signal based on the linear filtering parameters obtained by decoding to obtain the audio signal, and thereby can directly output the audio signal when the audio signal is not the feedforward error correction frame signal, and enhance the audio signal and output the enhanced speech signal when the audio signal is the feedforward error correction frame signal, and improve the timeliness of audio signal outputting.
In one embodiment, the speech packet processing module 1102 is further configured to: split the long term filtering excitation signal into at least two subframes to obtain sub-long term filtering excitation signals; group the linear filtering parameters to obtain at least two linear filtering parameter sets; configure parameters of the at least two linear predictive coding filters respectively based on the linear filtering parameter sets; input the obtained sub-long term filtering excitation signals respectively into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform linear synthesis filtering on the sub-long term filtering excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes each; and combine the sub-audio signals in a chronological order of the subframes to obtain the audio signal.
In the above embodiment, the terminal splits the long term filtering excitation signal into the at least two subframes to obtain the sub-long term filtering excitation signals; groups the linear filtering parameters to obtain the at least two linear filtering parameter sets; configures parameters of the at least two linear predictive coding filters respectively based on the linear filtering parameter sets; inputs the obtained sub-long term filtering excitation signals respectively into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform linear synthesis filtering on the sub-long term filtering excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes each; and combines the sub-audio signals in the chronological order of the subframes to obtain the audio signal, thereby ensuring that the obtained audio signal is a good reproduction of the audio signal sent by the sending end and improving the quality of the reproduced audio signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value. The speech packet processing module 1102 is further configured to: acquire, for the sub-long term filtering excitation signal corresponding to a first subframe in the long term filtering excitation signal, the energy gain value corresponding to a historical sub-long term filtering excitation signal of the subframe in a historical long term filtering excitation signal adjacent to the sub-long term filtering excitation signal corresponding to the first subframe; determine an energy adjustment parameter corresponding to the sub-long term filtering excitation signal based on the energy gain value corresponding to the historical sub-long term filtering excitation signal and the energy gain value of the sub-long term filtering excitation signal corresponding to the first subframe; perform energy adjustment on the historical sub-long term filtering excitation signal based on the energy adjustment parameter; and input the obtained sub-long term filtering excitation signal and the energy-adjusted historical sub-long term filtering excitation signal obtained into the parameter-configured linear predictive coding filter such that the linear predictive coding filter performs linear synthesis filtering on the sub-long term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficient and the energy-adjusted historical sub-long term filtering excitation signal to obtain the sub-audio signal corresponding to the first subframe.
In the above embodiment, the terminal acquires, for the sub-long term filtering excitation signal corresponding to the first subframe in the long term filtering excitation signal, the energy gain value of the historical sub-long term filtering excitation signal, in the historical long term filtering excitation signal, that is adjacent to the sub-long term filtering excitation signal corresponding to the first subframe; determines the energy adjustment parameter corresponding to the sub-long term filtering excitation signal based on the energy gain value corresponding to the historical sub-long term filtering excitation signal and the energy gain value of the sub-long term filtering excitation signal corresponding to the first subframe; performs the energy adjustment on the historical sub-long term filtering excitation signal based on the energy adjustment parameter; and inputs the obtained sub-long term filtering excitation signal and the energy-adjusted historical sub-long term filtering excitation signal into the parameter-configured linear predictive coding filter, so that the linear predictive coding filter performs the linear synthesis filtering on the sub-long term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficient and the energy-adjusted historical sub-long term filtering excitation signal to obtain the sub-audio signal corresponding to the first subframe, thereby ensuring that the audio signal obtained for each subframe is a good reproduction of the corresponding subframe of the audio signal sent by the sending end and improving the quality of the reproduced audio signal.
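The per-subframe processing described above can be sketched as follows. This is a minimal illustrative reading, not the claimed implementation: the energy adjustment parameter is assumed here to be the ratio of the current and historical energy gain values, and the all-pole filter sign convention is likewise an assumption.

```python
import numpy as np

def adjust_and_synthesize(excitation, history, lpc_coeffs, gain, history_gain):
    """Energy-adjust the historical sub-excitation, then run LPC synthesis
    over [adjusted history, current subframe] and return the portion
    corresponding to the current subframe (the sub-audio signal).

    The adjustment formula (gain ratio) and the filter convention
    s[n] = e[n] - sum_k a_k * s[n-k] are illustrative assumptions.
    """
    adj = gain / max(history_gain, 1e-12)            # energy adjustment parameter
    x = np.concatenate([history * adj, excitation])  # adjusted history + subframe
    s = np.zeros(len(x))
    for n in range(len(x)):                          # all-pole synthesis 1/A(z)
        acc = x[n]
        for k, ak in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s[n] = acc
    return s[len(history):]                          # sub-audio signal of this subframe
```

The historical samples only prime the filter memory; the returned slice covers the current subframe, so adjacent subframes join without an energy discontinuity.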
In an embodiment, as shown in
In the above embodiment, the terminal determines whether the current audio signal obtained by decoding and filtering is the feedforward error correction frame signal by checking whether a historical speech packet decoded before the current speech packet has data anomalies, and can thereby, if the audio signal is the feedforward error correction frame signal, enhance the audio signal to further improve its quality.
In one embodiment, the feature parameters include a cepstrum feature parameter. The feature parameter extraction module 1104 is further configured to: perform Fourier transform on the audio signal to obtain a Fourier-transformed audio signal; perform logarithm processing on the Fourier-transformed audio signal to obtain a logarithm result; and perform inverse Fourier transform on the logarithm result to obtain the cepstrum feature parameter.
In the above embodiment, the terminal can extract the cepstrum feature parameter from the audio signal, and thereby enhance the audio signal based on the extracted cepstrum feature parameter, and improve the quality of the audio signal.
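The cepstrum computation described in the embodiment above (Fourier transform, logarithm processing, inverse Fourier transform) can be sketched as follows; the frame length and the small floor constant added before the logarithm are illustrative assumptions, not specified in the text.

```python
import numpy as np

def cepstrum(frame: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Cepstrum feature of one audio frame.

    Steps mirror the embodiment: FFT -> log of the magnitude -> inverse FFT.
    `eps` is an illustrative floor that keeps the logarithm finite.
    """
    spectrum = np.fft.fft(frame)               # Fourier transform
    log_mag = np.log(np.abs(spectrum) + eps)   # logarithm processing
    return np.fft.ifft(log_mag).real           # inverse Fourier transform

# Example on a short synthetic frame (160 samples at an assumed 8 kHz)
frame = np.sin(2 * np.pi * 100 * np.arange(160) / 8000)
c = cepstrum(frame)
```

Taking the real part after the inverse transform is valid because the log-magnitude spectrum of a real frame is symmetric, so its inverse FFT is real up to rounding error.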
In one embodiment, the long term filtering parameters include a pitch period and a magnitude gain value. The speech enhancement module 1108 is further configured to: perform speech enhancement on the filter speech excitation signal according to the pitch period, the magnitude gain value, the linear filtering parameters and the cepstrum feature parameter to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal performs speech enhancement on the filter speech excitation signal according to the pitch period, the magnitude gain value, the linear filtering parameters and the cepstrum feature parameter to obtain the enhanced speech excitation signal, and thereby can enhance the audio signal based on the enhanced speech excitation signal, and improve the quality of the audio signal.
In one embodiment, the signal conversion module 1106 is further configured to: configure parameters of linear predictive coding filters based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal by the parameter-configured linear predictive coding filters to obtain the filter speech excitation signal.
In the above embodiment, the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, and thereby can enhance the filter speech excitation signal to enhance the audio signal, and improve the quality of the audio signal.
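The linear decomposition (analysis) filtering described above, which recovers the filter speech excitation signal from the decoded audio signal, can be sketched as an FIR filter A(z) applied to the audio samples. The coefficient sign convention (A(z) = 1 + a1·z⁻¹ + …) is an illustrative assumption.

```python
import numpy as np

def lpc_analysis(audio: np.ndarray, lpc_coeffs: np.ndarray) -> np.ndarray:
    """FIR analysis filter: e[n] = s[n] + sum_k a_k * s[n-k].

    This inverts an all-pole synthesis filter 1/A(z), converting the
    decoded audio signal into the filter speech excitation signal.
    Samples before the start of the frame are taken as zero.
    """
    order = len(lpc_coeffs)
    padded = np.concatenate([np.zeros(order), audio])  # zero history
    excitation = audio.astype(float).copy()
    for k, ak in enumerate(lpc_coeffs, start=1):
        # add a_k * s[n-k] for every sample n of the frame
        excitation += ak * padded[order - k:order - k + len(audio)]
    return excitation
```

In practice the filter memory would be carried over from the previous frame rather than zeroed, but the zero-history form keeps the sketch self-contained.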
In one embodiment, the speech enhancement module 1108 is further configured to: input the feature parameters, the long term filtering parameters, the linear filtering parameters and the filter speech excitation signal into a pre-trained signal enhancement model such that the signal enhancement model performs the speech enhancement on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal obtains the enhanced speech excitation signal by the pre-trained signal enhancement model, and thereby can enhance the audio signal based on the enhanced speech excitation signal, and improve the quality of the audio signal and the efficiency of audio signal enhancement.
In one embodiment, the feature parameters include a cepstrum feature parameter. The speech enhancement module 1108 is further configured to: vectorize the cepstrum feature parameter, the long term filtering parameters and the linear filtering parameters, and concatenate the vectorization results to obtain a feature vector; input the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; perform feature extraction on the feature vector by the signal enhancement model to obtain a target feature vector; and enhance the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal vectorizes the cepstrum feature parameter, the long term filtering parameters and the linear filtering parameters, and concatenates the vectorization results to obtain the feature vector; inputs the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; performs the feature extraction on the feature vector by the signal enhancement model to obtain the target feature vector; and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal, and thereby can enhance the audio signal by the signal enhancement model, and improve the quality of the audio signal and the efficiency of audio signal enhancement.
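The vectorization and concatenation step described above can be sketched in a few lines; the ordering of the three parameter groups within the feature vector is an assumption, since the text does not fix it.

```python
import numpy as np

def build_feature_vector(cepstrum, ltp_params, lpc_params):
    """Flatten each parameter set and concatenate into one feature vector
    for input to the signal enhancement model (ordering assumed)."""
    parts = [np.asarray(p, dtype=np.float32).ravel()
             for p in (cepstrum, ltp_params, lpc_params)]
    return np.concatenate(parts)
```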
In one embodiment, the speech enhancement module 1108 is further configured to: perform Fourier transform on the filter speech excitation signal to obtain a frequency domain speech excitation signal; enhance a magnitude feature of the frequency domain speech excitation signal based on the target feature vector; and perform inverse Fourier transform on the frequency domain speech excitation signal with the enhanced magnitude feature to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal performs the Fourier transform on the filter speech excitation signal to obtain the frequency domain speech excitation signal; enhances the magnitude feature of the frequency domain speech excitation signal based on the target feature vector; and performs the inverse Fourier transform on the frequency domain speech excitation signal with the enhanced magnitude feature to obtain the enhanced speech excitation signal, and thereby can enhance the audio signal on the premise of keeping phase information of the audio signal unchanged, and improve the quality of the audio signal.
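The magnitude-only enhancement described above can be sketched as follows: the excitation signal is transformed to the frequency domain, its magnitude is scaled per bin, and the original phase is reused for the inverse transform. The per-bin `gains` vector stands in for whatever magnitude enhancement the model's target feature vector encodes, which is an assumption of this sketch.

```python
import numpy as np

def enhance_magnitude(excitation: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """Enhance only the magnitude spectrum; the phase is left unchanged.

    `gains` is an assumed per-bin magnitude scaling (len(excitation)//2 + 1
    values for a real signal).
    """
    spectrum = np.fft.rfft(excitation)        # frequency domain excitation
    magnitude = np.abs(spectrum) * gains      # enhanced magnitude feature
    phase = np.angle(spectrum)                # phase information kept unchanged
    enhanced = magnitude * np.exp(1j * phase)
    return np.fft.irfft(enhanced, n=len(excitation))
```

With unit gains the round trip returns the input, which confirms that phase information is indeed untouched by the enhancement.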
In one embodiment, the speech synthesis module 1110 is further configured to: configure parameters of linear predictive coding filters based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal by the parameter-configured linear predictive coding filters to obtain the enhanced speech signal.
In this embodiment, the terminal may obtain the enhanced speech signal by performing linear synthesis filtering on the enhanced speech excitation signal to enhance the audio signal, thereby improving the quality of the audio signal.
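The linear synthesis filtering of the enhanced speech excitation signal can be sketched as the all-pole counterpart of the analysis filter; the sign convention s[n] = e[n] − Σₖ aₖ·s[n−k] is an illustrative assumption matching the analysis sketch earlier.

```python
import numpy as np

def lpc_synthesis(excitation: np.ndarray, lpc_coeffs: np.ndarray) -> np.ndarray:
    """All-pole synthesis filter 1/A(z) applied to the enhanced
    excitation to produce the enhanced speech signal.

    Filter memory before the frame start is taken as zero.
    """
    order = len(lpc_coeffs)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, order + 1):
            if n - k >= 0:
                acc -= lpc_coeffs[k - 1] * s[n - k]
        s[n] = acc
    return s
```

A production decoder would vectorize this recursion or use an optimized filter routine, but the sample-by-sample loop makes the pole feedback explicit.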
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value. The speech synthesis module 1110 is further configured to: configure parameters of the linear predictive coding filter based on the linear filtering coefficient; acquire an energy gain value corresponding to a historical speech packet decoded before the speech packet is decoded; determine an energy adjustment parameter based on the energy gain value corresponding to the historical speech packet and the energy gain value corresponding to the speech packet; perform energy adjustment on a historical long term filtering excitation signal corresponding to the historical speech packet based on the energy adjustment parameter to obtain an adjusted historical long term filtering excitation signal; and input the adjusted historical long term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long term filtering excitation signal.
In the above embodiment, the terminal configures parameters of the linear predictive coding filter based on the linear filtering coefficient; acquires the energy gain value corresponding to the historical speech packet decoded before the speech packet is decoded; determines the energy adjustment parameter based on the energy gain value corresponding to the historical speech packet and the energy gain value corresponding to the speech packet; performs the energy adjustment on the historical long term filtering excitation signal corresponding to the historical speech packet based on the energy adjustment parameter to obtain the adjusted historical long term filtering excitation signal; and inputs the adjusted historical long term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear predictive coding filters such that the linear predictive coding filters perform the linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long term filtering excitation signal, whereby the audio signals of different frames can be smoothed, thereby improving the quality of the speech formed by the audio signals of different frames.
For a specific limitation on the audio signal enhancement apparatus, refer to the limitation on the audio signal enhancement method above. Details are not described herein again. The modules in the foregoing audio signal enhancement apparatus may be implemented entirely or partially by software, hardware, or a combination thereof. The foregoing modules may be built in or independent of a processor of a computer device in a hardware form, or may be stored in a memory of the computer device in a software form, so that the processor invokes and performs an operation corresponding to each of the foregoing modules.
In an embodiment, a computer device is provided. The computer device may be a server, and an internal structure diagram thereof may be shown in
In an embodiment, a computer device is provided. The computer device may be a terminal, and an internal structure diagram thereof may be shown in
A person skilled in the art may understand that the structure shown in
In an embodiment, a computer device is further provided, including a memory and a processor, the memory storing a computer program, when executed by the processor, causing the processor to perform the steps in the foregoing method embodiments.
In an embodiment, a computer-readable storage medium is provided, storing a computer program, the computer program, when executed by a processor, implementing the steps in the foregoing method embodiments.
In an embodiment, a computer program product or a computer program is provided. The computer program product or the computer program includes computer instructions, the computer instructions being stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, to cause the computer device to perform the steps in the above method embodiments.
A person of ordinary skill in the art may understand that all or some of procedures of the method in the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium. When the computer program is executed, the procedures of the foregoing method embodiments may be implemented. Any reference to a memory, a storage, a database, or another medium used in the embodiments provided in this application may include at least one of a non-volatile memory and a volatile memory. The non-volatile memory may include a read-only memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, and the like. The volatile memory may include a random access memory (RAM) or an external cache. For the purpose of description instead of limitation, the RAM is available in a plurality of forms, such as a static RAM (SRAM) or a dynamic RAM (DRAM).
The technical features in the foregoing embodiments may be combined in different manners to form other embodiments. For concise description, not all possible combinations of the technical features in the foregoing embodiments are described. However, provided that combinations of the technical features do not conflict with each other, the combinations of the technical features are considered as falling within the scope recorded in this specification.
The foregoing embodiments only describe several implementations of this application specifically and in detail, but cannot be construed as a limitation to the patent scope of this application. A person of ordinary skill in the art may make various changes and improvements without departing from the ideas of this application, which shall all fall within the protection scope of this application. Therefore, the protection scope of this patent application is subject to the protection scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
202110484196.6 | Apr 2021 | CN | national |
This application is a continuation of PCT Application No. PCT/CN2022/086960, filed on Apr. 15, 2022, which claims priority to Chinese Patent Application No. 2021104841966, filed with the Chinese Patent Office on Apr. 30, 2021 and entitled “AUDIO SIGNAL ENHANCEMENT METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM.” The two applications are both incorporated herein by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/CN2022/086960 | Apr 2022 | US |
Child | 18076116 | US |