The present invention relates to the field of signal processing, and more particularly to a signal processing method, a signal processing apparatus and a voice decoder.
In a real-time voice communication system, such as a VoIP (Voice over IP) system, voice data must be transmitted promptly and reliably. However, because of the inherent unreliability of the network, a data packet may be dropped during transmission from the transmitter to the receiver, or may not arrive at the destination in time. Both situations are regarded as network packet loss by the receiver. Network packet loss is unavoidable and is one of the principal factors degrading the quality of voice communication. Therefore, a real-time voice communication system needs an effective packet loss concealment method to restore lost data packets and maintain good voice quality when network packet loss occurs.
In prior real-time voice communication technologies, at the transmitter, a coder divides a wideband voice signal into two sub-bands, a high band and a low band, encodes the two sub-bands separately using Adaptive Differential Pulse Code Modulation (ADPCM), and sends the two encoded sub-bands to the receiver via the network. At the receiver, the two sub-bands are decoded by respective ADPCM decoders and synthesized into a final signal by a Quadrature Mirror Filter (QMF).
Different Packet Loss Concealment (PLC) methods are used for the two sub-bands. For the low-band signal, when there is no packet loss, the reconstructed signal is not changed by cross-fading. When a packet is lost, a short-term predictor and a long-term predictor are used to analyze the past signal (in the present application, the past signal means the voice signal before the lost frame), and voice class information is extracted. The signal of the lost frame is then reconstructed by pitch-repetition-based Linear Predictive Coding (LPC), using the predictors and the voice class information. The state of the ADPCM decoder is updated synchronously until a good frame arrives. In addition, besides the signal corresponding to the lost frame, a signal for cross-fading is also generated. Once a good frame is received, cross-fading is performed between the signal of the good frame and this cross-fading signal. It should be noted that cross-fading only happens when the receiver receives a good frame after a frame loss.
During the process of implementing the present invention, the inventor found the following problem in the prior art: the reconstructed signal of the lost frame is synthesized from the past signal, so its waveform and energy remain similar to the signal in the history buffer, namely the signal before the lost frame, even at the end of the synthesized signal, rather than to the newly decoded signal. This may cause a sudden change of waveform or energy in the synthesized signal at the joint between the lost frame and the first frame following the lost frame. The sudden change is shown in
Embodiments of the present invention provide a signal processing method adapted to process a synthesized signal in packet loss concealment, so that the waveform at the joint between a lost frame and the first frame following the lost frame in the synthesized signal has a smooth transition.
The embodiments of the present invention provide a signal processing method adapted to process a synthesized signal in packet loss concealment, including:
receiving a good frame following a lost frame, obtaining an energy ratio of the energy of a signal of the good frame to the energy of a synthesized signal corresponding to the same time of the good frame; and
adjusting the synthesized signal in accordance with the energy ratio.
The embodiments of the present invention also provide a signal processing apparatus adapted to process a synthesized signal in packet loss concealment, including:
a detecting module, configured to notify an energy obtaining module when detecting that a frame following a lost frame is a good frame;
the energy obtaining module, configured to obtain an energy ratio of the energy of the signal of the good frame to the energy of the synthesized signal corresponding to the same time of the good frame when receiving the notification sent by the detecting module; and
a synthesized signal adjustment module, configured to adjust the synthesized signal in accordance with the energy ratio obtained by the energy obtaining module.
The embodiments of the present invention also provide a voice decoder adapted to decode a voice signal, including a low-band decoding unit, a high-band decoding unit and a quadrature mirror filter unit.
The low-band decoding unit is configured to decode a received low-band signal and to compensate for a lost low-band signal frame.
The high-band decoding unit is configured to decode a received high-band signal and to compensate for a lost high-band signal frame.
The quadrature mirror filter unit is configured to synthesize the decoded low-band signal and the decoded high-band signal to obtain a final output signal.
The low-band decoding unit includes a low-band decoding sub-unit, a pitch-repetition-based linear predictive coding sub-unit, a signal processing sub-unit and a cross-fading sub-unit.
The low-band decoding sub-unit is configured to decode a received low-band code stream signal.
The pitch-repetition-based linear predictive coding sub-unit is configured to generate a synthesized signal corresponding to a lost frame.
The signal processing sub-unit is configured to receive a good frame following a lost frame, obtain an energy ratio of the energy of the signal of the good frame to the energy of the synthesized signal corresponding to the same time of the good frame, and adjust the synthesized signal in accordance with the energy ratio.
The cross-fading sub-unit is configured to cross-fade the signal decoded by the low-band decoding sub-unit and the signal after energy adjustment by the signal processing sub-unit.
The embodiments of the present invention also provide a computer program product including computer program code. When executed by a computer, the computer program code causes the computer to execute any step of the signal processing method in packet loss concealment.
The embodiments of the present invention also provide a computer readable medium storing computer program code. When executed by a computer, the computer program code causes the computer to execute any step of the signal processing method in packet loss concealment.
Compared with the prior art, the embodiments of the present invention have the following advantages:
The synthesized signal is adjusted in accordance with the ratio of the energy of the first good frame following the lost frame to the energy of the synthesized signal. This ensures that no sudden change of waveform or energy occurs in the synthesized signal at the joint between the lost frame and the first good frame following the lost frame, which realizes a smooth waveform transition and avoids music noise.
Embodiments of the present invention are described in more detail below with reference to the accompanying drawings.
A first embodiment of the present invention provides a signal processing method adapted to process a synthesized signal in packet loss concealment. As shown in
Step s101, a frame following a lost frame is detected as a good frame.
Step s102, an energy ratio of the energy of a signal of the good frame to the energy of the synchronized synthesized signal is obtained.
Step s103, the synthesized signal is adjusted in accordance with the energy ratio.
In the Step s102, the “synchronized synthesized signal” means the synthesized signal corresponding to the same time of the good frame. The “synchronized synthesized signal” that appears in other parts of the present application can be understood in the same way.
The signal processing method of the first embodiment of the present invention is described below with reference to specific application cases.
In the first embodiment of the present invention, a signal processing method is provided that is adapted to process the synthesized signal in packet loss concealment. The principal schematic diagram is shown in
In the case that a current frame is not lost, a low-band ADPCM decoder decodes the received current frame to obtain a signal xl(n), n=0, . . . , L−1, and the output corresponding to the current frame is zl(n), n=0, . . . , L−1. In this case, the reconstructed signal is not changed by cross-fading. That is:
zl[n]=xl[n], n=0, . . . ,L−1
wherein L is the frame length.
In the case that a current frame is lost, a synthesized signal yl′(n), n=0, . . . , L−1 corresponding to the current frame is generated by using the method of linear predictive coding based on pitch repetition. Different processing is executed according to whether the next frame following the current frame is lost:
When the next frame following the current frame is lost:
Under this condition, no energy scaling is executed for the synthesized signal. The output signal zl(n), n=0, . . . , L−1 corresponding to the first lost frame is the synthesized signal yl′(n), n=0, . . . , L−1, that is, zl[n]=yl[n]=yl′[n], n=0, . . . , L−1.
When the next frame following the current frame is not lost:
Suppose that, when the energy scaling is executed, the good frame being used (that is, the next frame following the first lost frame) is the frame xl(n), n=L, . . . , L+M−1 obtained after decoding by the ADPCM decoder, wherein M is the number of signal samples used when the energy is calculated. The synthesized signal being used, which corresponds to the same time as the signal of the good frame, is the signal yl′(n), n=L, . . . , L+M−1 generated by linear predictive coding based on pitch repetition. The yl′(n), n=0, . . . , L+N−1 is scaled in energy to obtain the signal yl(n), n=0, . . . , L+N−1, which matches the signal xl(n), n=L, . . . , L+N−1 in energy, wherein N is the signal length used for cross-fading. The output signal zl(n), n=0, . . . , L−1 corresponding to the current frame is:
zl(n)=yl(n),n=0, . . . , L−1.
The xl(n), n=L, . . . , L+N−1 is updated to the signal zl(n) obtained by cross-fading xl(n), n=L, . . . , L+N−1 with yl(n), n=L, . . . , L+N−1.
The method of linear predictive coding based on pitch repetition involved in
Before a lost frame is encountered, that is, while the received frames are good frames, zl(n) is stored in a buffer for future use.
When a first lost frame appears, two steps are required to synthesize the final signal yl′(n). Firstly, the past signal zl(n), n=−Q, . . . , −1 is analyzed, and then the signal yl′(n) is synthesized using the analysis result, wherein Q is the length of signal needed for analyzing the past signal.
The module for linear predictive coding based on pitch repetition specifically comprises the following parts:
The short-term analysis filter A(z) and synthesis filter 1/A(z) are based on P-order LP filters. The LP analysis filter is defined as:
A(z)=1+a1·z^(−1)+a2·z^(−2)+ . . . +aP·z^(−P)
After the LP analysis of the filter A(z), the residual signal e(n), n=−Q, . . . , −1 corresponding to the past signal zl(n), n=−Q, . . . , −1 is obtained using the following formula:
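The formula itself belongs to the original disclosure and is not reproduced here. As a minimal sketch, assuming the standard relation for an LP analysis filter, e(n) = zl(n) + a1·zl(n−1) + . . . + aP·zl(n−P), with samples before the buffer taken as zero, the residual can be computed as follows; the function name lp_residual is illustrative only:

```python
import numpy as np

def lp_residual(z_past, a):
    """Residual of the past signal through the LP analysis filter
    A(z) = 1 + a[0]*z^-1 + ... + a[P-1]*z^-P (standard relation,
    assumed here; samples before the buffer are taken as zero)."""
    coeffs = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    z = np.asarray(z_past, dtype=float)
    # full convolution with the analysis filter, truncated to the buffer length
    return np.convolve(z, coeffs)[:len(z)]
```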
The method of pitch repetition is used for compensating the lost signal. Therefore, a pitch period T0 corresponding to the past signal zl(n), n=−Q, . . . , −1 needs to be estimated. The detailed steps are as follows: Firstly, zl(n) is pre-processed to remove a low-frequency part that is not needed in the Long Term Prediction (LTP) analysis; then the pitch period T0 of zl(n) is obtained by LTP analysis; and after the pitch period T0 is obtained, the voice class is obtained in combination with a signal classification module.
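The description does not spell out the LTP analysis itself. A common approach, sketched below under that assumption, is to pick the lag that maximizes the normalized autocorrelation of the pre-processed past signal; the search range t_min/t_max and the function name estimate_pitch are illustrative and not values taken from the original method.

```python
import numpy as np

def estimate_pitch(z_past, t_min=32, t_max=160):
    """Generic pitch estimate: maximize the normalized autocorrelation of
    the past signal over an assumed lag range (in samples)."""
    z = np.asarray(z_past, dtype=float)
    best_t, best_score = t_min, -np.inf
    for t in range(t_min, min(t_max, len(z) - 1) + 1):
        a, b = z[t:], z[:-t]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        score = np.dot(a, b) / denom
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```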
The voice classes are shown in Table 1:
A pitch repetition module is used for estimating the LP residual signal e(n), n=0, . . . , L−1 corresponding to the lost frame. Before pitch repetition, if the voice class is not VOICED, the magnitude of each sample will be limited by the following formula:
If the voice class is VOICED, the residual e(n), n=0, . . . , L−1 corresponding to the lost signal is obtained by repeating the residual signal corresponding to the last pitch period of the most recently received good signal, that is:
e(n)=e(n−T0).
For other voice classes, in order to prevent the periodicity of the generated data from being too strong (for an UNVOICED signal, excessive periodicity will sound like music noise or other unpleasant noise), the following formula is used to generate the residual signal e(n), n=0, . . . , L−1 corresponding to the lost signal:
e(n)=e(n−T0+(−1)^n).
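A minimal sketch of the pitch repetition described above is given below; repeat_residual is an illustrative name, and the past residual buffer is assumed to be at least one pitch period long.

```python
import numpy as np

def repeat_residual(e_past, T0, num, voiced):
    """Extend the residual by pitch repetition: e(n) = e(n - T0) for VOICED,
    e(n) = e(n - T0 + (-1)^n) otherwise (the small index jitter weakens the
    periodicity of the generated data for non-VOICED classes)."""
    e = list(np.asarray(e_past, dtype=float))   # holds e(-Q) ... e(-1)
    Q = len(e_past)
    for n in range(num):                        # generates e(0) ... e(num-1)
        shift = T0 if voiced else T0 - (-1) ** n
        e.append(e[Q + n - shift])              # copy from one pitch period back
    return np.array(e[Q:])
```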
Besides the residual signal corresponding to the lost frame, in order to ensure a smooth joint between the lost frame and the first good frame following the lost frame, a residual signal e(n), n=L, . . . , L+N−1, of N additional samples is also generated to provide a signal for cross-fading.
After generating the residual signal e(n) corresponding to the lost frame and the signal for cross-fading, the reconstructed signal of the lost frame is given by:
wherein e(n), n=0, . . . , L−1, is the residual signal obtained by the pitch repetition. In addition, N samples ylpre(n), n=L, . . . , L+N−1 are generated using the above formula; these samples are used for cross-fading.
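As a hedged sketch of this synthesis step, assuming the reconstructed signal is produced by passing the repeated residual through the synthesis filter 1/A(z), that is, y(n) = e(n) − a1·y(n−1) − . . . − aP·y(n−P), with the filter memory initialised from the last P samples of the past signal (an assumption, since the exact initialisation is part of the original formula):

```python
import numpy as np

def lp_synthesis(e, a, z_history):
    """Pass the residual through 1/A(z): y(n) = e(n) - sum_i a_i * y(n - i).
    The filter memory is initialised from the last P past samples (assumed)."""
    P = len(a)
    mem = list(np.asarray(z_history, dtype=float)[-P:])
    y = []
    for en in e:
        yn = en - sum(a[i] * mem[-1 - i] for i in range(P))
        y.append(yn)
        mem.append(yn)
    return np.array(y)
```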
The energy of the ylpre(n) is controlled according to different voice classes provided in Table 1. That is:
yl′(n)=gmute(n)×ylpre(n), n=0, . . . , L+M−1, gmute(n)∈[0, 1]
where gmute(n) is a muting factor corresponding to each sample. The value of gmute(n) changes according to the voice class and the packet loss situation. An example is given as follows:
For voices with a large energy variation, for example plosives, corresponding to the TRANSIENT class and the VUV_TRANSITION class in Table 1, the fading speed may be relatively high. For voices with a small energy variation, the fading speed may be relatively low. For convenience of description, it is assumed that 1 ms of signal includes R samples.
Specifically, for the voice of the TRANSIENT class, starting from gmute(−1)=1, gmute(n) fades from 1 to 0 within 10 ms (S=10·R samples in total); gmute(n) corresponding to samples after 10 ms is 0, which can be expressed by the formula:
For the voice of the VUV_TRANSITION class, the fading speed within the initial 10 ms may be relatively low, and the voice fades to 0 quickly within the following 10 ms, which can be expressed by the formula:
For voices of other classes, the fading speed within the initial 10 ms may be relatively low, the fading speed within the following 10 ms may be a little higher, and the voice fades to 0 quickly within the following 20 ms, which can be expressed by the formula:
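The exact fading formulas are part of the original disclosure and are not reproduced here. The sketch below only illustrates a piecewise-linear gmute(n) whose break points follow the qualitative description above (10 ms, the following 10 ms, the following 20 ms); the intermediate levels (0.8, 0.9, 0.6) and the string class names are assumptions for illustration only.

```python
import numpy as np

def muting_factor(num, R, voice_class):
    """Per-sample muting factor g_mute(n); piecewise-linear sketch only,
    not the original formulas.  R is the number of samples per 1 ms."""
    S = 10 * R                                   # samples in 10 ms
    n = np.arange(num)
    if voice_class == "TRANSIENT":
        # fade from 1 to 0 within 10 ms, stay at 0 afterwards
        g = np.clip(1.0 - n / S, 0.0, 1.0)
    elif voice_class == "VUV_TRANSITION":
        # slow fade (to an assumed 0.8) in the first 10 ms, then quickly to 0
        g = np.where(n < S, 1.0 - 0.2 * n / S,
                     np.clip(0.8 - 0.8 * (n - S) / S, 0.0, 0.8))
    else:
        # slow, then a little faster, reaching 0 within the following 20 ms
        g = np.where(n < S, 1.0 - 0.1 * n / S,
             np.where(n < 2 * S, 0.9 - 0.3 * (n - S) / S,
                      np.clip(0.6 - 0.6 * (n - 2 * S) / (2 * S), 0.0, 0.6)))
    return g
```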
The energy scaling in
The detailed method for executing energy scaling on yl′(n), n=0, . . . , L+N−1 according to xl(n), n=L, . . . , L+M−1 and yl′(n), n=L, . . . , L+M−1 includes the following steps, referring to
Step s201, an energy E1 corresponding to the synthesized signal yl′(n),n=L, . . . L+M−1 and an energy E2 corresponding to the signal xl(n),n=L, . . . , L+M−1 are calculated respectively.
Concretely,
where M is the number of signal samples used when the energy is calculated. The value of M can be set flexibly according to the specific case. For example, when the frame length is relatively short, such as when the frame length L is shorter than 5 ms, M=L is recommended; when the frame length is relatively long and the pitch period is shorter than one frame length, M may be set to the length corresponding to one pitch period of the signal.
Step s202, the energy ratio R of E1 to E2 is calculated.
Concretely,
where sign( ) is the sign function, defined as follows:
Step s203, the magnitude of the signal yl′(n),n=0, . . . L+N−1 is adjusted in accordance with the energy ratio R.
Concretely,
where N is the length used for cross-fading of the current frame. The value of N can be set flexibly according to the specific case. When the frame length is relatively short, N may be set to the length of one frame, that is, N=L.
In order to avoid magnitude overflow (the magnitude exceeding the allowable maximum value of a sample), which may occur when E1<E2 with the above method, the above formula is only used to fade the signal yl′(n), n=0, . . . , L+N−1 when E1>E2.
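A minimal sketch of steps s201 to s203 is given below, assuming the energies E1 and E2 are sums of squares over M samples and that the scaling multiplies the whole synthesized segment by sqrt(E2/E1); these are assumptions consistent with the stated goal of matching energies, not the original formulas (which also involve the sign( ) function above). As stated, the attenuation is applied only when E1>E2 to avoid magnitude overflow.

```python
import numpy as np

def energy_scale(y_synth, x_good, L, M):
    """Sketch of steps s201-s203.  y_synth holds y_l'(n), n = 0..L+N-1,
    and x_good holds the decoded good frame x_l(n), n = L..L+M-1.
    Sum-of-squares energies and a sqrt(E2/E1) gain are assumptions."""
    y = np.asarray(y_synth, dtype=float).copy()
    E1 = np.sum(y[L:L + M] ** 2)                            # energy of synthesized part
    E2 = np.sum(np.asarray(x_good[:M], dtype=float) ** 2)   # energy of good frame
    if E1 > E2 > 0.0:                                       # fade only, never amplify
        y *= np.sqrt(E2 / E1)
    return y
```

In this sketch a quiet synthesized segment is left untouched, and only a segment that came out louder than the first received good frame is attenuated.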
When the previous frame is a lost frame and the current frame is also a lost frame, the energy scaling need not be executed for the previous frame, that is, the yl(n) corresponding to the previous frame is:
yl(n)=yl′(n) n=0, . . . , L−1.
The cross-fading in
In order to realize a smooth energy transition, after yl(n), n=0, . . . , L+N−1 is generated by executing energy scaling on the synthesized signal yl′(n), n=0, . . . , L+N−1, the low-band signals need to be processed by cross-fading. The rule is shown in Table 2.
Table 2 (fragment): zl(n) is obtained by cross-fading for n = 0, . . . , N − 1, and zl(n) = xl(n) for n = N, . . . , L − 1.
In Table 2, zl(n) is the signal finally output for the current frame, xl(n) is the signal of the good frame corresponding to the current frame, and yl(n) is the synthesized signal corresponding to the same time as the current frame.
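A sketch of the cross-fading for the first good frame after a loss is shown below, assuming linear ascending/descending windows over the first N samples; the actual weights are those of Table 2, which is not reproduced here, and cross_fade_good_frame is an illustrative name.

```python
import numpy as np

def cross_fade_good_frame(x_good, y_scaled, N):
    """Output z_l(n) for the first good frame after a loss: over the first
    N samples the energy-scaled synthesized signal y_l(n) is weighted by a
    descending window and the decoded good frame x_l(n) by an ascending
    window; the rest of the frame is taken from the good frame unchanged.
    Linear windows are an assumption for illustration."""
    x = np.asarray(x_good, dtype=float)     # decoded good frame, length L
    y = np.asarray(y_scaled, dtype=float)   # scaled synthesized tail, length >= N
    w_up = np.arange(1, N + 1) / float(N)   # ascending window applied to x
    z = x.copy()
    z[:N] = w_up * x[:N] + (1.0 - w_up) * y[:N]
    return z
```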
The schematic diagram of the above processes is shown in
The first row is the original signal. The second row is the synthesized signal, shown as a dashed line. The bottom row is the output signal, shown as a dotted line, which is the signal after energy adjustment. The frame N is a lost frame, and the frames N−1 and N+1 are both good frames. Firstly, the ratio of the energy of the received signal of frame N+1 to the energy of the synthesized signal corresponding to frame N+1 is calculated, and then the synthesized signal is faded in accordance with the energy ratio to obtain the output signal in the bottom row. The method for fading may refer to the above step s203. Cross-fading is executed last. For the frame N, the output signal of the frame N after fading is taken as the output of the frame N (it is supposed herein that the output of the signal is allowed to have a delay of at least one frame, that is, the frame N may be output after the frame N+1 is input). For the frame N+1, according to the principle of cross-fading, the output signal of the frame N+1 after fading is multiplied by a descending window and superposed on the received original signal of the frame N+1 multiplied by an ascending window. The superposed signal is taken as the output of the frame N+1.
In a second embodiment of the present invention, a signal processing method is provided which is adapted to process the synthesized signal in packet loss concealment. The difference between the first embodiment and the second embodiment is that in the first embodiment, when the pitch-period-based method is used to synthesize the signal yl′(n), a phase discontinuity may occur, as shown in
As shown in
The signal is synthesized based on pitch repetition in
By using the signal processing method provided by the embodiments of the present invention, the synthesized signal is adjusted in accordance with the ratio of the energy of the first good frame following the lost frame to the energy of the synthesized signal. This ensures that no sudden change of waveform or energy occurs in the synthesized signal at the joint between the lost frame and the first frame following the lost frame, which realizes a smooth waveform transition and avoids music noise.
A third embodiment of the present invention also provides an apparatus for signal processing which is adapted to process the synthesized signal in packet loss concealment. The structure schematic diagram is shown in
a detecting module 10, configured to notify an energy obtaining module 30 when detecting that a next frame following a lost frame is a good frame;
the energy obtaining module 30, configured to obtain an energy ratio of the energy of the good frame signal to the energy of the synchronized synthesized signal when receiving the notification sent by the detecting module 10; and
a synthesized signal adjustment module 40, configured to adjust the synthesized signal in accordance with the energy ratio obtained by the energy obtaining module 30.
Concretely, the energy obtaining module 30 further includes:
a good frame signal energy obtaining sub-module 21, configured to obtain the energy of the good frame signal;
a synthesized signal energy obtaining sub-module 22, configured to obtain the energy of the synthesized signal; and
an energy ratio obtaining sub-module 23, configured to obtain the energy ratio of the energy of the good frame signal to the energy of the synchronized synthesized signal.
In addition, the apparatus for signal processing also comprises:
a phase matching module 20, configured to execute phase matching on the inputted synthesized signal and send the phase-matched synthesized signal to the energy obtaining module 30, as shown in
Furthermore, as shown in
A specific application case of the processing apparatus in the third embodiment of the present invention is shown in
zl[n]=xl[n], n=0, . . . , L−1
where L is the frame length.
In the case that the current frame is lost, a synthesized signal yl′(n), n=0, . . . , L−1 corresponding to the current frame is generated by using the method of linear predictive coding based on pitch repetition. Different processing is executed according to whether the next frame following the current frame is lost:
When the next frame following the current frame is lost:
In this condition, the apparatus for signal processing in the embodiments of the invention does not process the synthesized signal yl′(n), n=0, . . . , L−1. The output signal zl(n), n=0, . . . , L−1 corresponding to the first lost frame is the synthesized signal yl′(n), n=0, . . . , L−1, that is, zl[n]=yl[n]=yl′[n], n=0, . . . , L−1.
When the next frame following the current frame is not lost:
When the synthesized signal yl′(n), n=0, . . . , L+N−1 is processed by the apparatus for signal processing in the embodiments of the invention, the good frame being used (that is, the next frame following the first lost frame) is the frame xl(n), n=L, . . . , L+M−1 obtained after decoding by the ADPCM decoder, wherein M is the number of signal samples used when calculating the energy. The synthesized signal being used, which corresponds to the same time as the good frame signal, is the signal yl′(n), n=L, . . . , L+M−1 generated by linear predictive coding based on pitch repetition. The yl′(n), n=0, . . . , L+N−1 is processed to obtain the signal yl(n), n=0, . . . , L+N−1, which matches the signal xl(n), n=L, . . . , L+N−1 in energy, wherein N is the signal length for cross-fading. The output signal zl(n), n=0, . . . , L−1 corresponding to the current frame is:
zl(n)=yl(n),n=0, . . . , L−1.
xl(n),n=L, . . . , L+N−1 is updated to the signal zl(n), which is obtained by the cross-fading of the xl(n),n=L, . . . , L+N−1 and the yl(n),n=L, . . . L+N−1.
By using the apparatus for signal processing provided by the embodiments of the present invention, the synthesized signal is adjusted in accordance with the ratio of the energy of the first good frame following the lost frame to the energy of the synthesized signal. This ensures that no sudden change of waveform or energy occurs in the synthesized signal at the joint between the lost frame and the first frame following the lost frame, which realizes a smooth waveform transition and avoids music noise.
A fourth embodiment of the present invention provides a voice decoder, as shown in
For the low-band decoding unit 60, as shown in
The low-band decoding sub-unit 62 decodes a received low-band signal. The pitch-repetition-based linear predictive coding sub-unit 61 obtains a synthesized signal by applying linear predictive coding to the lost low-band signal frame. The signal processing sub-unit 63 adjusts the synthesized signal to make the energy magnitude of the synthesized signal consistent with the energy magnitude of the decoded signal produced by the low-band decoding sub-unit 62, and to avoid the appearance of music noise. The cross-fading sub-unit 64 cross-fades the decoded signal produced by the low-band decoding sub-unit 62 with the synthesized signal adjusted by the signal processing sub-unit 63 to obtain the final decoded signal after lost frame compensation.
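The following toy sketch only illustrates how the two sub-band outputs are recombined after decoding and concealment; the simple sum/difference recombination stands in for the actual quadrature mirror filter synthesis bank, whose filter coefficients are not given here.

```python
import numpy as np

def synthesize_output(low_band, high_band):
    """Toy stand-in for the quadrature mirror filter unit: interleave the two
    sub-band signals back to the full-rate output.  A real QMF bank uses
    longer filters; this sum/difference pair is for illustration only."""
    lo = np.asarray(low_band, dtype=float)
    hi = np.asarray(high_band, dtype=float)
    assert len(lo) == len(hi), "sub-band frames must have equal length"
    out = np.empty(2 * len(lo))
    out[0::2] = lo + hi   # even output samples
    out[1::2] = lo - hi   # odd output samples
    return out
```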
The structure of the signal processing sub-unit 63 has three different forms corresponding to schematic structural diagrams of the signal processing apparatus shown in
Through the description of the above embodiments, a person skilled in the art can clearly understand that the present invention may be implemented by software together with a necessary general hardware platform, or by hardware alone, although the former is the preferable implementation in many cases. Based on such understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises a number of instructions for causing an apparatus to execute the method described in each embodiment of the present invention.
Though illustration and description of the present disclosure have been given combining with preferred embodiments thereof, it should be appreciated by persons of ordinary skill in the art that various changes in forms and details can be made without deviation from the spirit and scope of this disclosure, which are defined by the appended claims.
Foreign application priority data: Application No. 200710169616.1, filed November 2007, CN (national).
This application is a continuation of U.S. application Ser. No. 12/264,557, filed Nov. 4, 2008, entitled "SIGNAL PROCESSING METHOD, PROCESSING APPARATUS AND VOICE DECODER," by Wuzhou ZHAN et al., which itself claims priority to Chinese patent application No. 200710169616.1, filed with the State Intellectual Property Office of P.R.C. on Nov. 5, 2007 and entitled "METHOD AND APPARATUS FOR SIGNAL PROCESSING", the contents of which are incorporated herein by reference in their entirety.
Related U.S. application data: parent application Ser. No. 12/264,557, filed November 2008 (US); child application Ser. No. 12/539,158 (US).