The present invention relates to the processing of digital audio signals, such as speech signals in telecommunication, in particular the decoding of such signals.
Briefly, it will be recalled that a speech signal can be predicted from its recent past (for example from 8 to 12 samples at 8 kHz) using parameters assessed over short windows (10 to 20 ms in this example). These short-term predictive parameters, representing the vocal tract transfer function (for example for pronouncing consonants), are obtained by linear prediction coding (LPC) methods. A longer-term correlation is also used to determine periodicities of voiced sounds (for example the vowels) resulting from the vibration of the vocal cords. This involves determining at least the fundamental frequency of the voiced signal, which typically varies from 60 Hz (low voice) to 600 Hz (high voice) according to the speaker. Then a long-term prediction (LTP) analysis is used to determine the LTP parameters of a long-term predictor, in particular the inverse of the fundamental frequency, often called “pitch period”. The number of samples in a pitch period is then defined by the relationship Fe/F0 (or its integer part), where:
Fe is the sampling rate, and
F0 is the fundamental frequency.
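By way of illustration only, the relationship Fe/F0 can be evaluated for the figures quoted above (8 kHz sampling, fundamental frequency between 60 Hz and 600 Hz). The function name below is hypothetical and is not part of the described method.

```python
def pitch_period_in_samples(fe_hz, f0_hz):
    """Integer part of Fe/F0: number of samples in one pitch period."""
    if fe_hz <= 0 or f0_hz <= 0:
        raise ValueError("sampling rate and fundamental frequency must be positive")
    return int(fe_hz // f0_hz)


# Figures taken from the text: 8 kHz sampling, 60 Hz (low voice) to 600 Hz (high voice).
low_voice_period = pitch_period_in_samples(8000, 60)    # 133 samples
high_voice_period = pitch_period_in_samples(8000, 600)  # 13 samples
```

At 8 kHz, a pitch period thus spans roughly 13 to 133 samples depending on the speaker.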
It will be recalled therefore that the long-term prediction LTP parameters, including the pitch period, represent the fundamental vibration of the speech signal (when it is voiced), while the short-term prediction LPC parameters represent the spectral envelope of this signal.
The set of these LPC and LTP parameters thus resulting from a speech coding is transmitted by blocks to a homologous decoder via one or more telecommunications networks so that the original speech can then be reconstructed.
Within the framework of the communication of such signals by blocks, the loss of one or more consecutive blocks can occur. By the term “block” is meant a succession of signal data which can be for example a frame in mobile radiocommunication, or also a packet for example in communication over internet protocol (IP) or others.
In mobile radiocommunication for example, most predictive synthesis coding techniques, in particular coding of the “code excited linear predictive” (CELP) type, propose solutions for the recovery of erased frames. The decoder is informed of the occurrence of an erased frame, for example by the transmission of a frame erasure information originating from the channel decoder. The recovery of erased frames aims to extrapolate the parameters of the erased frame from one or more previous frames regarded as valid. Certain parameters manipulated or coded by the predictive coders have a high correlation between frames. Typically, this involves long-term prediction LTP parameters, for the voiced sounds for example, and short-term prediction LPC parameters. Due to this correlation, it is much more advantageous to reuse the parameters of the last valid frame in order to synthesize the erased frame, than to use random, even erroneous, parameters.
In standard fashion, for generating CELP excitation, the parameters of the erased frame are obtained as follows.
The LPC parameters of a frame to be reconstructed are obtained from the LPC parameters of the last valid frame, by simple copying of the parameters or with the introduction of a certain damping (a technique used for example in the G.723.1 standardized coder). Then, voicing or non-voicing is detected in the speech signal in order to determine a degree of harmonicity of the signal at the erased frame.
If the signal is non-voiced, an excitation signal can be randomly generated (by taking a code word from the past excitation, by slight damping of the gain of the past excitation, by random selection in the past excitation, or by using further transmitted codes which can be totally erroneous).
If the signal is voiced, the pitch period (also called “LTP delay”) is generally that calculated for the previous frame, optionally with a slight “jitter” (increase in the value of the LTP delay for the consecutive error frames, the LTP gain being taken to be very close to 1 or equal to 1). The excitation signal is therefore limited to the long-term prediction carried out from a past excitation.
The means of concealment of the erased frames, at decoding, are generally strongly linked to the structure of the decoder and can be common to modules of this decoder, such as for example the signal synthesis module. These means also use intermediate signals available within the decoder, such as for example the past excitation signal stored during the processing of the valid frames preceding the erased frames.
Certain techniques used to conceal the errors produced by packets lost during the transport of data coded according to a time-type coding frequently rely on waveform substitution techniques. Such techniques aim to reconstitute the signal by selecting portions of the decoded signal before the lost period, and do not implement synthesis models. Smoothing techniques are also used to avoid the artefacts produced by the concatenation of different signals.
For the decoders operating on signals coded by transform coding, the techniques for reconstructing erased frames generally rely on the structure of the coding used. Certain techniques aim to regenerate the lost transformed coefficients from the values taken by these coefficients before the erasure.
Other techniques for concealment of the erased frames have been developed jointly with the channel coding. They make use of information provided by the channel decoder, for example information relating to the degree of reliability of the parameters received. It is noted here that conversely, the subject of the present invention does not presuppose the existence of a channel coder.
In Combescure et al.:
“A 16.24.32 kbit/s Wideband Speech Codec Based on ATCELP”, P. Combescure, J. Schnitzler, K. Ficher, R. Kirchherr, C. Lamblin, A. Le Guyader, D. Massaloux, C. Quinquis, J. Stegmann, P. Vary, ICASSP (1998) Conference Proceedings,
a proposal was made for the use of an erased-frame concealment method equivalent to that used in CELP coders for a transform coder.
The drawbacks of this method were the introduction of audible spectral distortions (“synthetic” voice, unwanted resonances, etc.). These drawbacks were due in particular to the use of poorly-controlled long-term synthesis filters (single harmonic component in voiced sounds, use of portions of the past residual signal in non-voiced sounds). Moreover, the energy control is carried out here at the excitation signal level and the energy target of this signal is kept constant for the whole duration of the erasure, which also generates troublesome audible artefacts.
In FR-2.813.722, a technique is proposed for concealment of the erased frames which does not generate greater distortion at higher error rates and/or for longer erased intervals. This technique aims to avoid the excess periodicity for the voiced sounds and to improve control of the generation of the unvoiced excitation. To this end, the excitation signal (if voiced) is regarded as the sum of two signals:
The main problem of the error concealment technique hitherto used in CELP coders resides in the generation of the voiced excitation which, when several consecutive frames have been lost, can result in an overvoicing effect due to the repetition of the same pitch period over several frames.
The present invention offers an improvement on the situation.
To this end it proposes a method for synthesizing a digital audio signal represented by consecutive blocks of samples, in which on receiving such a signal, in order to replace at least one invalid block, a replacement block is generated from the samples of at least one valid block preceding the invalid block.
The method according to the invention comprises the following steps:
The purpose of this inversion of samples, which therefore consists of a very simple manipulation of samples which has a low cost in terms of computation and processing means, is to “break” an over-harmonicity which may have been present if a simple copying of pitch period was used.
Thus, among the advantages offered by the present invention, its implementation requires only a very low computation cost.
Advantageously, the invention can be applied to the case where the digital audio signal is a voiced speech signal and more particularly, weakly voiced, as simple copying of the pitch period produces mediocre results in this case. Thus, according to an advantageous feature, a degree of voicing is detected in the speech signal and steps a) to d) are applied if the signal is at least weakly voiced.
The present invention advantageously relies on the fundamental frequency of the digital audio signal to constitute the groups in step b). Thus, advantageously, in step a):
Of course, in the case of a speech signal, the operation a1) can consist of detecting a voicing and the operation a2) would involve, if the speech signal is voiced, selecting a number of samples which extends over a whole pitch period (inverse of a fundamental frequency of a voice tone). Nonetheless, it will be shown that this realization can also involve a signal other than a speech signal, in particular a musical signal, if a fundamental frequency specific to an overall music tone can be detected therein.
In an embodiment, the fragmentation of step b) is carried out by groups of two samples, and the positions of the samples of a single group can be inverted one with the other.
However, in this embodiment, it is appropriate to distinguish the cases where the pitch period (or more generally the inverse of the fundamental frequency) comprises an even number or an odd number of samples. In particular, if the number of samples comprised by the period of the detected tone is an even number, an odd number of samples (preferentially a single sample) is advantageously added to or subtracted from the samples of said period in order to form the selection of step a).
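The effect of the parity of the period can be sketched as follows, under stated assumptions. The sketch extends a stored signal pair by pair, each new pair being the inverted copy of the pair located one period earlier (this corresponds to the recursion of equations (2) and (3) given later in the description). The function names, the choice of adding rather than subtracting one sample, and the default gain are illustrative assumptions.

```python
def make_period_odd(period):
    """Add a single sample when the period is even (subtracting one is the other option)."""
    return period + 1 if period % 2 == 0 else period


def extend_with_pair_inversion(history, period, n_out, gain=1.0):
    """Generate n_out samples; each pair is the swapped copy of the pair one period back."""
    if period < 2 or period > len(history):
        raise ValueError("period must be at least 2 and fit in the stored history")
    buf = list(history)
    start = len(buf)
    n = start
    while n < start + n_out:
        buf.append(gain * buf[n - period + 1])  # s(n)   = g * s(n - T + 1)
        buf.append(gain * buf[n - period])      # s(n+1) = g * s(n - T)
        n += 2
    return buf[start:start + n_out]


even_case = extend_with_pair_inversion([0, 1, 2, 3], 4, 8)
# even_case == [1, 0, 3, 2, 0, 1, 2, 3]: the second generated period is the original again
odd_case = extend_with_pair_inversion([0, 1, 2, 3, 4], make_period_odd(4), 10)
# odd_case == [1, 0, 3, 2, 1, 4, 3, 0, 1, 2]: the pairs straddle period boundaries
```

In this sketch, an even period makes every second generated period an exact copy of the original one, so a strict periodicity survives; with an odd period the pairs no longer line up from one period to the next.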
It is also appropriate to specify what is meant by the “predetermined rules of inversion”. These rules, which can be chosen according to the characteristics of the signal received, in particular impose the number of samples per group at step b) and the manner of inverting the samples in a group. In the above embodiment, groups of two samples and a simple inversion of the respective positions of these two samples are provided. However, other configurations are possible (groups comprising more than two samples and permutation of all the samples of such groups). Moreover, the inversion rules can also set the number of groups in which the inversion is carried out. A particular embodiment consists of randomizing the instances of sample inversion in each group and setting a probability threshold for inverting, or not inverting, the samples of a group. This probability threshold can have a fixed value, or also a variable value and depend advantageously on a correlation function relating to the pitch period. In this case, the formal determination of the pitch period itself is not necessary. Moreover, more generally, the processing within the meaning of the invention can also be carried out if the valid signal received is simply non-voiced, in which case there is no actual detectable pitch period. In this case, it can be provided to set a given arbitrary number of samples (for example two hundred samples) and carry out the processing within the meaning of the invention on this number of samples. It is also possible to take the value corresponding to the maximum of the correlation function by limiting the search to a value interval (for example between MAX_PITCH/2 and MAX_PITCH, where MAX_PITCH is the maximum value in the pitch period search).
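The inversion rules discussed above (number of samples per group, manner of permuting them, randomized application) can be illustrated by a minimal sketch. It assumes that the permutation is a simple reversal of each group, which is only one possible choice, and that incomplete trailing groups are left untouched. The names `invert_groups`, `group_size` and `p_invert` are hypothetical.

```python
import random


def invert_groups(samples, group_size=2, p_invert=1.0, rng=None):
    """Fragment into groups, reverse each complete group with probability p_invert,
    then re-concatenate the groups."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    rng = rng or random.Random()
    out = []
    for start in range(0, len(samples), group_size):
        group = list(samples[start:start + group_size])
        # Only complete groups are permuted; a shorter trailing group is copied as-is.
        if len(group) == group_size and rng.random() < p_invert:
            group.reverse()
        out.extend(group)
    return out


systematic = invert_groups([1, 2, 3, 4, 5, 6, 7], group_size=3)
# systematic == [3, 2, 1, 6, 5, 4, 7]
```

With `p_invert=1.0` the inversion is systematic; a smaller value inverts only some of the groups, and whatever the setting the output contains exactly the same samples as the input.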
The present invention, which thus proposes the attenuation of overvoicing, offers the following advantages:
Moreover, further advantages and features of the invention will become apparent on examination of the detailed description given by way of example hereafter, and of the attached drawings in which:
a illustrates the application of the systematic inversion of
b represents, purely by way of illustration, the application of the systematic inversion of
c illustrates the application of the systematic inversion of
Firstly, reference is made to
On the other hand, if the loss of one or more consecutive blocks is noted (arrow N at the output of test 50), the degree of voicing of the signal is then detected (test 51).
If the signal is non-voiced (arrow N at the output of test 51), the lost blocks are replaced for example by an audible white noise, called “comfort noise” 52, and the gain 61 of the samples of the blocks thus reconstructed is adjusted. For example, the energy of the reconstructed signal So can be controlled, with adaptation of its evolution law, and/or the parameters of the model can be made to evolve towards a rest signal such as the comfort noise 52.
In a variant of the present invention, only two classes of signals are considered, the voiced signals on the one hand, and the weakly voiced or non-voiced signals on the other hand. The advantage of this variant is that the generation of the non-voiced signal will be identical to the weakly voiced synthesis. As indicated previously, the “pitch period” used for the non-voiced signals is an arbitrary value, preferably quite large (for example two hundred samples). In a non-voiced block, the previous signal is non-harmonic; by applying the processing within the meaning of the invention to a sufficiently large period, it can be guaranteed that the signal thus generated remains non-harmonic. The nature of the signal will advantageously be retained, which would not be the case when using a randomly-generated signal (for example a white noise).
If the signal is highly voiced (arrow Y at the output of test 51), the lost blocks are replaced by copying the pitch period T. Thus the pitch period T identified in the last still valid part of the received signal Si is determined (using any technique 53 which can be known per se). The samples of this pitch period T are then copied into the lost blocks (reference 54). Then, an appropriate gain 61 is applied to the samples thus replaced (in order to carry out for example an attenuation or “fading”).
In the example described, if the signal is averagely voiced (or, in a less sophisticated but more general variant, if the signal is simply voiced), the method within the meaning of the invention is applied (arrow A at the output of test 51 concerned with the degree of voicing).
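The decision flow of tests 50 and 51 can be summarized by the following sketch. The numerical thresholds `low` and `high`, the scalar `voicing` measure and the returned labels are purely hypothetical; the description does not specify how the degree of voicing is quantified at this point. The handling of a block received without loss is likewise assumed to be ordinary decoding.

```python
def choose_concealment(block_lost, voicing, low=0.3, high=0.8):
    """Return a label for the processing applied to the current block.

    voicing is assumed to be a score in [0, 1]; low and high are illustrative thresholds.
    """
    if not block_lost:
        return "decode_normally"            # test 50: no loss noted (assumed handling)
    if voicing >= high:
        return "copy_pitch_period"          # highly voiced: plain copy of the period T
    if voicing >= low:
        return "copy_with_sample_inversion" # moderately voiced: processing of the invention
    return "comfort_noise"                  # non-voiced: comfort noise 52


decision = choose_concealment(block_lost=True, voicing=0.5)
```

In the two-class variant mentioned above, the last two branches would merge, the non-voiced case then using the sample inversion with a large arbitrary period.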
With reference to
With reference in particular to
In
Returning to the description of the embodiment illustrated in
In the case of
On the other hand, in the case illustrated in
This problem can be overcome by modifying the number of samples to be inverted per group (and taking for example an odd number of samples per group).
However, a further embodiment is illustrated in
Again with reference to
As previously indicated with reference to
Usually, in a simple copying of the pitch period, the voiced excitation is calculated according to a formula of the type:
s(n)=gltp·s(n−T) (1)
where T is the estimated pitch period and gltp is a chosen LTP gain.
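A minimal sketch of formula (1) is given below for reference. The function and argument names are hypothetical, and the stored past signal is represented as a plain list of samples.

```python
def copy_pitch_period(history, period, n_out, g_ltp=1.0):
    """Apply s(n) = g_ltp * s(n - T) recursively to generate n_out samples."""
    if period < 1 or period > len(history):
        raise ValueError("period must be positive and fit in the stored history")
    buf = list(history)
    for _ in range(n_out):
        # buf[-period] is the sample located exactly T positions before the new one.
        buf.append(g_ltp * buf[-period])
    return buf[len(history):]


repeated = copy_pitch_period([1, 2, 3], period=3, n_out=7)
# repeated == [1, 2, 3, 1, 2, 3, 1]
```

Because the recursion reads samples that were themselves generated, a gain below 1 compounds from one period to the next, and with a gain equal to 1 the same period is repeated indefinitely, which is the source of the overvoicing discussed above.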
In an embodiment of the invention, the voiced excitation is calculated per group of two samples and with random inversion according to the processing hereafter. Firstly, a random number x is generated in the interval [0; 1]. Then, according to the value of x compared with a threshold p, either the formula (1) is applied to s(n) and s(n+1), or the two samples are inverted according to:
s(n)=gltp·s(n−T+1) (2)
s(n+1)=gltp·s(n−T) (3)
The value p sets the probability with which the two samples s(n) and s(n+1) are inverted or left in place. For example, the value p can be set such that p=50%.
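The processing per group of two samples can be sketched as follows. The direction of the comparison is an assumption: the sketch treats x < p as the branch that keeps formula (1), so that a high p preserves the voicing, and uses equations (2) and (3) otherwise; the two branches should be swapped if p is read as the probability of inverting. All names are hypothetical.

```python
import random


def excitation_with_random_inversion(history, period, n_out, p=0.5, g_ltp=1.0, rng=None):
    """Generate n_out samples two at a time, inverting each pair at random."""
    if period < 2 or period > len(history):
        raise ValueError("period must be at least 2 and fit in the stored history")
    rng = rng or random.Random()
    buf = list(history)
    start = len(buf)
    n = start
    while n < start + n_out:
        x = rng.random()  # random number in [0, 1)
        if x < p:
            # Assumed branch: formula (1) applied to s(n) and s(n+1), no inversion.
            buf.append(g_ltp * buf[n - period])
            buf.append(g_ltp * buf[n + 1 - period])
        else:
            buf.append(g_ltp * buf[n - period + 1])  # equation (2)
            buf.append(g_ltp * buf[n - period])      # equation (3)
        n += 2
    return buf[start:start + n_out]


plain = excitation_with_random_inversion([1, 2, 3, 4], 4, 4, p=1.0)     # [1, 2, 3, 4]
inverted = excitation_with_random_inversion([1, 2, 3, 4], 4, 4, p=0.0)  # [2, 1, 4, 3]
```

The two extreme settings reproduce the plain pitch copy and the systematic inversion respectively; with p=50% the two cases are equally likely, so the direction of the comparison does not matter.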
In an advantageous variant, a variable probability can also be chosen, for example in the form:
p=corr (4)
where the variable corr corresponds to the maximum value of the correlation function over the pitch period, marked Corr(T). For a pitch period T, the correlation function Corr(T) is calculated using only 2*Tm samples at the end of the stored signal, and:
where m0 . . . mLmem-1 are the last samples of the previously decoded signal and are still available in the decoder memory.
From this formula, it will be understood that the length of this memory Lmem (in number of samples stored) must be equal to at least twice the maximum value of the duration of the pitch period (in number of samples). In order to take into account the lowest voices (lowest fundamental frequency of the order of 50 Hz), the number of samples to be stored can be of the order of 300, for a low narrowband sampling rate and more than 300 for higher sampling rates.
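The order of magnitude quoted above can be checked with a short calculation. The sketch assumes a narrowband sampling rate of 8 kHz and a wideband rate of 16 kHz, which are illustrative values, and uses a hypothetical function name.

```python
import math


def min_memory_length(fe_hz, lowest_f0_hz):
    """Lmem must be at least twice the longest pitch period, in samples."""
    max_period = math.ceil(fe_hz / lowest_f0_hz)
    return 2 * max_period


narrowband = min_memory_length(8000, 50)   # 2 * 160 = 320 samples
wideband = min_memory_length(16000, 50)    # 2 * 320 = 640 samples
```

A lowest fundamental frequency of 50 Hz therefore leads to about 320 stored samples at 8 kHz, consistent with the figure of the order of 300, and to proportionally more at higher sampling rates.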
The correlation function Corr(T), given by the formula (5), reaches a maximum value when the variable T corresponds to the pitch period T0 and this maximum value gives an indication of the degree of voicing. Typically, if this maximum value is very close to 1, then the signal is highly voiced. If it is close to 0, the signal is not voiced.
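The joint search for the pitch period and the degree of voicing can be sketched as follows. The formula (5) is not reproduced here, so the sketch assumes a conventional normalized cross-correlation between the last `window` samples of the memory and the same number of samples taken `lag` positions earlier; this may differ from the exact expression of the description. All names are hypothetical.

```python
import math


def normalized_correlation(memory, lag, window):
    """Assumed form of Corr(T): normalized correlation over the end of the memory."""
    n = len(memory)
    if lag < 1 or window < 1 or lag + window > n:
        raise ValueError("lag and window must fit in the stored memory")
    recent = memory[n - window:]
    delayed = memory[n - window - lag:n - lag]
    num = sum(a * b for a, b in zip(recent, delayed))
    den = math.sqrt(sum(a * a for a in recent) * sum(b * b for b in delayed))
    return num / den if den > 0 else 0.0


def estimate_pitch_and_voicing(memory, min_lag, max_lag, window):
    """Return the lag maximizing the correlation and the maximum value itself."""
    best_lag, best_corr = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        c = normalized_correlation(memory, lag, window)
        if c > best_corr:
            best_lag, best_corr = lag, c
    return best_lag, best_corr


# Synthetic, strictly periodic test signal with a period of 20 samples.
signal = [math.sin(2 * math.pi * i / 20) for i in range(200)]
lag, corr = estimate_pitch_and_voicing(signal, min_lag=10, max_lag=30, window=40)
```

With a window of Tm samples and lags up to Tm, at most 2*Tm samples at the end of the memory are read. If the maximum is used directly as the probability p, clamping it to the interval [0, 1] would be a natural precaution, since a normalized correlation can be negative.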
It will thus be understood that in this embodiment, the prior determination of the pitch period is not necessary for constructing the groups of samples to be inverted. In particular, the determination of the pitch period T0 can be carried out jointly with the constitution of the groups within the meaning of the invention, by applying the formula (5) above.
If the signal is highly voiced, then the probability p will be very high, and the voicing will be retained in accordance with the calculation according to the formula (1). If, on the other hand, the voicing of the signal Si is not very marked, the probability p will be lower and advantageously the equations (2) and (3) are used.
Of course, other correlation calculations can also be used.
For example, it is also possible to calculate the harmonic excitation according to predefined classes. For the highly voiced classes, the equation (1) is preferably used. For the averagely or weakly voiced classes, the equations (2) and (3) are preferably used. For the non-voiced classes, no harmonic excitation is generated and the excitation can then be generated from a white noise. However, in the previously described variant, the equations (2) and (3) are also used with a sufficiently large arbitrary pitch period.
More generally, the present invention is not limited to the embodiments described above by way of example; it extends to other variants.
In the context of the embodiment of the invention described in detail above, the excitation generation in coding by CELP predictive synthesis aims to avoid overvoicing in the context of frame transmission error concealment. It can nevertheless be envisaged to use the principles of the invention for band extension. It is then possible to use the generation of an extended-bandwidth excitation in a band extension system (with or without data transmission), based on a model of the CELP (or CELP sub-band) type. High-band excitation can then be calculated as described previously, which then makes it possible to limit the over-harmonicity of this excitation.
Moreover, the implementation of the invention is particularly suitable for frame or packet transmission of signals over networks, for example “voice over internet protocol (VOIP)”, in order to provide an acceptable quality over IP when such packets are lost, while nevertheless guaranteeing a limited complexity.
Of course, the inversion of the samples can be carried out on groups of samples of a size greater than two.
Moreover, the generation of a replacement block for an invalid block from samples of a valid block preceding the invalid block has been described above. In a variant, it is possible to rely instead on a valid block succeeding the invalid block in order to carry out the synthesis of the invalid block (a posteriori synthesis). This implementation can be advantageous, in particular for synthesizing several successive invalid blocks and in particular for synthesizing:
The present invention also involves a computer program intended to be stored in the memory of a digital audio signal synthesis device. This program then comprises instructions for the implementation of the method within the meaning of the invention, when it is executed by a processor of such a synthesis device. Moreover, the previously-described
Moreover, the present invention also involves a device for synthesizing a digital audio signal constituted by a succession of blocks. This device could further comprise a memory storing the above-mentioned computer program. With reference to
The synthesis device SYN within the meaning of the invention comprises means such as a working storage memory MEM (or memory for storing the above-mentioned computer program) and a processor PROC cooperating with this memory MEM, for implementation of the method within the meaning of the invention, and thus for synthesizing the current block starting from at least one of the preceding blocks of the signal Si.
The present invention also involves a device for receiving a digital audio signal constituted by a succession of blocks, such as a decoder of such a signal for example. Again with reference to
| Number | Date | Country | Kind |
|---|---|---|---|
| 0609225 | Oct 2006 | FR | national |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/FR07/52188 | 10/17/2007 | WO | 00 | 6/24/2009 |