The embodiments discussed herein relate to a voice waveform interpolating apparatus, for example, a voice waveform interpolating apparatus used when reproducing, on the receiving side, a voice waveform corresponding to a voice packet lost during transmission of voice packets in a packet communication system. The embodiments further relate to, for example, a voice waveform interpolating apparatus usable in voice editing or processing systems, such as systems that edit or process stored phoneme piece data to generate new voice data.
Note that, in the following, the former of these, that is, a voice packet communication system, will be explained as an example.
In recent years, with the spread of the Internet, so-called VoIP (Voice over IP) communication systems, which transmit voice data packetized into voice packets through an IP (Internet Protocol) network, have rapidly come into widespread use.
If part of the voice packets to be received is lost or dropped in an IP network transmitting PCM data in packet units in this way, the quality of the voice reproduced from the voice packets will deteriorate. Therefore, a variety of methods have been proposed in the past for preventing, as much as possible, the user from noticing the deterioration in voice quality caused by the loss etc. of voice packets.
As one voice packet loss concealment method, there is already known ITU-T (International Telecommunication Union Telecommunication Standardization Sector) Recommendation G.711 Appendix I. In the packet loss concealment method stipulated in G.711 Appendix I, first, the pitch period, a physical property of voice, is extracted using waveform correlation. The extracted pitch waveform is then repeatedly arranged at the parts corresponding to the lost voice packets to generate a loss concealment signal. Note that the loss concealment signal is made to gradually attenuate when voice packet loss occurs continuously.
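The general shape of such pitch-repetition concealment can be sketched as follows. This is a minimal illustration in Python, not the exact algorithm stipulated in G.711 Appendix I; the function names, the 8 kHz sampling rate, the pitch search band, and the attenuation factor are all assumptions made for illustration.

```python
import numpy as np

def estimate_pitch_period(history, fs=8000, f_min=60, f_max=400):
    """Estimate the pitch period (in samples) of the most recent voice
    history by normalized waveform correlation."""
    lo, hi = int(fs / f_max), int(fs / f_min)
    seg = history[-2 * hi:]
    best_lag, best_corr = lo, -1.0
    for lag in range(lo, hi + 1):
        a, b = seg[lag:], seg[:len(seg) - lag]
        corr = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

def conceal_lost_frame(history, frame_len, consecutive_losses, decay=0.8):
    """Fill one lost frame by repeating the last pitch period of the
    received history, attenuating gradually as losses continue."""
    period = estimate_pitch_period(history)
    pitch_wave = history[-period:]
    reps = int(np.ceil(frame_len / period))
    out = np.tile(pitch_wave, reps)[:frame_len]
    return out * (decay ** consecutive_losses)
```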
Further, several methods for interpolated reproduction of voice loss have been proposed, for example, those of the following Patent Literature 1 to Patent Literature 3.
Patent Literature 1 discloses a method of generating a loss concealment signal by imparting pitch period fluctuations and power fluctuations estimated from voice data normally received prior to packet loss. Further, Patent Literature 2 discloses a method of referring to at least one of the packets before the packet loss and the packets after the packet loss and utilizing their pitch fluctuation characteristics and power fluctuation characteristics to estimate the pitch fluctuation and power fluctuation of the voice loss segment, and a method of reproducing the voice waveform of the voice loss segment by using these estimated characteristics. Further, Patent Literature 3 discloses a method of calculating an optimal matching waveform with the signal of voice packets input prior to the loss by a non-standard differential operation and determining an interpolated signal, in which the signal of the voice packets input prior to the loss is interpolated, based on the minimum value of the calculated results.
Patent Literature 1: Japanese Laid-open Patent Publication No. 2001-228896
Patent Literature 2: International Publication Pamphlet No. WO2004/068098
Patent Literature 3: Japanese Laid-open Patent Publication No. 02-4062
According to the above conventional methods for waveform interpolation of voice loss, a waveform is extracted from immediately before or immediately after a lost packet, its pitch period is extracted, and the pitch waveform is repeated to generate an interpolated voice waveform. In this case, because the waveform is simply extracted from immediately before or immediately after the lost packet, the pitch waveform is repeated in the same way in all cases, regardless of the type of the extracted waveform.
If the immediately preceding waveform used in generating the above interpolated voice waveform is a steady waveform having an amplitude of a constant level or greater and low amplitude fluctuation, such as near the middle of a vowel, a voice waveform with almost no voice quality deterioration can be generated. However, if packet loss occurs at, for example, a transition part at which the formant greatly changes from a vowel to a consonant, or at the end of a breath group etc., there are cases where, even if the waveform used in the generation of the interpolated voice waveform is a cyclic waveform having high self-correlation, the reproduced waveform will become a buzzer-like noise and cause sound quality deterioration. This is shown in the illustrations.
At a glance, the waveform Pb′ is a clean waveform, but if it is reproduced as actual voice, it becomes a buzzing sound that is unpleasant for the user.
According to an aspect of the embodiments, the apparatus may be a voice waveform interpolating apparatus which does not generate unpleasant reproduction sounds.
Further, a voice waveform interpolating method for accomplishing this and a voice waveform interpolating program for a computer may be provided.
The above apparatus, as explained using the following figures, comprises:
(i) a voice storage unit storing voice data,
(ii) an interpolated waveform generation unit generating voice data in which a part of the voice data is interpolated by another part of the voice data,
(iii) a waveform combining unit combining voice data from the voice storage unit with interpolated voice data from the interpolated waveform generation unit replacing part of the same, and
(iv) an interpolated waveform setting function unit judging if a part of the voice data is appropriate as interpolated voice data for interpolation in the interpolated waveform generation unit, selecting the voice data that is deemed appropriate, and setting this voice data as the interpolated voice data. Among these, the interpolated waveform setting function unit of the above (iv) may be a characterizing constituent.
This interpolated waveform setting function unit (iv) includes, in further detail, an amplitude information analyzing part analyzing the amplitude information for the voice data from the voice storage unit and a voice waveform judging unit judging based on the analysis results if this voice data is appropriate as the interpolated voice data.
In further detail, the amplitude information of the voice data is calculated in frame units to find the amplitude envelope from the amplitude values in the time direction, and the position on this amplitude envelope of the neighboring waveform to be used in waveform interpolation is identified. The above voice waveform judging unit judges, from the amplitude information at this identified position, whether the waveform is appropriate for repetition as described above.
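As a concrete sketch, such frame-unit amplitude analysis might look like the following; the peak-absolute-value amplitude measure and the 8 kHz sampling rate are assumptions made only for illustration (the 4 ms frame length is taken from the example given later).

```python
import numpy as np

def amplitude_envelope(voice, frame_len=32):
    """One amplitude value per frame, tracing the amplitude envelope of
    the waveform in the time direction (32 samples = 4 ms at an assumed
    8 kHz sampling rate)."""
    n_frames = len(voice) // frame_len
    frames = voice[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Peak absolute amplitude per frame (an assumed amplitude measure).
    return np.abs(frames).max(axis=1)
```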
Here, the interpolated waveform setting function unit 5 includes an amplitude information analyzing part 6 analyzing the amplitude information for the voice data Din from the voice storage unit 2 and a voice waveform judging unit 7 judging if the interpolated voice data Dc is appropriate based on the analysis results.
Here, the voice waveform judging unit 7 judges if the interpolated voice data Dc is appropriate according to the position on the amplitude envelope specified from the amplitude information in the time direction. Note that the "SW" illustrated in the upper right of the figure is a switch for transmitting the input voice data Din as the output voice data Dout as it is, or alternatively switching to voice data including the interpolated voice data Dc obtained by interpolation from the waveform combining unit 4. Here, to facilitate understanding of the principle of the embodiments, the configuration is illustrated in simplified form.
In order to explain the judgment method of this voice waveform judging unit 7, an example of an actual voice waveform and its amplitude envelope is considered.
In this case, the judgment criterion is at what positions on the amplitude envelope EV the candidates are located. Here, the amplitude envelope EV of the illustrated waveform is analyzed to identify those positions.
A voice waveform interpolating apparatus used in a voice editing/processing system and a voice waveform interpolating apparatus used in a packet communication system are realized by the principle of the above embodiment.
The voice waveform interpolating apparatus used in the former voice editing or processing system comprises a voice storage unit 2 storing a plurality of phoneme pieces, an interpolated waveform generation unit 3 generating voice data Dc in which a part of a series of voice data Din is interpolated by the repeated use of the phoneme pieces, a waveform combining unit 4 combining voice data stored in the voice storage unit 2 with interpolated voice data from the interpolated waveform generation unit 3 replacing part of that voice data, and an interpolated waveform setting function unit 5 judging if a part of the voice data is appropriate as interpolated voice data for interpolation in the interpolated waveform generation unit 3, selecting the voice data deemed appropriate, and setting this voice data as the interpolated voice data. If this voice waveform interpolating apparatus is used, it is possible to judge the appropriateness of a phoneme piece, for example, (i) when determining the phoneme boundary of consonants in the labeling of a synthesized voice waveform, (ii) when arranging phoneme pieces during voice synthesis, or (iii) when determining a phoneme piece whose length is to be elongated when altering the speech speed.
The voice waveform interpolating apparatus used in the latter packet communication system comprises a voice storage unit 2 storing, in sequence, the voice data of each normally received packet from the packets successively received, an interpolated waveform generation unit 3 which, when a part of the voice data Din is missing due to packet loss (discard or delay), interpolates the missing part with another part of the voice data Din to generate voice data Dc, a waveform combining unit 4 combining the voice data Din stored in the voice storage unit 2 with the interpolated voice data Dc from the interpolated waveform generation unit 3 replacing a part of the same, and an interpolated waveform setting function unit 5 judging if a part of the voice data Din is appropriate as interpolated voice data Dc for interpolation in the interpolated waveform generation unit 3, selecting the voice data deemed appropriate, and setting this voice data as the interpolated voice data.
The interpolated waveform setting function unit 5 comprises an amplitude value calculation unit 8, amplitude information storage unit 9, and voice waveform judging unit 7. In packet communication in the above packet communication network, the input voice data Din is stored in the voice storage unit 2 at segments where packets are normally received. The amplitude value calculation unit 8 calculates the amplitude values in frame units from the voice data Din in the voice storage unit 2 and thereby obtains amplitude envelope information, the maximum amplitude value, the minimum amplitude value, and other amplitude information. The amplitude information storage unit 9 stores the amplitude information calculated by the amplitude value calculation unit 8.
When packet loss has occurred, the voice waveform judging unit 7 identifies the position of a waveform piece on the amplitude envelope (EV) when the waveform piece before or after the lost packet is input from the voice storage unit 2. It is judged if a waveform to be made a candidate for the interpolated waveform is at a relative minimum on the amplitude envelope (EV) or at a part Pd immediately before an unvoiced segment S. The judgment results are notified to the interpolated waveform generation unit 3.
The interpolated waveform generation unit 3 generates a waveform in the segment at which a packet was lost according to the judgment results. Further, the waveform combining unit 4 combines the voice waveform of a normally received segment and the waveform of the interpolated segment generated in the interpolated waveform generation unit 3 so that these waveforms are smoothly bridged, obtaining the output voice data Dout.
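The text does not specify how the two waveforms are bridged; one plausible sketch is a linear cross-fade (overlap-add) at the seam, with the overlap length an arbitrary assumption.

```python
import numpy as np

def crossfade_join(received, interpolated, overlap=16):
    """Join a normally received segment and an interpolated segment,
    cross-fading over `overlap` samples so the seam is smooth."""
    fade = np.linspace(1.0, 0.0, overlap)
    seam = received[-overlap:] * fade + interpolated[:overlap] * (1.0 - fade)
    return np.concatenate([received[:-overlap], seam, interpolated[overlap:]])
```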
When the voice waveform judging unit 7 judges that the position on the amplitude envelope (EV) of interpolated voice data Dc serving as a candidate for replacing the voice loss is at a relative minimum Pc1, Pc2 of the amplitude or at the position Pd immediately before an unvoiced segment, the voice data of that part is not used as interpolated voice data Dc. Instead, other voice data at positions other than that part, or background noise segments, are searched for.
Further, the voice waveform judging unit 7 selects, as candidates to become interpolated voice data for replacing the above voice loss, at least one of the preceding (backward) voice data appearing earlier on the time axis than the voice data Din to be interpolated and the succeeding (forward) voice data appearing later on the time axis.
Note that, in the method of generation of the interpolated waveform, a variety of waveforms may be combined, e.g. (i) a noise waveform may be overlaid on an interpolated waveform generated by waveform repetition, and (ii) when a series of packet losses occur for a long period of time, the lost packets may be divided into a first and second half, wherein the method of generation of the waveform may be changed for the first and second half, respectively.
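The two variants (i) and (ii) might be sketched as follows; the noise level and the second-half attenuation are illustrative assumptions only.

```python
import numpy as np

def overlay_noise(interp, noise_level=0.05, rng=None):
    """Variant (i): overlay a low-level noise waveform on an
    interpolated waveform generated by waveform repetition."""
    if rng is None:
        rng = np.random.default_rng()
    return interp + noise_level * rng.standard_normal(len(interp))

def two_half_concealment(first_half, second_half, decay=0.5):
    """Variant (ii): for a long run of lost packets, generate the first
    and second halves differently (here the second half is simply
    attenuated as an example of a changed generation method)."""
    return np.concatenate([first_half, decay * second_half])
```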
The input voice data Din is input to the voiced sound/unvoiced sound judging unit 11 and divided into voiced segments and unvoiced segments. Next, the amplitude value calculation unit 8 calculates the amplitude value of the voice in frame units (for example, 4 msec) from the input voice data Din stored in the voice storage unit 2. Based on the information of the amplitude envelope (EV) indicating the changes of the amplitude value in the time direction, as well as the results of the division by the above voiced sound/unvoiced sound judging unit 11, the maximum value and minimum value in the voiced segments and the average amplitude in the unvoiced segments are calculated. Further, the amplitude information storage unit 9 stores both the amplitude information calculated by the amplitude value calculation unit 8 and the voiced sound/unvoiced sound judgment results of the unit 11.
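A sketch of such frame-level voiced/unvoiced division and the accompanying amplitude statistics follows; the energy and zero-crossing-rate criteria and their thresholds are common heuristics assumed here, not a method the text prescribes.

```python
import numpy as np

def is_voiced(frame, energy_thresh=1e-4, zcr_thresh=0.25):
    """Classify one frame: voiced frames tend to have high energy and a
    low zero-crossing rate (an assumed heuristic)."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return energy > energy_thresh and zcr < zcr_thresh

def amplitude_statistics(frames):
    """Per-frame amplitudes plus the maximum/minimum over voiced frames
    and the average amplitude over unvoiced frames (cf. steps S16/S17)."""
    amps = np.abs(frames).max(axis=1)
    voiced = np.array([is_voiced(f) for f in frames])
    return {
        "envelope": amps,
        "voiced_max": float(amps[voiced].max()) if voiced.any() else 0.0,
        "voiced_min": float(amps[voiced].min()) if voiced.any() else 0.0,
        "unvoiced_avg": float(amps[~voiced].mean()) if (~voiced).any() else 0.0,
    }
```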
When packet loss has occurred and the waveform parts before (or after) the lost packet are input to the voice waveform judging unit 7 from the voice storage unit 2, the positions of these waveform parts on the amplitude envelope (EV) are identified. Judgment is performed on whether the waveform to be the candidate for interpolation is positioned at a relative minimum on the amplitude envelope (EV) or immediately before an unvoiced segment S. An example of an actual voice waveform is as illustrated above.
If the above voiced sound/unvoiced sound judging unit 11 is introduced, the advantages are that the accuracy of calculation of the maximum value, minimum value, and relative minimums increases and the calculation load on the amplitude value calculation unit 8 becomes lighter. In the following, the operation flow when introducing the voiced sound/unvoiced sound judging unit 11 will be explained.
Step S11: It is judged if a packet is normally received.
Step S12: If the packet is normally received (YES), that one packet data (voice data) is fetched.
Step S13: The input voice data Din is stored in the voice storage unit 2.
Step S14: Further, the above voiced sound/unvoiced sound judging unit 11 performs processing for dividing the voice data Din into voiced parts and unvoiced parts.
Step S15: Judgment is performed based on the results of the division.
Step S16: If it is deemed to be “voiced” by the above judgment (YES), the amplitude envelope (EV) of the voice data and the maximum value of the amplitude are calculated.
Step S17: On the other hand, if it is deemed to be “unvoiced” by the above judgment, the average value of the unvoiced amplitude (that is, the minimum value of the unvoiced amplitude) is calculated.
Step S18: The calculated data is stored in the amplitude information storage unit 9.
Step S19: At the above initial step S11, if it is judged that a packet was not normally received (packet loss), judgment by the above waveform judging unit 7 is performed based on the amplitude information stored at step S18.
Step S20: As in the above, interpolated voice data Dc is generated by the interpolated waveform generation unit 3.
Step S21: Further, the input voice data Din and interpolated voice data Dc are smoothly combined by the waveform combining unit 4.
Step S22: The output voice data Dout is obtained. Here, the above step S19 is explained in further detail.
Step S31: The voice waveform judging unit 7 examines the rate of amplitude change at the position, on the amplitude envelope EV, of the waveform piece covered by the judgment.
Step S32: However, judgment of parts which are inappropriate for use as interpolated waveforms is performed by the following three steps with respect to the parts having small rates of amplitude change. First, if the inequality (amplitude value − minimum amplitude value) < threshold for judging a segment immediately before an unvoiced segment stands, the part is immediately deemed inappropriate as an interpolated waveform and the decision flag is turned OFF (unusable).
Step S33: If the above inequality does not stand (NO), it is next examined whether the inequality (amplitude value − minimum amplitude value) < threshold 1 for judging a relative minimum stands.
Step S34: If that inequality stands (YES), it is further examined whether the inequality (maximum amplitude value − amplitude value) < threshold 2 for judging a relative minimum stands.
Step S35: If this inequality also stands (YES), the use of the voice data as an interpolated waveform is ultimately disabled (decision flag = OFF). That is, such a waveform piece lies at a relative minimum of the amplitude envelope, as at the positions Pc1 and Pc2 described above.
Step S36: Accordingly, if any of the judgment results in the above steps S31, S33, and S34 is "NO", the voice data is permitted to be used as an interpolated waveform (decision flag = ON).
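Steps S31 to S36 might be transcribed as the following decision function. The inequalities follow the steps above as stated; the rate-of-change test of step S31 is simplified to a single assumed limit, which minimum amplitude value step S32 refers to is not explicit in the text (the unvoiced minimum of step S17 is assumed), and the threshold dictionary is the one computed from the formulas given below.

```python
def judge_usable(amp, env_change_rate, th, change_rate_limit=0.2):
    """Return True when the decision flag is ON (the waveform piece with
    frame amplitude `amp` may be used as an interpolated waveform)."""
    # S31: only parts having a small rate of amplitude change are
    # examined further; otherwise the flag stays ON.
    if env_change_rate >= change_rate_limit:
        return True
    # S32: segment immediately before an unvoiced segment (breath group
    # end)? If so, immediately deemed inappropriate (flag OFF).
    if amp - th["unvoiced_min"] < th["breath_end"]:
        return False
    # S33: relative-minimum test 1.
    if amp - th["voiced_min"] < th["relative_min_1"]:
        # S34: relative-minimum test 2.
        if th["voiced_max"] - amp < th["relative_min_2"]:
            return False  # S35: ultimately disabled (flag OFF)
    return True  # S36: some test answered "NO" -> flag ON
```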
In summary, the third example and the fourth example illustrate a voice waveform interpolating apparatus further provided with a judgment threshold setting unit 12, which sets the amplitude judgment threshold T1 for judging the appropriateness of the interpolated voice data Dc in the voice waveform judging unit 7 based on the voice data Din stored in the voice storage unit 2 and the amplitude information stored in the amplitude information storage unit 9. The above fourth example further illustrates a voice waveform interpolating apparatus additionally provided with a speaker identifying unit 14 identifying the speaker of the voice data, as described later.
The judgment threshold setting unit 12, in order to cope with the constantly changing voice data Din, calculates the judgment threshold T1 used when judging the voice waveform, based on the voice data of the voice storage unit 2 and the amplitude information of the amplitude information storage unit 9, and stores this calculated value T1 in the judgment threshold storage unit 15. Note that specific examples of each judgment threshold are given in the following.
Breath group end judgment threshold = (unvoiced segment) amplitude average value × 1.2
Relative minimum judgment threshold 1 = (voiced segment) minimum amplitude value × 1.2 (refer to S33 of FIG. 9)
Relative minimum judgment threshold 2 = (voiced segment) maximum amplitude value × 0.8 (refer to S34 of FIG. 9)
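These formulas, combined with the amplitude statistics sketched earlier, might be computed as follows; the dictionary keys match the judge_usable() sketch above.

```python
def compute_thresholds(stats):
    """Judgment thresholds per the formulas above; `stats` is the
    dictionary produced by amplitude_statistics()."""
    return {
        "unvoiced_min": stats["unvoiced_avg"],        # cf. step S17
        "voiced_min": stats["voiced_min"],
        "voiced_max": stats["voiced_max"],
        "breath_end": stats["unvoiced_avg"] * 1.2,    # breath group end
        "relative_min_1": stats["voiced_min"] * 1.2,  # refer to S33
        "relative_min_2": stats["voiced_max"] * 0.8,  # refer to S34
    }
```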
On the other hand, the amplitude usage range setting unit 13 sets the range of the amplitude information to be used for judgment, for example, (i) by time, (ii) by unvoiced segments, or (iii) by breath group.
Explaining the above (i) to (iii) in further detail:
(i) Time is specified, for example, 3 seconds before a packet loss.
(ii) A segment between unvoiced segments is set to be the amplitude usage range based on the results of judgment of the voiced sound/unvoiced sound judging unit 11. However, the unvoiced segments include not only segments of background noise alone, but also those with fricative sounds (for example, the consonant part of the sound "sa") and plosive sounds (for example, the consonant part of the sound "ta").
(iii) The range of one breath group, that is, the range spoken in one breath, is set to be the amplitude usage range based on the judgment results of the voiced sound/unvoiced sound judging unit 11.
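Criterion (i), for instance, reduces to a simple time window over the stored frame amplitudes; a sketch, with the 4 ms frame length again assumed:

```python
def usage_range_by_time(envelope, loss_frame, seconds=3.0, frame_ms=4.0):
    """Criterion (i): restrict the amplitude information used for the
    judgment to the `seconds` of frames preceding the packet loss."""
    n = int(seconds * 1000.0 / frame_ms)  # number of frames in the range
    start = max(0, loss_frame - n)
    return envelope[start:loss_frame]
```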
The voice waveform judging unit 7 in this case obtains the amplitude information within the amplitude usage range stored in the amplitude usage range storage unit 16 from the amplitude information storage unit 9 and calculates the minimum amplitude value, maximum amplitude value, etc. Further, the judgment threshold in the judgment threshold storage unit 15 is used for the judgment; the judgment method at this time is as illustrated in the flowchart of FIG. 9.
The speaker identifying unit 14 in the fourth example identifies the speaker of the input voice data, and judgment thresholds are held for each identified speaker in the judgment threshold storage unit 15.
When voice packet loss occurs, speaker identification is performed from the voice data of the voice storage unit 2. The voice waveform judging unit 7 uses the threshold information for each speaker stored in the judgment threshold storage unit 15 so as to judge the waveform. At that time, by using thresholds by speaker, the judgment performance may be further improved.
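Holding thresholds per identified speaker might be as simple as keying the judgment threshold storage by a speaker label, reusing compute_thresholds() from the earlier sketch; the identification method itself is not modeled here.

```python
class JudgmentThresholdStore:
    """Judgment threshold storage keyed by speaker, so the voice
    waveform judging unit can use speaker-specific thresholds."""

    def __init__(self):
        self._by_speaker = {}

    def update(self, speaker_id, stats):
        """Recompute and store thresholds for one identified speaker."""
        self._by_speaker[speaker_id] = compute_thresholds(stats)

    def get(self, speaker_id, fallback_stats):
        """Return the speaker's thresholds, falling back to thresholds
        derived from the current voice data for unknown speakers."""
        if speaker_id in self._by_speaker:
            return self._by_speaker[speaker_id]
        return compute_thresholds(fallback_stats)
```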
Different methods of waveform interpolation are considered, as explained above. For example, there are the methods illustrated above.
As a further separate aspect, when using voice waveform data after the lost packet, if the segment immediately after the lost packet segment is deemed inappropriate for use in waveform repetition, a further later (forward) packet is judged. If that packet is deemed appropriate for repeated use, the waveform of the segment deemed appropriate is first arranged once, and then the waveform of that later (forward) packet is repeatedly used and connected to generate the waveform of the interpolated segment W.
Step S41: An input voice signal (Din), the subject of judgment, is obtained in the interpolated waveform setting function unit 5.
Step S42: It is judged if the input packet constituting the input voice signal is a packet before (backward) or after (forward) the lost packet.
Step S43: If it is a packet before (backward) the lost packet, the judgment is first performed for that waveform (the illustrated U segment).
Step S44: If the preceding (backward) packet is judged inappropriate for repeated use for an interpolated segment based on the judgment results (NO),
Step S45: One further previous (backward) packet (the illustrated V segment) is covered by the judgment and similar operations are performed.
Step S46: At step S44, if it is deemed appropriate for repeated use in the interpolated segment (YES), the waveform at the interpolated segment is generated with the preceding (backward) waveform deemed appropriate.
Further, a different method of interpolation is as follows.
Step S47: At the above step S42, it is judged if the input packet constituting the input voice signal is a packet before (backward) or after (forward) the lost packet; if the packet is a later (forward) packet, the judgment is first performed for its waveform (the illustrated waveform piece Pr).
Step S48: If the later packet is deemed inappropriate for repeated use in the interpolated segment based on the judgment results (NO),
Step S49: One further later (forward) packet is covered by the judgment and similar operations are performed.
Step S50: At step S48, if it is deemed appropriate for repeated use in an interpolated segment (YES), the waveform at the interpolated segment is generated with a later (forward) waveform deemed appropriate.
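Steps S41 to S50 amount to searching outward from the lost segment for the nearest packet whose waveform the judging unit accepts. The following sketch combines the backward branch (S43 to S46) and the forward branch (S47 to S50) into one search for illustration; `is_usable` stands for the decision flag judgment of the voice waveform judging unit 7, and the search depth is an assumed bound.

```python
def find_repetition_source(packets, lost_idx, is_usable, max_steps=3):
    """Search backward from the packet before the lost one (S43, S45),
    then forward from the packet after it (S47, S49), returning the
    first waveform judged appropriate for repeated use, or None."""
    for step in range(1, max_steps + 1):  # backward (preceding) packets
        i = lost_idx - step
        if i >= 0 and is_usable(packets[i]):
            return packets[i]
    for step in range(1, max_steps + 1):  # forward (succeeding) packets
        i = lost_idx + step
        if i < len(packets) and is_usable(packets[i]):
            return packets[i]
    return None  # no appropriate waveform piece within the search range
```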
The voice waveform interpolating apparatus explained above may also be expressed as the steps of a method. That is, it is a voice waveform interpolating method generating voice data in which a part of the stored voice data Din is interpolated using another part of the voice data, comprising (i) a first step of storing the voice data Din, (ii) a second step of judging if a part of the voice data is appropriate as interpolated voice data Dc for interpolation, selecting the voice data deemed appropriate, and setting it as the interpolated voice data Dc, and (iii) a third step of combining the voice data stored in the first step (i) with the interpolated voice data Dc set in the second step (ii).
Further, it is a voice waveform interpolating method including in the second step (ii) an analysis step analyzing the amplitude information for the voice data Din stored in the first step (i) and a voice waveform judging step judging its appropriateness for use as the interpolated voice data Dc based on the analysis results.
Further, the above embodiment may be expressed as a computer-readable recording medium storing a voice waveform interpolating program, the program being a voice waveform interpolating program generating voice data in which a part of the voice data Din stored in the computer is interpolated with another part of the voice data and executing (i) a first step of storing the voice data Din, (ii) a second step of judging if a part of the voice data is appropriate as interpolated voice data Dc for interpolation, selecting the voice data deemed appropriate, and setting it as the interpolated voice data Dc, and (iii) a third step of combining the voice data stored in the first step (i) with the interpolated voice data Dc set in the second step (ii).
This application is a continuation application based on International Application No. PCT/JP2007/054849, filed on Mar. 12, 2007, the contents being incorporated herein by reference.
| Relation | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | PCT/JP2007/054849 | Mar 2007 | US |
| Child | 12585005 | | US |