This invention relates to a technique for processing a digital voice signal, in the fields of application of packet voice communication and digital voice storage. More particularly, the invention relates to a data embedding technique in which a portion of encoded voice code (digital code) that has been produced by a voice compression technique is replaced with optional data to thereby embed the optional data in the encoded voice code while maintaining conformance to the specifications of the data format and without sacrificing voice quality.
Such a data embedding technique, in conjunction with voice encoding techniques applied to digital mobile wireless systems, packet voice transmission systems typified by VoIP, and digital voice storage, is meeting with greater demand and is becoming more important both as a digital watermark technique, through which the concealment of communication is enhanced by embedding copyright or ID information in a transmit bit sequence without affecting the bit sequence, and as a functionality-extending technique.
The explosive growth of the Internet has been accompanied by increasing demand for Internet telephony, in which voice data is transmitted by IP packets. The transmission of voice data by packets has the advantage of making possible the unified transmission of different media, such as commands and image data. Until now, however, multimedia communication has mainly been carried out by transmitting each medium independently over a different channel. Further, though services through which telephone rates for users are lowered by the insertion of advertisements and the like are also available, such services are provided only at the outset, when the call is initiated. Moreover, although packet transmission allows different media to be carried in unified fashion, the transmission format is well known, so a problem arises in terms of concealment of information. With this as a background, digital watermark techniques for embedding copyright information in compressed voice data (code) have been proposed.
In order to raise the efficiency of transmission, voice encoding techniques for the highly efficient compression of voice have been adopted. In particular, in the area of VoIP, voice encoding techniques such as those compliant with G.729 standardized by the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) are dominant. Voice encoding techniques such as AMR (Adaptive Multi-Rate) standardized by 3GPP (3rd Generation Partnership Project) have been adopted in the field of mobile communications as well. What these techniques have in common is that they are based upon an algorithm referred to as CELP (Code Excited Linear Prediction). Encoding and decoding schemes compliant with G.729 are as set forth below.
Structure and Operation of Encoder
$$H(z) = \frac{1}{1 + \sum_{i=1}^{M} \alpha_i z^{-i}} \tag{1}$$
Generally, in the case of voice in the telephone band, a value of 10 to 12 is used as the filter order M. The LPC analyzer 1 performs LPC analysis using 80 samples of the input signal, 40 pre-read samples and 120 past signal samples, for a total of 240 samples, and obtains the LPC coefficients.
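As a minimal sketch of the all-pole filter of Equation (1), the output can be computed sample by sample as s[n] = e[n] − Σ αi·s[n−i]. The function name `lpc_synthesize` and the single example coefficient are illustrative assumptions and are not taken from G.729.

```python
def lpc_synthesize(excitation, alpha):
    """Filter `excitation` through H(z) = 1 / (1 + sum_i alpha[i-1] * z^-i).

    `alpha` holds the M LPC coefficients alpha_1 .. alpha_M.
    Past outputs before the start of the signal are taken as zero.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


# Hypothetical first-order example: H(z) = 1 / (1 - 0.5 z^-1).
impulse_response = lpc_synthesize([1.0, 0.0, 0.0, 0.0], [-0.5])
```

With this first-order example the impulse response decays geometrically, which is the expected behavior of a stable one-pole filter.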
A parameter converter 2 converts the LPC coefficients to LSP (Line Spectrum Pair) parameters. An LSP parameter is a frequency-domain parameter that can be converted to and from LPC coefficients. Because LSP parameters have better quantization characteristics than LPC coefficients, quantization is performed in the LSP domain. An LSP quantizer 3 quantizes the LSP parameters obtained by the conversion and obtains an LSP code and an LSP dequantized value. An LSP interpolator 4 obtains an LSP interpolated value from the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame. More specifically, one frame is divided into two subframes, namely first and second subframes, of 5 ms each, and the LPC analyzer 1 determines the LPC coefficients of the second subframe but not of the first subframe. Using the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame, the LSP interpolator 4 therefore estimates the LSP dequantized value of the first subframe by interpolation.
A parameter deconverter 5 converts the LSP dequantized value and the LSP interpolated value to LPC coefficients and sets these coefficients in an LPC synthesis filter 6. In this case, the LPC coefficients converted from the LSP interpolated values are used as the filter coefficients of the LPC synthesis filter 6 in the first subframe of the frame, and the LPC coefficients converted from the LSP dequantized values are used in the second subframe. In the description that follows, the character preceding the subscript in symbols such as lspi, li(n), . . . , is the lowercase letter “l” of the alphabet, not the numeral “1”.
After LSP parameters lspi (i=1, . . . , M) are quantized by vector quantization in the LSP quantizer 3, the quantization indices (LSP codes) are sent to a decoder.
Next, excitation and gain search processing is executed. Excitation and gain are processed on a per-subframe basis. First, an excitation signal is divided into a periodic component and a non-periodic component. An adaptive codebook 7 storing a sequence of past excitation signals is used to quantize the periodic component, and an algebraic codebook or fixed codebook is used to quantize the non-periodic component. Described below is voice encoding that uses the adaptive codebook 7 and a fixed codebook 8 as excitation codebooks.
The adaptive codebook 7 is adapted to output N samples of excitation signals (referred to as “periodicity signals”), which are delayed successively by one sample, in association with indices 1 to L, where N represents the number of samples in one subframe. The adaptive codebook 7 has a buffer for storing the periodic component of the latest (L+39) samples. A periodicity signal comprising 1st to 40th samples is specified by index 1, a periodicity signal comprising 2nd to 41st samples is specified by index 2, . . . , and a periodicity signal comprising Lth to (L+39)th samples is specified by index L. In the initial state, the content of the adaptive codebook 7 is such that all signals have amplitudes of zero. Operation is such that a subframe length of the oldest signals is discarded subframe by subframe in terms of time so that the excitation signal obtained in the present frame will be stored in the adaptive codebook 7.
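The index-to-samples mapping described above can be sketched as a sliding window over the buffer. The names `periodicity_signal` and `MAX_INDEX`, and the value 5 chosen for the largest index L, are hypothetical; the sketch also assumes that the “1st sample” of the text is element 0 of a Python list.

```python
N = 40          # samples per subframe, as stated in the text
MAX_INDEX = 5   # hypothetical value of L, chosen only to keep the example small


def periodicity_signal(buffer, index):
    """Return the N-sample periodicity signal selected by a 1-based index.

    Index 1 selects samples 1..40, index 2 selects samples 2..41, and so on.
    """
    max_index = len(buffer) - N + 1
    if not 1 <= index <= max_index:
        raise ValueError("index outside the codebook range")
    start = index - 1
    return buffer[start:start + N]


# Initial state: a buffer of (L + 39) samples, all with amplitude zero.
initial_buffer = [0.0] * (MAX_INDEX + N - 1)
first_vector = periodicity_signal(initial_buffer, 1)
```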
An adaptive-codebook search identifies the periodicity component of the excitation signal using the adaptive codebook 7 storing past excitation signals. That is, a subframe length (=40 samples) of past excitation signals in the adaptive codebook 7 is extracted while changing, one sample at a time, the point at which read-out from the adaptive codebook 7 starts, and the excitation signals are input to the LPC synthesis filter 6 to create a pitch synthesis signal βAPL, where PL represents a past pitch periodicity signal (adaptive excitation vector), which corresponds to delay L, extracted from the adaptive codebook 7, A the impulse response of the LPC synthesis filter 6, and β the gain of the adaptive codebook.
An arithmetic unit 9 finds an error power EL between the input voice X and βAPL in accordance with the following equation:
$$E_L = \left| X - \beta A P_L \right|^{2} \tag{2}$$
If we let APL represent a weighted synthesized output from the adaptive codebook, Rpp the autocorrelation of APL and Rxp the cross-correlation between APL and the input signal X, then an adaptive excitation vector PL at a pitch lag Lopt for which the error power of Equation (2) is minimum will be expressed by the following equation:
$$P_L = \arg\max\left(\frac{R_{xp}^{2}}{R_{pp}}\right) \tag{3}$$
That is, the optimum starting point for read-out from the codebook is that at which the value obtained by normalizing the cross-correlation Rxp between the pitch synthesis signal APL and the input signal X by the autocorrelation Rpp of the pitch synthesis signal is largest. Accordingly, an error-power evaluation unit 10 finds the pitch lag Lopt that satisfies Equation (3). Optimum pitch gain βopt is given by the following equation:
$$\beta_{opt} = \frac{R_{xp}}{R_{pp}} \tag{4}$$
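Equations (2) to (4) fit together as follows: for a fixed lag, the gain that minimizes the error power is Rxp/Rpp, and substituting it back leaves an error of |X|² − Rxp²/Rpp, so minimizing the error is the same as maximizing Rxp²/Rpp. The sketch below illustrates that selection under stated assumptions: the candidate vectors are assumed to be already filtered (that is, they are the APL of the text), and the names `search_adaptive_codebook` and `synthesized` are hypothetical. A real G.729 search involves further steps not shown here.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def search_adaptive_codebook(x, synthesized):
    """Pick the lag maximizing Rxp^2 / Rpp and return (lag, optimum gain).

    `x` is the target signal X; `synthesized` maps each candidate lag L to
    the already-filtered vector A*P_L (an assumption of this sketch).
    """
    best_lag, best_score, best_gain = None, -1.0, 0.0
    for lag, ap in synthesized.items():
        rpp = dot(ap, ap)          # autocorrelation Rpp
        if rpp == 0.0:
            continue               # an all-zero candidate carries no information
        rxp = dot(x, ap)           # cross-correlation Rxp
        score = rxp * rxp / rpp    # criterion of Equation (3)
        if score > best_score:
            best_lag, best_score, best_gain = lag, score, rxp / rpp
    return best_lag, best_gain


# Hypothetical four-sample example with three candidate lags.
target = [2.0, 0.0, 2.0, 0.0]
candidates = {
    20: [1.0, 0.0, 1.0, 0.0],
    21: [0.0, 1.0, 0.0, 1.0],
    22: [1.0, 1.0, 0.0, 0.0],
}
lag_opt, beta_opt = search_adaptive_codebook(target, candidates)
```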
Next, the non-periodic component contained in the excitation signal is quantized using the fixed codebook 8. The latter is constituted by a plurality of pulses of amplitude 1 or −1. By way of example, Table 1 illustrates pulse positions for a case where subframe length is 40 samples.
The fixed codebook 8, which is an algebraic codebook, divides the N (=40) sampling points constituting one subframe into a plurality of pulse-system groups 1 to 4 and, for all combinations obtained by extracting one sampling point (m0 to m3) from each of the pulse-system groups, successively outputs, as non-periodic components, pulsed signals having a +1 or a −1 pulse at each extracted sampling point. In this example, basically four pulses are deployed per subframe.
(1) Eight sampling points 0, 5, 10, 15, 20, 25, 30, 35 are assigned to the pulse-system group 1;
(2) eight sampling points 1, 6, 11, 16, 21, 26, 31, 36 are assigned to the pulse-system group 2;
(3) eight sampling points 2, 7, 12, 17, 22, 27, 32, 37 are assigned to the pulse-system group 3; and
(4) 16 sampling points 3, 4, 8, 9, 13, 14, 18, 19, 23, 24, 28, 29, 33, 34, 38, 39 are assigned to the pulse-system group 4.
Three bits are required to express the sampling point in each of pulse-system groups 1 to 3 and one bit is required to express the sign of the pulse, for a total of four bits per group. Further, four bits are required to express the sampling point in pulse-system group 4 and one bit is required to express the sign of the pulse, for a total of five bits. Accordingly, 17 bits are necessary to specify a pulsed excitation signal output from the fixed codebook 8 having the pulse placement of Table 1, and 2^17 (= 2^4 × 2^4 × 2^4 × 2^5) types of pulsed excitation signals exist.
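The bit count above can be checked with a short sketch that rebuilds the four pulse-system groups from the listed sampling points. The names `TRACKS` and `bits_for_track` are illustrative only.

```python
# Sampling points of pulse-system groups 1 to 4, as listed in the text.
TRACKS = [
    list(range(0, 40, 5)),                                   # group 1
    list(range(1, 40, 5)),                                   # group 2
    list(range(2, 40, 5)),                                   # group 3
    sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),   # group 4
]


def bits_for_track(track):
    """Bits needed to select one position in the track, plus one sign bit."""
    position_bits = (len(track) - 1).bit_length()
    return position_bits + 1


total_bits = sum(bits_for_track(track) for track in TRACKS)
codebook_size = 2 ** total_bits
```

Groups 1 to 3 each contribute 3 + 1 bits and group 4 contributes 4 + 1 bits, which gives the 17-bit noise code.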
The pulse positions of each of the pulse systems are limited, as illustrated in Table 1. In the fixed codebook search, a combination of pulses for which the error power relative to the input voice is minimized in the reconstruction region is decided from among the combinations of pulse positions of each of the pulse systems. More specifically, with βopt as the optimum pitch gain found by the adaptive-codebook search, the output PL of the adaptive codebook is multiplied by βopt and the product is input to an adder 11. At the same time, the pulsed excitation signals are input successively to the adder 11 from the fixed codebook 8 and a pulsed excitation signal is specified that will minimize the difference between the input signal X and a reproduced signal obtained by inputting the adder output to the LPC synthesis filter 6. More specifically, first a target vector X′ for a fixed codebook search is generated in accordance with the following equation from the optimum adaptive codebook output PL and optimum pitch gain βopt obtained from the input signal X by the adaptive-codebook search:
$$X' = X - \beta_{opt} A P_L \tag{5}$$
In this example, pulse position and amplitude (sign) are expressed by 17 bits and therefore 2^17 combinations exist. Accordingly, letting Ck represent a kth excitation vector, an excitation vector Ck that will minimize an evaluation-function error power D in the following equation is found by a search of the fixed codebook:
$$D = \left| X' - G_c A C_k \right|^{2} \tag{6}$$
where Gc represents the gain of the fixed codebook. In the fixed codebook search, the error-power evaluation unit 10 searches for the combination of pulse position and polarity that will afford the largest normalized cross-correlation value (Rcx×Rcx/Rcc), which is obtained by normalizing the square of a cross-correlation value Rcx between a noise synthesis signal ACk and the target signal X′ by an autocorrelation value Rcc of the noise synthesis signal.
Gain quantization will be described next. With the G.729 system, fixed codebook gain is not quantized directly. Rather, the adaptive codebook gain Ga (=βopt) and a correction coefficient γ of the fixed codebook gain Gc are vector quantized. The fixed codebook gain Gc and the correction coefficient γ are related as follows:
$$G_c = g' \times \gamma$$
where g′ represents the gain of the present frame predicted from the logarithmic gains of the four past subframes.
A gain quantizer 12 has a gain quantization table, not shown, for which there are prepared 128 (=2^7) combinations of adaptive codebook gain Ga and correction coefficients γ for fixed codebook gain. The method of the gain codebook search includes (1) extracting one set of table values from the gain quantization table with regard to an output vector from the adaptive codebook and an output vector from the fixed codebook and setting these values in gain varying units 13, 14, respectively; (2) multiplying these vectors by gains Ga, Gc using the gain varying units 13, 14, respectively, and inputting the products to the LPC synthesis filter 6; and (3) selecting, by way of the error-power evaluation unit 10, the combination for which the error power relative to the input signal X is smallest.
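The three search steps can be sketched as an exhaustive loop over the table. This is an illustration under stated assumptions, not the G.729 procedure itself: the vectors passed in are assumed to be already filtered by the synthesis filter (which is equivalent to filtering the gain-scaled sum, because the filter is linear), the three-entry table is invented for the example whereas the real table has 128 entries, and the names `search_gain_table` and `g_pred` (standing for g′) are hypothetical.

```python
def search_gain_table(x, ap, ac, g_pred, table):
    """Return the index of the (Ga, gamma) pair with the smallest error power.

    x      : input signal X
    ap, ac : filtered adaptive and fixed codebook vectors (assumed inputs)
    g_pred : predicted gain g', so that Gc = g' * gamma
    table  : list of (Ga, gamma) pairs
    """
    best_index, best_err = None, float("inf")
    for index, (ga, gamma) in enumerate(table):
        gc = g_pred * gamma
        err = sum((xi - ga * p - gc * c) ** 2 for xi, p, c in zip(x, ap, ac))
        if err < best_err:
            best_index, best_err = index, err
    return best_index


# Hypothetical two-sample example with a three-entry table.
example_table = [(0.5, 0.5), (1.0, 1.0), (1.0, 0.5)]
gain_code = search_gain_table([1.0, 2.0], [1.0, 0.0], [0.0, 1.0], 2.0, example_table)
```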
A channel multiplexer 15 creates channel data by multiplexing (1) an LSP code, which is the quantization index of the LSP, (2) a pitch-lag code Lopt, which is the quantization index of the adaptive codebook, (3) a noise code, which is a fixed codebook index, and (4) a gain code, which is a quantization index of gain. In actuality, it is necessary to perform channel encoding and packetization processing before transmission to the transmission line.
Decoder Structure and Operation
Upon receiving the LSP code as an input, an LSP dequantizer 22 applies dequantization and outputs an LSP dequantized value. An LSP interpolator 23 interpolates an LSP dequantized value of the first subframe of the present frame from the LSP dequantized value in the second subframe of the present frame and the LSP dequantized value in the second subframe of the previous frame. Next, a parameter deconverter 24 converts the LSP interpolated value and the LSP dequantized value to LPC synthesis filter coefficients. A G.729A-compliant synthesis filter 25 uses the LPC coefficient converted from the LSP interpolated value in the initial first subframe and uses the LPC coefficient converted from the LSP dequantized value in the ensuing second subframe.
An adaptive codebook 26 outputs a pitch signal of subframe length (=40 samples) from a read-out starting point specified by a pitch-lag code, and a fixed codebook 27 outputs a pulse position and pulse polarity from a read-out position that corresponds to an algebraic code. A gain dequantizer 28 calculates an adaptive codebook gain dequantized value and a fixed codebook gain dequantized value from the gain code applied thereto and sets these values in gain varying units 29, 30, respectively. An adder 31 creates an excitation signal by adding a signal, which is obtained by multiplying the output of the adaptive codebook by the adaptive codebook gain dequantized value, and a signal obtained by multiplying the output of the fixed codebook by the fixed codebook gain dequantized value. The excitation signal is input to the LPC synthesis filter 25. As a result, reproduced voice can be obtained from the LPC synthesis filter 25.
In the initial state, the content of the adaptive codebook 26 on the decoder side is such that all signals have amplitudes of zero. Operation is such that a subframe length of the oldest signals is discarded subframe by subframe in terms of time so that the excitation signal obtained in the present frame will be stored in the adaptive codebook 26. In other words, the adaptive codebook 7 of the encoder and the adaptive codebook 26 of the decoder are always maintained in the identical, latest state.
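The decoder-side excitation and codebook update described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the oldest samples sit at the front of the buffer list; the text does not state how the buffer is laid out in memory.

```python
def build_excitation(pitch_vec, fixed_vec, ga, gc):
    """Adder 31: gain-scaled adaptive output plus gain-scaled fixed output."""
    return [ga * p + gc * c for p, c in zip(pitch_vec, fixed_vec)]


def update_adaptive_codebook(buffer, excitation):
    """Discard the oldest subframe length of samples and store the newest.

    The buffer length is unchanged. Running the same update in encoder and
    decoder keeps both codebooks in the identical state.
    """
    return buffer[len(excitation):] + list(excitation)


# Hypothetical two-sample "subframe" and a six-sample buffer, initially zero.
codebook = [0.0] * 6
excitation = build_excitation([1.0, 2.0], [1.0, -1.0], 0.5, 2.0)
codebook = update_adaptive_codebook(codebook, excitation)
```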
Digital Watermark Technique
The specification of Japanese Patent Application Laid-Open No. 11-272299 discloses a “Method of Embedding Watermark Bits when Encoding Voice” as a digital watermark technique to which CELP is applied.
If watermark information is embedded in all codes using the same key Kp, there is a good possibility of decryption by an unauthorized third party. This makes it necessary to enhance concealment. If the total value of m0 to m3 is represented by Cp, the total value will be any of the 58 shown at (a) of
In cases where other media are transmitted on channels that are independent of the voice channel, basically it is required that the terminals at both ends provide multichannel support. A problem which arises in such cases is that limitations are imposed at the terminals connected to a conventional communications network. This is true with regard to 2nd generation mobile telephones, for example, which presently are in most widespread use. Further, even if the terminals at both ends offer multichannel support and make it possible to transmit a plurality of media, routes have a random nature in the case of packet switching, making it difficult to achieve synchronization and linkage at repeaters along the way. A particular problem is that complicated control such as route setting and synchronization processing is required for linkage that employs data accompanying voice per se issued by a specific user.
With the conventional digital watermark technique, use of a key is essential. In addition, the target of embedded data is limited to a pulse position in the fourth pulse system of the fixed codebook. As a consequence, there is a good possibility that the existence of the key will become known to the user. If the user becomes aware of the key, the user can specify the embedded position. This leads to the possibility of leakage and falsification of data.
Further, with the conventional digital watermark technique, since the foregoing is “probability-based” control in which execution or non-execution of data embedding depends upon the total value of pulse position candidates, there is a possibility that the sound-quality degrading effect of embedding of data will be significant. There is need for a data embedding technique as a communication standard in which the embedding of data is concealed, i.e., in which there is no decline in sound quality when decoding (reproduced voice) is performed at a terminal. However, since the prior-art technique results in degraded sound quality, it has not been able to satisfy this need.
Accordingly, an object of the present invention is to so arrange it that data can be embedded in encoded voice code on the encoder side and extracted correctly on the decoder side without both the encoder and decoder sides possessing a key.
Another object of the present invention is to so arrange it that there is almost no decline in sound quality even if data is embedded in encoded voice code, thereby making the embedding of data concealed to the listener of reproduced voice.
A further object of the present invention is to make the leakage and falsification of embedded data difficult to achieve.
Still another object of the present invention is to so arrange it that both data and control code can be embedded, thereby enabling the decoder side to execute processing in accordance with the control code.
Another object of the present invention is to so arrange it that the transmission capacity of embedded data can be increased.
Another object of the present invention is to make it possible to transmit multimedia such as voice, images and personal information on a voice channel alone.
Another object of the present invention is to so arrange it that any information such as advertisement information can be provided to end users performing mutual communication of voice data.
Another object of the present invention is to so arrange it that sender, recipient, receive time and call category, etc., can be embedded and stored in voice data that has been received.
According to a first aspect of the present invention, when optional data is embedded in encoded voice code, it is determined whether data embedding conditions are satisfied using a first element code, from among element codes constituting the encoded voice code, and a threshold value, and optional data is embedded in the encoded voice code by replacing a second element code with the optional data if the data embedding conditions are satisfied. More specifically, the first element code is a fixed codebook gain code and the second element code is a noise code, which is an index of a fixed codebook. When a dequantized value of the fixed codebook gain code is smaller than the threshold value, it is determined that the data embedding conditions are satisfied and the noise code is replaced with prescribed data, whereby the data is embedded in the encoded voice code. In another concrete example, the first element code is a pitch-gain code and the second element code is a pitch-lag code, which is an index of an adaptive codebook. When a dequantized value of the pitch-gain code is smaller than the threshold value, it is determined that the data embedding conditions are satisfied and the pitch-lag code is replaced with optional data, whereby the optional data is embedded in the encoded voice code.
Taking note of two types of code vectors of an excitation signal, namely an adaptive code vector (pitch-lag code) corresponding to the pitch excitation and a fixed code vector (noise code) corresponding to the noise excitation, it is possible to regard gain as being a factor that indicates the degree of contribution of each code vector. Accordingly, gain is defined as a decision parameter. If the gain is less than a threshold value, it is determined that the degree of contribution of the corresponding excitation code vector is low and the index of this excitation code vector is replaced with an optional data sequence. As a result, it is possible to embed optional data while suppressing the effects of this replacement. Further, by controlling the threshold value, the amount of embedded data can be adjusted while taking into account the effect upon reproduced speech quality.
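The embedding decision of the first aspect can be sketched per subframe as below. The function name `embed_subframe`, the use of integers for codes, and the 17-bit field width (the noise-code size given earlier for G.729) are assumptions of this sketch; the same shape would apply to the pitch-gain and pitch-lag pair with a different field width.

```python
NOISE_CODE_BITS = 17  # width of the G.729 noise code, per the description above


def embed_subframe(gain, noise_code, data_word, threshold):
    """Return (code_to_send, embedded) for one subframe.

    gain       : dequantized gain used as the decision parameter
    noise_code : index produced by the encoder
    data_word  : optional data waiting to be embedded
    The gain code itself is left untouched, so the receiving side can
    repeat the same comparison.
    """
    if not 0 <= data_word < (1 << NOISE_CODE_BITS):
        raise ValueError("data word must fit in the noise-code field")
    if gain < threshold:
        return data_word, True      # low contribution: replace the index
    return noise_code, False        # otherwise send the code as is


# Hypothetical values: a low-gain subframe carries data, a high-gain one does not.
low_gain_result = embed_subframe(0.1, 0x1ABCD, 0x00F0F, 0.5)
high_gain_result = embed_subframe(0.9, 0x1ABCD, 0x00F0F, 0.5)
```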
According to a second aspect of the present invention, when extracting data that has been embedded in encoded voice code encoded by a prescribed voice encoding scheme, it is determined whether data embedding conditions are satisfied using a first element code, from among element codes constituting the encoded voice code, and a threshold value, and the embedded data is extracted upon determining that data has been embedded in a second element code portion of the encoded voice code if the data embedding conditions are satisfied. More specifically, the first element code is a fixed codebook gain code and the second element code is a noise code, which is an index of a fixed codebook. When a dequantized value of the fixed codebook gain code is smaller than the threshold value, it is determined that the data embedding conditions are satisfied and the embedded data is extracted from the noise code. In another concrete example, the first element code is a pitch-gain code and the second element code is a pitch-lag code, which is an index of an adaptive codebook. When a dequantized value of the pitch-gain code is smaller than the threshold value, it is determined that the data embedding conditions are satisfied and the embedded data is extracted from the pitch-lag code.
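On the extracting side, the same comparison identifies which codes carry data, so no key is exchanged. The sketch below assumes each subframe is represented as a pair (dequantized gain, second element code); the name `extract_stream` is hypothetical.

```python
def extract_stream(subframes, threshold):
    """Collect the embedded words from a sequence of subframes.

    `subframes` is an iterable of (dequantized_gain, code) pairs. A code is
    treated as embedded data exactly when its gain is below the threshold,
    mirroring the condition applied on the embedding side.
    """
    return [code for gain, code in subframes if gain < threshold]


# Hypothetical received stream: the first and third subframes carry data.
received = [(0.2, 5), (0.8, 77), (0.1, 9)]
extracted = extract_stream(received, 0.5)
```

Both sides must use the same threshold value for the decision to agree.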
If this arrangement is adopted, data can be embedded in encoded voice code on the encoder side and extracted correctly on the decoder side without both the encoder and decoder sides possessing a key. Further, it can be so arranged that there is almost no decline in sound quality even if data is embedded in encoded voice code, thereby making the embedding of data concealed to the listener of reproduced voice. Further, it can be made difficult to leak or falsify embedded data by changing threshold values.
According to a third aspect of the present invention, a voice encoding apparatus in a system having a voice encoding apparatus and a voice reproducing apparatus encodes voice by a prescribed voice encoding scheme and embeds optional data in the encoded voice code obtained. The voice reproducing apparatus extracts embedded data from the encoded voice code and reproduces voice from the encoded voice code. In this system, a first element code and a threshold value, which are used to determine whether data has been embedded or not, and a second element code in which data is embedded based upon result of the determination, are defined. When the voice encoding apparatus embeds data under these conditions, the voice encoding apparatus determines whether data embedding conditions are satisfied using the first element code, from among element codes constituting the encoded voice code, and the threshold value, and embeds optional data in the encoded voice code by replacing the second element code with the optional data if the data embedding conditions are satisfied. When data is extracted, on the other hand, the voice reproducing apparatus determines whether data embedding conditions are satisfied using the first element code, from among element codes constituting the encoded voice code, and the threshold value, determines that optional data has been encoded in the second element code of the encoded voice code if the data embedding conditions are satisfied, extracts the embedded data and then subjects the encoded voice code to decoding processing.
As a result, if only an initial value of a threshold value is defined in advance on both the transmitting and receiving sides, data can be embedded and extracted without using a key. Further, if a control code is defined as embedded data, a threshold value can be changed using this control code, and the amount of embedded data transmitted can be adjusted by changing the threshold value. Further, whether to embed only a data sequence, or whether to embed a data/control code sequence in a format that makes it possible to identify the type of data and control code, is decided in dependence upon a gain value. In a case where only a data sequence is embedded, therefore, it is unnecessary to include data-type information. This makes possible improvements relating to transmission capacity.
According to a fourth aspect of the present invention, there is provided a digital voice communication system for encoding voice by a prescribed voice encoding scheme and transmitting the encoded voice, comprising means for analyzing voice data obtained by encoding input voice; means for embedding any code in a specific segment of a portion of the voice data in accordance with result of analysis; and means for transmitting the embedded data as voice data; whereby additional data is transmitted at the same time as ordinary voice. According to the fourth aspect of the present invention, there is further provided a digital voice communication system comprising means for analyzing received voice data; and means for extracting code from a specific segment of a portion of the voice data in accordance with result of analysis; whereby additional data is received and output at the same time as ordinary voice.
Multimedia communication becomes possible by adopting image information (video of present surroundings and map images, etc.) and personal information (a portrait photograph, voice print or finger print, etc.), etc., as the additional information. Further, by adopting a terminal serial number or voice print, etc., as the personal information, the performance of authentication as to whether or not an individual is an authorized user can be enhanced. Moreover, it is possible to improve the security of voice data.
Further, the digital voice communication system is provided with a server apparatus for relaying voice data. It can be so arranged that optional information such as advertisement information is provided to end users, who are performing mutual communication of voice data, by the server.
Further, by embedding sender, recipient, receive time and call category, etc., in received voice data and storing the same in storage means, it is possible to put voice data into file form so that subsequent utilization can be facilitated.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings.
(A) Principle of the Present Invention
With a decoder that operates in accordance with the CELP algorithm, an excitation signal is generated based upon an index, which specifies an excitation sequence, and gain information, and voice is generated (reproduced) using a synthesis filter constituted by linear prediction coefficients. The reproduced voice is expressed by the following equation:
$$S_{rp} = H \cdot R = H(G_p \cdot P + G_c \cdot C) = H \cdot G_p \cdot P + H \cdot G_c \cdot C$$
where Srp represents reproduced voice, H an LPC synthesis filter, Gp adaptive code vector gain (pitch gain), P an adaptive code vector (pitch-lag code), Gc noise code vector gain (fixed codebook gain), and C a noise code vector. The first term on the right side is a pitch-period synthesis signal and the second term is a noise synthesis signal.
As set forth above, digital codes (transmit parameters) encoded according to CELP correspond to feature parameters in a voice generating system. Taking note of these features, it is possible to ascertain the status of each transmit parameter. For example, taking note of two types of code vectors of an excitation signal, namely an adaptive code vector corresponding to a pitch excitation and a noise code vector corresponding to a noise excitation, it is possible to regard gains Gp, Gc as being factors that indicate the degree of contribution of the code vectors P, C, respectively. More specifically, in a case where the gains Gp, Gc are low, the degrees of contribution of the corresponding code vectors are low. Accordingly, the gains Gp, Gc are defined as decision parameters. If a gain is less than a threshold value, it is determined that the degree of contribution of the corresponding excitation code vector P or C is low, and the index of this excitation code vector is replaced with an optional data sequence. As a result, it is possible to embed optional data while suppressing the effects of this replacement. Further, by controlling the threshold value, the amount of embedded data can be adjusted while taking into account the effect upon reproduced speech quality.
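The reason a low gain limits the damage can be seen numerically: if the noise code vector C is replaced by some other vector C′, the excitation changes by Gc·(C − C′), so the energy of the change scales with Gc². The sketch below illustrates this with invented four-sample vectors; the function names and all values are hypothetical.

```python
def excitation(gp, p, gc, c):
    """Excitation R = Gp*P + Gc*C, computed sample by sample."""
    return [gp * pi + gc * ci for pi, ci in zip(p, c)]


def error_energy(gp, p, gc, c_original, c_replaced):
    """Energy of the excitation change caused by replacing the code vector C."""
    a = excitation(gp, p, gc, c_original)
    b = excitation(gp, p, gc, c_replaced)
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical vectors: the same replacement at two different gains.
P = [1.0, 0.5, -0.5, 0.0]
C = [1.0, 0.0, 0.0, -1.0]
C_REPLACED = [0.0, 1.0, -1.0, 0.0]
energy_at_half = error_energy(1.0, P, 0.5, C, C_REPLACED)
energy_at_quarter = error_energy(1.0, P, 0.25, C, C_REPLACED)
```

Halving the gain reduces the error energy by a factor of four in this example, which is consistent with using gain as the decision parameter.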
This technique is such that if only an initial value of a threshold value is defined in advance on both the transmitting and receiving sides, whether or not embedded data exists and the location of embedded data can be determined and, moreover, the writing/reading of embedded data can be performed based solely upon decision parameters (pitch gain and fixed codebook gain) and embedding target parameters (pitch lag and noise code). In other words, transmission of a specific key is not required. Further, if a control code is defined as embedded data, the amount of embedded data transmitted can be adjusted merely by specifying a change in the threshold value by the control code.
Thus, by applying this technique, it is possible to embed any data without changing the encoding format. In other words, an ID or other media information can be embedded in voice information and transmitted/stored without sacrificing the compatibility that is essential in communication/storage applications and without the user being aware. In addition, according to the present invention, control specifications are stipulated by parameters common to CELP. This means that the invention is not limited to a specific scheme and therefore can be applied to a wide range of schemes. For example, G.729 suited to VoIP and AMR suited to mobile communications can be supported.
(a) General Structure
In
The codec 51 encodes input voice in accordance with G.729 and inputs the encoded voice code thus obtained to the data embedding controller 53. As shown in Table 2 below, the G.729-compliant encoded voice code has the following as element codes: an LSP code, an adaptive codebook index (pitch-lag code), a fixed codebook index (noise code) and a gain code. The gain code is obtained by combining and encoding pitch gain and fixed codebook gain.
The embedding decision unit 54 of the data embedding controller 53 uses the dequantized value of the gain code and the threshold value TH to determine whether data embedding conditions are satisfied, and the data embedding unit 55 replaces noise code with prescribed data to thereby embed the data in the encoded voice code if the data embedding conditions are satisfied. If the data embedding conditions are not satisfied, the data embedding unit 55 outputs the noise element code as is. The multiplexer 56 multiplexes and transmits the element codes that construct the encoded voice code.
The embedding decision unit 54 has the structure shown in
Table 3 below illustrates the result of a simulation in a case where the noise code (17 bits) serving as the fixed codebook index is replaced with optional data if gain is less than a certain value in the G.729 voice encoding scheme. Table 3 illustrates the results of evaluating, by SNR, a change in sound quality in a case where voice is reproduced upon adopting randomly generated data as the optional data and regarding this random data as noise code, as well as the proportion of a frame replaced with embedded data. It should be noted that the threshold values in Table 3 are gain index numbers; the greater the index number, the larger the gain serving as the threshold value. Further, SNR is the ratio (in dB) of the excitation signal in a case where the noise code in the encoded voice code is not replaced with data, to an error signal representing the difference between the excitation signal in a case where the noise code is not replaced with data and the excitation signal in a case where the noise code is replaced with data; SNRseg represents the SNR on a per-frame basis; and SNRtot represents the average SNR over the entire voice interval. The proportion (%) is that at which data is embedded once the gain has fallen below the corresponding threshold value in a case where a standard signal is input as the voice signal.
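The SNR definition above can be illustrated with a short sketch. The function names are hypothetical, and the text does not state exactly how per-frame values are combined into SNRseg or SNRtot, so the sketch only computes the basic ratio, either over one frame or over a whole interval, and leaves the averaging convention open.

```python
import math
from typing import List, Sequence


def snr_db(reference: Sequence[float], modified: Sequence[float]) -> float:
    """SNR in dB of the unmodified excitation against the error signal.

    `reference` is the excitation without replacement; `modified` is the
    excitation obtained after the noise code has been replaced with data.
    """
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, modified))
    if noise == 0.0:
        return math.inf    # identical signals: no error at all
    if signal == 0.0:
        return -math.inf   # silent reference with a non-zero error
    return 10.0 * math.log10(signal / noise)


def snr_per_frame(reference: Sequence[float], modified: Sequence[float],
                  frame_len: int) -> List[float]:
    """Apply snr_db to consecutive frames of `frame_len` samples."""
    return [
        snr_db(reference[i:i + frame_len], modified[i:i + frame_len])
        for i in range(0, len(reference), frame_len)
    ]
```

Calling `snr_db` on the full signals gives one whole-interval figure, while `snr_per_frame` gives the frame-by-frame values from which a segmental figure could be derived.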
As shown in Table 3, setting the threshold value of the fixed codebook gain to 12 makes it possible to replace 43% of the total transmission capacity of the fixed codebook index (noise code) with any data. In addition, even if decoding is performed as is by the decoder, the difference in sound quality can be held to a small 0.1 dB (=11.60−11.50) in comparison with a case where no data is embedded (i.e., a case where the threshold value is 0). This means that there is almost no decline in sound quality in G.729, and that it is possible to transmit any data at as high as 1462 bits/s [=0.43×17×(1000/5)]. Further, by raising or lowering the threshold value, the transmission capacity (proportion) of embedded data can also be adjusted while taking into account the effect upon sound quality. For example, if a change in sound quality of 0.2 dB is allowed, the transmission capacity can be increased to 46% (1564 bits/s) by setting the threshold value to 20.
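The capacity figures quoted above follow from simple arithmetic, reproduced here as a small sketch. The function name is hypothetical; the inputs (17 bits of noise code every 5 ms, and replacement proportions of 43% and 46%) are taken from the text and its bracketed formula.

```python
def embedded_bit_rate(proportion: float, bits_per_unit: int,
                      unit_ms: float) -> float:
    """Bits per second carried when `proportion` of the units are replaced."""
    units_per_second = 1000.0 / unit_ms
    return proportion * bits_per_unit * units_per_second


# 17-bit noise code every 5 ms, 43% of codes replaced (threshold 12)
print(round(embedded_bit_rate(0.43, 17, 5)))  # 1462
# Same code, 46% of codes replaced (threshold 20)
print(round(embedded_bit_rate(0.46, 17, 5)))  # 1564
```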
The codec 51 encodes input voice in accordance with G.729 and inputs the encoded voice code thus obtained to the data embedding controller 53. The embedding decision unit 54 of the data embedding controller 53 uses the dequantized value (pitch gain) of the gain code and the threshold value TH to determine whether data embedding conditions are satisfied, and the data embedding unit 55 replaces pitch-lag code with prescribed data to thereby embed the data in the encoded voice code if the data embedding conditions are satisfied. If the data embedding conditions are not satisfied, the data embedding unit 55 outputs the pitch-lag element code as is. The multiplexer 56 multiplexes and transmits the element codes that construct the encoded voice code.
The embedding decision unit 54 has the structure shown in
Table 4 below illustrates the result of a simulation in a case where the pitch-lag code (13 bits/10 ms) serving as the adaptive codebook index is replaced with optional data if gain is less than a certain value in the G.729 voice encoding scheme. Table 4 illustrates the results of evaluating, by SNR, a change in sound quality in a case where voice is reproduced upon adopting randomly generated data as the optional data and regarding this random data as pitch-lag code, as well as the proportion of a frame replaced with embedded data.
As shown in Table 4, setting the threshold value to gain 0.5 makes it possible to replace 45% of the total transmission capacity of the pitch-lag code, which is the adaptive codebook index. In addition, even if decoding is performed as is by the decoder, the difference in sound quality can be held to a small 0.24 dB (=11.60−11.36).
(a) General Structure
The data extraction unit 62, which has an embedding decision unit 65 and an assignment unit 66, extracts data from encoded voice code as appropriate. Using a first element code, which is from among element codes constituting the encoded voice code, and a threshold value TH, the embedding decision unit 65 determines whether data embedding conditions are satisfied. If these conditions are satisfied, the assignment unit 66 regards a second element code from among the element codes as embedded data, extracts the embedded data and sends this data to the data processor 63. The assignment unit 66 inputs the entered second element code to the decoder 64 as is regardless of whether the data embedding conditions are satisfied or not.
Upon receiving encoded voice code, the demultiplexer 61 demultiplexes the encoded voice code into element codes and inputs these to the data extraction unit 62. On the assumption that encoding has been performed in accordance with G.729, the demultiplexer 61 demultiplexes the encoded voice code into LSP code, pitch-lag code, noise code and gain code and inputs these to the data extraction unit 62. It should be noted that the gain code is the result of combining pitch gain and fixed codebook gain and quantizing (encoding) these using a quantization table.
Using the dequantized value of the gain code and the threshold value TH, the embedding decision unit 65 of the data extraction unit 62 determines whether data embedding conditions are satisfied. If data embedding conditions are satisfied, the assignment unit 66 regards the noise code as embedded data, inputs the embedded data to the data processor 63 and inputs the noise code to the decoder 64 in the form in which it was applied thereto. If the data embedding conditions are not satisfied, the assignment unit 66 inputs the noise code to the decoder 64 in the form in which it was applied thereto.
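The receiving-side behaviour described above can be sketched as follows. The function name and return convention are hypothetical; the point illustrated is that the same gain test used on the transmitting side locates the embedded data, and that the noise code is forwarded to the decoder unchanged in either case.

```python
from typing import Optional, Tuple


def extract_from_noise_code(noise_code: int, fixed_gain: float,
                            threshold: float) -> Tuple[Optional[int], int]:
    """Return (embedded data or None, code passed on to the decoder)."""
    # Same test as on the transmitting side: a low gain means the noise
    # code was replaced, so it is interpreted as embedded data.
    embedded = noise_code if fixed_gain < threshold else None
    # The code always goes to the decoder as received, so decoding
    # proceeds as usual whether or not data was embedded.
    return embedded, noise_code
```

Because the decision depends only on the received gain and a threshold known to both sides, no separate key has to be exchanged in this sketch.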
The embedding decision unit 65 has the structure shown in
By adopting the most significant bit (MSB) as a bit indicative of the type of data, data and a control code can be embedded in the remaining (M−1)-number of bits in a form distinguished from each other, as illustrated in
Upon receiving encoded voice code, the demultiplexer 61 demultiplexes the encoded voice code into element codes and inputs these to the data extraction unit 62. On the assumption that encoding has been performed in accordance with G.729, the demultiplexer 61 demultiplexes the encoded voice code into LSP code, pitch-lag code, noise code and gain code and inputs these to the data extraction unit 62. It should be noted that the gain code is the result of combining pitch gain and fixed codebook gain and quantizing (encoding) these using a quantization table.
Using the dequantized value of the gain code and the threshold value TH, the embedding decision unit 65 of the data extraction unit 62 determines whether data embedding conditions are satisfied. If data embedding conditions are satisfied, the assignment unit 66 regards the pitch-lag code as embedded data, inputs the embedded data to the data processor 63 and inputs the pitch-lag code to the decoder 64 in the form in which it was applied thereto. If the data embedding conditions are not satisfied, the assignment unit 66 inputs the pitch-lag code to the decoder 64 in the form in which it was applied thereto.
The embedding decision unit 65 has the structure shown in
The voice/audio CODEC (encoder) 51 encodes input voice in accordance with, e.g., G.729, and outputs the encoded voice code (encoded data) obtained. The encoded voice code is composed of a plurality of element codes. The embed data generator 52 generates two types of data sequences to be embedded in the encoded voice code. The first data sequence is one comprising only media data, for example, and the second data sequence is a data/control code sequence having the data-type bit illustrated in
The data embedding controller 53, which has the embedding decision unit 54 and the data embedding unit 55 constructed as a selector, embeds data in encoded voice code as appropriate. Using a first element code, which is from among element codes constituting the encoded voice code, and threshold values TH1, TH2 (TH2>TH1), the embedding decision unit 54 determines whether data embedding conditions are satisfied. If these conditions are satisfied, the embedding decision unit 54 then determines whether the embedding conditions satisfied concern a data sequence comprising only media data or a data/control code sequence having the data-type bit. For example, with G denoting the dequantized value of the first element code, the embedding decision unit 54 determines that embedding conditions concerning a data/control code sequence having the data-type bit are satisfied if the relation ① TH1≦G<TH2 holds, that embedding conditions concerning a data sequence comprising only media data are satisfied if the relation ② G<TH1 holds, and that the data embedding conditions are not satisfied if the relation ③ TH2<G holds.
If ① TH1≦G<TH2 holds, the data embedding unit 55 replaces a second element code with a data/control code sequence having the data-type bit, which is generated by the embed data generator 52, thereby embedding this data in the encoded voice code. If ② G<TH1 holds, the data embedding unit 55 replaces the second element code with a media data sequence, which is generated by the embed data generator 52, thereby embedding this data in the encoded voice code. If ③ TH2<G holds, the data embedding unit 55 outputs the second element code as is. The multiplexer 56 multiplexes and transmits the element codes that construct the encoded voice code.
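The two-threshold selection described above can be sketched as a small classifier followed by a selector. All names are hypothetical. One assumption is made loudly: the text does not say what happens when G equals TH2 exactly, and this sketch treats that boundary case as "no embedding".

```python
from enum import Enum


class EmbedMode(Enum):
    TAGGED = "data/control code sequence with data-type bit"  # TH1 <= G < TH2
    MEDIA_ONLY = "media data only"                            # G < TH1
    NONE = "no embedding"                                     # otherwise


def classify_gain(gain: float, th1: float, th2: float) -> EmbedMode:
    """Map the dequantized gain G onto one of the three cases."""
    if not th1 < th2:
        raise ValueError("expected TH1 < TH2")
    if gain < th1:
        return EmbedMode.MEDIA_ONLY
    if gain < th2:
        return EmbedMode.TAGGED
    # Assumption: G == TH2 is handled like TH2 < G (no embedding).
    return EmbedMode.NONE


def select_second_code(original: int, media_word: int, tagged_word: int,
                       gain: float, th1: float, th2: float) -> int:
    """Selector: choose what is output in place of the second element code."""
    mode = classify_gain(gain, th1, th2)
    if mode is EmbedMode.MEDIA_ONLY:
        return media_word
    if mode is EmbedMode.TAGGED:
        return tagged_word
    return original
```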
In a case where an encoder compliant with the G.729 encoding scheme is used as the encoder, the value conforming to the first element code is either fixed codebook gain or pitch gain, and the second element code is either a noise code or a pitch-lag code.
Upon receiving encoded voice code, the demultiplexer 61 demultiplexes the encoded voice code into element codes and inputs these to the data extraction unit 62. The latter extracts a data sequence or data/control code sequence from a first element code from among the demultiplexed element codes, inputs this data to a data processor 63 and applies each of the entered element codes to a voice/audio CODEC (decoder) 64 as is. The decoder 64 decodes the entered encoded voice code, reproduces voice and outputs the same.
The data extraction unit 62, which has an embedding decision unit 65 and an assignment unit 66, extracts a data sequence or a data/control code sequence from encoded voice code as appropriate. Using a value conforming to the first element code, which is a code from among element codes constituting the encoded voice code, and threshold values TH1, TH2 (TH2>TH1) shown in
If ① TH1≦G<TH2 holds, the assignment unit 66 regards the second element code as the data/control code sequence having the data-type bit, inputs this to the data processor 63 and then inputs the second element code to the decoder 64. If ② G<TH1 holds, the assignment unit 66 regards the second element code as a data sequence comprising media data, inputs this to the data processor 63 and then inputs the second element code to the decoder 64. If ③ TH2<G holds, the assignment unit 66 regards this as indicating that data has not been embedded in the second element code and inputs the second element code to the decoder 64.
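The extraction side of the two-threshold scheme can be sketched in the same spirit. The function name, the string labels and the meaning assigned to each value of the data-type bit (1 for a control code, 0 for data) are assumptions made for illustration; the text states only that the most significant bit distinguishes the two and that the remaining M−1 bits carry the content.

```python
from typing import Optional, Tuple


def extract_second_code(code: int, gain: float, th1: float, th2: float,
                        code_bits: int) -> Tuple[Optional[str], Optional[int]]:
    """Return (kind, payload) for one received second element code.

    kind is "media", "data", "control", or None when nothing is embedded.
    """
    if gain < th1:
        # Media-only sequence: all bits of the code are payload.
        return "media", code
    if gain < th2:
        # Tagged sequence: the MSB is the data-type bit, the rest is payload.
        type_bit = (code >> (code_bits - 1)) & 1
        payload = code & ((1 << (code_bits - 1)) - 1)
        # Assumption: 1 marks a control code, 0 marks data.
        return ("control" if type_bit else "data"), payload
    # Gain at or above TH2: the code is ordinary voice code.
    return None, None
```

In every case the received code would still be handed to the decoder unchanged; the sketch shows only what is passed to the data processor.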
In a case where encoded voice code that has been encoded in accordance with G.729 encoding is received, the value conforming to the first element code is fixed codebook gain or pitch gain, and the second element code is noise code or pitch-lag code.
The foregoing has been described for a case where the present invention is applied to a voice communication system that transmits voice from a transmitter having an encoder to a receiver having a decoder. However, the present invention is not limited to such a voice communication system but is applicable to other systems as well. For example, the present invention can be applied to a recording/playback system in which voice is encoded and recorded on a storage medium by a recording apparatus having an encoder, and voice is reproduced from the storage medium by a playback apparatus having a decoder.
(E) Digital Voice Communication System
(a) System for Implementing Image Transmission Service
The other party's terminal B 200 has a transmit processor 204 for receiving the encoded voice code data from the public network 300 and inputting this data to an extraction unit 205. The latter corresponds to the data extraction unit 62 illustrated in the embodiment of
It should be noted that control for embedding image data in encoded voice code data, transmitting the resultant data from the terminal B to the terminal A and outputting the image at terminal A also is executed in a manner similar to that described above.
In accordance with the digital voice communication system of
(b) System for Implementing Authentication Information Transmission Service
In accordance with the digital voice communication system of
(c) System for Implementing Key Information Transmission Service
The transmit processor 204 of the other party's terminal B 200 receives the encoded voice code data from the public network 300 and inputs this data to the extraction unit 205. In accordance with an embedding criterion identical with that of the embodiment of
(d) System for Implementing a Multipoint Access Service
A previously set relation address has been stored in an internal memory of the relation address input unit 131 in advance. This relation address may be an alternative IP telephone address or e-mail address of terminal A or an IP telephone number or an e-mail address of a facility other than terminal A or of another site. In accordance with an embedding criterion identical with that of the embodiment of
The transmit processor 204 of the other party's terminal B 200 receives the encoded voice code data from the public network 300 and inputs this data to the extraction unit 205. In accordance with an embedding criterion identical with that of the embodiment of
The display-key unit DPK displays the relation address that has been stored in the relation address storage unit 232. As a result, this relation address can be selected to telephone the address or transfer a mail to the address by a single click.
(e) System for Implementing Advertisement Information Embedding Service
The server 400 includes a bit-stream decomposing/generating unit 401 for extracting a transmit packet from a bit stream that enters from the terminal 100 on the transmitting side, specifying the sender and recipient from the IP header of this packet, specifying the media type and encoding scheme from the RTP header, determining whether advertisement-information insertion conditions are satisfied based upon these items of information and inputting encoded voice code data of the transmit packet to an embedding unit 402. In accordance with an embedding criterion identical with that of the embodiment of
The transmit processor 204 of the other party's terminal B 200 receives the encoded voice code data from the public network 300 and inputs this data to the extraction unit 205. In accordance with an embedding criterion identical with that of the embodiment of
When a bit stream is input thereto, the server 400 analyzes the header of a transmit packet and the encoded voice data (step 3001). More specifically, the server 400 extracts a transmit packet from the bit stream (step 3001a), extracts the transmit address and receive address from the IP header (step 3001b), determines whether the sender and recipient have concluded an advertising agreement (step 3001c) and, if such an agreement has been concluded, refers to the RTP header to identify the media type and CODEC type (step 3001d). For example, if the media type is voice and the CODEC type is G.729A (“YES” at step 3001e), then, in accordance with an embedding criterion identical with that of the embodiment of
If the server 400 subsequently determines that embedding is possible (“YES” at step 3002), the server embeds the advertisement information provided by the advertiser (the information provider) in the encoded voice code data (step 3003). If the server 400 determines that embedding is not possible (“NO” at step 3002), then the server transmits the advertisement information to the terminal on the receiving side (step 3004) without embedding it in the encoded voice code data. The server then repeats the above operation until transmission is completed (step 3005).
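The checks the server performs before embedding can be summarized in a short sketch. Everything named here is hypothetical: the `PacketInfo` fields stand in for values the server would read from the IP and RTP headers and from the voice code, and the gain test stands in for the embedding criterion of the earlier embodiments. The behaviour when no advertising agreement exists is assumed to be "do not embed".

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class PacketInfo:
    """Hypothetical summary of one transmit packet after analysis."""
    sender: str      # transmit address from the IP header (step 3001b)
    recipient: str   # receive address from the IP header (step 3001b)
    media_type: str  # media type from the RTP header (step 3001d)
    codec: str       # CODEC type from the RTP header (step 3001d)
    gain: float      # dequantized decision parameter of the voice code


def may_embed_advertisement(pkt: PacketInfo,
                            agreements: FrozenSet[Tuple[str, str]],
                            threshold: float) -> bool:
    """Decide whether advertisement information may be embedded."""
    # Step 3001c: sender and recipient must have an advertising agreement.
    if (pkt.sender, pkt.recipient) not in agreements:
        return False
    # Step 3001e: only voice encoded with G.729A is handled in this sketch.
    if pkt.media_type != "voice" or pkt.codec != "G.729A":
        return False
    # Step 3002: embedding is possible only when the gain criterion holds.
    return pkt.gain < threshold
```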
This embodiment has been described with regard to a case where advertisement information is embedded. However, the information is not limited to advertisement information; any information can be embedded. Further, it can be so arranged that by inserting an IP telephone address together with advertisement information, the destination of this IP telephone address can be telephoned to input detailed advertisement information and other detailed information by a single click.
In accordance with the digital voice communication system of
(f) Information Storage System
The side of the center 500 that is for storing digital voice includes an additional-information embedding unit 510 for embedding additional information in encoded voice code data that has been sent from the terminal A and storing the resultant data in a voice data storage unit 520, and an additional-data extraction unit 530 for extracting embedded information from prescribed encoded voice code data that has been read out of the voice data storage unit 520, displaying the extracted information on the display unit of a control panel 540 and inputting the encoded voice code data to a voice decoder 550. The latter decodes the entered encoded voice code data and outputs reproduced voice from a speaker 560.
The additional-information embedding unit 510 includes an additional-data generating unit 511 for encoding, and inputting to an embedding unit 512 as additional information, the sender name, recipient name, receive time and call category (classified by complaint, consultation and repair request, etc.) that enter from the control panel 540. In accordance with an embedding criterion identical with that of the embodiment of
The additional-data extraction unit 530 includes an extraction unit 531. In accordance with an embedding criterion identical with that of the embodiment of
Furthermore, when encoded voice code data is read out of the voice data storage unit 520, desired encoded voice code data can be retrieved and output using the embedded information. Specifically, a search keyword, e.g., the sender name, is input from the control panel 540, thereby instructing output of the voice file in which this sender name has been embedded. As a result, the extraction unit 531 retrieves the voice file in which the specified sender name has been embedded, outputs the embedded information, inputs the encoded voice code data to the voice decoder 550 and outputs decoded voice from the speaker.
In accordance with the embodiment of
Thus, in accordance with the present invention, data can be embedded in encoded voice code on the encoder side and extracted correctly on the decoder side without both the encoder and decoder sides possessing a key.
Further, in accordance with the present invention, there is almost no decline in sound quality even if data is embedded in encoded voice code, so that the embedding of data remains concealed from the listener of the reproduced voice.
Further, in accordance with the present invention, it is possible to embed and extract data if only an initial value of a threshold value is defined beforehand on both sending and receiving sides.
Further, in accordance with the present invention, if a control code is defined as embedded data, a threshold value can be changed using this control code and the amount of embedded data transmitted can be adjusted without transmitting additional information on another path.
Further, in accordance with the present invention, whether to embed only a data sequence, or whether to embed a data/control code sequence in a format that makes it possible to identify the type of data and control code, is decided in dependence upon a gain value. In a case where only a data sequence is embedded, therefore, it is unnecessary to include data-type information. This makes possible improvements relating to transmission capacity.
Further, in accordance with the present invention, it is possible to embed any data without changing the encoding format. In other words, an ID or other media information can be embedded in voice information and transmitted/stored without sacrificing the compatibility that is essential in communication/storage applications and without the user being aware. In addition, according to the present invention, control specifications are stipulated by parameters common to CELP. This means that the invention is not limited to a specific scheme and can be applied to a wide range of schemes. For example, G.729 suited to VoIP and AMR suited to mobile communications can be supported.
Further, in accordance with a digital voice communication system according to the present invention, it is so arranged that any code is embedded in a specific segment of a portion of compressed voice data at the transmitting end or along the way, and the embedded code is extracted from the specific segment by analyzing transmit voice data at the receiving end or along the way. As a result, additional information can be transmitted at the same time as voice using the ordinary voice transmission protocol as is. Further, since the additional information is embedded under the voice data, there is no auditory overlap, the additional information is not obtrusive and does not result in abnormal sounds. Further, multimedia communication becomes possible by adopting image information (video of present surroundings and map images, etc.) and personal information (a portrait photograph or voice print), etc., as the additional information. Further, by adopting a terminal serial number or voice print, etc., as the additional information, the performance of authentication as to whether or not an individual is an authorized user can be enhanced. Moreover, it is possible to improve the security of voice data.
Further, in accordance with the present invention, a server apparatus for relaying voice data is provided. As a result, optional information such as advertisement information can be provided to end users performing mutual communication of voice data.
Further, in accordance with the present invention, sender, recipient, receive time and call category, etc., are embedded in received voice data, which is then stored in storage means. This makes it possible to put voice data into file form so that subsequent utilization can be facilitated.
As many apparently widely different embodiments of the present invention can be made without departing from the spirit and scope thereof, it is to be understood that the invention is not limited to the specific embodiments thereof except as defined in the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2002-026958 | Feb 2002 | JP | national |
2003-015538 | Jan 2003 | JP | national |
This application is a continuation-in-part of our application Ser. No. 10/278,108 filed on Oct. 22, 2002 now abandoned, the disclosure of which is hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20030154073 A1 | Aug 2003 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 10278108 | Oct 2002 | US |
Child | 10357323 | US |