The present invention relates to a technology of embedded encoding/decoding of a sound signal having a plurality of channels and a sound signal having one channel.
As a technology of embedded encoding/decoding of a sound signal having a plurality of channels and a monaural sound signal, there is a technology of Non-Patent Literature 1. The summary of the technology of Non-Patent Literature 1 will be described with an encoding device 500 illustrated in
As a monaural encoding/decoding scheme by which a high-quality monaural decoded sound signal can be obtained, there is the 3GPP EVS standard encoding/decoding scheme described in Non-Patent Literature 2. By using a high-quality monaural encoding/decoding scheme like that of Non-Patent Literature 2 as the monaural encoding/decoding scheme of Non-Patent Literature 1, higher-quality embedded encoding/decoding of a sound signal having a plurality of channels and a monaural sound signal may be realized.
Non-Patent Literature 1: Jeroen Breebaart et al., “Parametric Coding of Stereo Audio”, EURASIP Journal on Applied Signal Processing, pp. 1305-1322, 2005:9.
Non-Patent Literature 2: 3GPP, “Codec for Enhanced Voice Services (EVS); Detailed algorithmic description”, TS 26.445.
The upmix processing of Non-Patent Literature 1 is signal processing in the frequency domain that includes processing of applying a window having overlap between adjacent frames to a monaural decoded sound signal. The monaural encoding/decoding scheme of Non-Patent Literature 2 also includes processing of applying a window having overlap between adjacent frames. That is, on both the decoding side of the stereo encoding/decoding scheme of Non-Patent Literature 1 and the decoding side of the monaural encoding/decoding scheme of Non-Patent Literature 2, the decoded sound signal for a predetermined range at a frame boundary is obtained by combining a signal obtained by applying a window with an attenuating slope to the signal decoded from the code of the preceding frame and a signal obtained by applying a window with an increasing slope to the signal decoded from the code of the following frame. Consequently, when a monaural encoding/decoding scheme like that of Non-Patent Literature 2 is used as the monaural encoding/decoding scheme for embedded encoding/decoding like that of Non-Patent Literature 1, the stereo decoded sound signal is delayed with respect to the monaural decoded sound signal by an amount corresponding to the window in the upmix processing; that is, the algorithmic delay of the stereo encoding/decoding is larger than that of the monaural encoding/decoding.
For example, a multipoint control unit (MCU) for multi-site conference calls commonly switches, for each predetermined time section, which site's signal is outputted to which site, and such control is difficult when the stereo decoded sound signal is delayed with respect to the monaural decoded sound signal by an amount corresponding to the window in the upmix processing. Thus, an implementation is expected to perform the control with the stereo decoded sound signal delayed by one frame with respect to the monaural decoded sound signal. That is, in a communication system that includes a multipoint control unit, the problem described above becomes more prominent, and the algorithmic delay of stereo encoding/decoding may become larger than the algorithmic delay of monaural encoding/decoding by one frame. Further, although delaying the stereo decoded sound signal by one frame with respect to the monaural decoded sound signal makes it possible to control the switching for each predetermined time section, the control of which site's monaural decoded sound signal and which site's stereo decoded sound signal are to be combined and outputted for each time section may become complicated because the monaural decoded sound signal and the stereo decoded sound signal have different delays.
The present invention has been made in view of such a problem, and its objective is to provide such embedded encoding/decoding of a sound signal having a plurality of channels and a monaural sound signal that the algorithmic delay of stereo encoding/decoding is not larger than the algorithmic delay of monaural encoding/decoding.
In order to solve the above problem, a sound signal decoding method as one aspect of the present invention is a sound signal decoding method for decoding an inputted code for each frame to obtain a decoded sound signal having C channels (C is an integer of 2 or larger), the sound signal decoding method comprising: as processing of a current frame, a monaural decoding step of decoding a monaural code included in the inputted code by a decoding scheme that includes processing of applying a window having overlap between frames to obtain a monaural decoded sound signal; an additional decoding step of decoding an additional code included in the inputted code to obtain an additional decoded signal, which is a monaural decoded signal for a section X, which is a section corresponding to the overlap between the current frame and an immediately following frame; and a stereo decoding step of obtaining a decoded downmix signal, which is a concatenation of a part of the monaural decoded sound signal for a section except the section X and the additional decoded signal for the section X, and obtaining and outputting the decoded sound signal having the C channels from the decoded downmix signal by upmix processing using a characteristic parameter obtained from a stereo code included in the inputted code.
According to the present invention, it is possible to provide such embedded encoding/decoding of a sound signal having a plurality of channels and a monaural sound signal that the algorithmic delay of stereo encoding/decoding is not larger than the algorithmic delay of monaural encoding/decoding.
Before describing each embodiment, each signal and algorithmic delay in encoding/decoding of the background art and a first embodiment will be described first, with reference to
Hereinafter, the section from t5 to t6, which is the overlap section between the current frame and the immediately following frame, will be called “section X”. That is, on the encoding side, the section X is a section for which the monaural encoding unit 120 encodes a monaural sound signal to which a window is applied, in both of the processing of the current frame and processing of the immediately following frame. More specifically, the section X is a section with a predetermined length of a sound signal that includes an end of the sound signal that the monaural encoding unit 120 encodes in the processing of the current frame, a section for which the monaural encoding unit 120 encodes a sound signal, to which the window in the attenuating shape is applied, in the processing of the current frame, a section with the predetermined length including the beginning of the section encoded by the monaural encoding unit 120 in the processing of the immediately following frame, and a section for which the monaural encoding unit 120 encodes a sound signal, to which the window in the increasing shape is applied, in the processing of the immediately following frame. Further, on the decoding side, the section X is a section for which the monaural decoding unit 210 decodes the monaural code CM to obtain a decoded sound signal to which a window is applied, in both of the processing of the current frame and the processing of the immediately following frame. 
More specifically, the section X is a section with the predetermined length of a decoded sound signal that the monaural decoding unit 210 obtains by decoding the monaural code CM in the processing of the current frame, that includes the end of the decoded sound signal, a section of the decoded sound signal that the monaural decoding unit 210 obtains by decoding the monaural code CM in the processing of the current frame, to which a window in the attenuating shape is applied, a section with the predetermined length of a decoded sound signal that the monaural decoding unit 210 obtains by decoding the monaural code CM in the processing of the immediately following frame, that includes the start of the decoded sound signal, a section of the decoded sound signal that the monaural decoding unit 210 obtains by decoding the monaural code CM in the processing of the immediately following frame, to which a window in the increasing shape is applied, and a section for which the monaural decoding unit 210 obtains a decoded sound signal by combining the decoded sound signal already obtained by decoding the monaural code CM in the processing of the current frame and the decoded sound signal obtained by decoding the monaural code CM in the processing of the immediately following frame, in the processing of the immediately following frame.
Further, hereinafter, the section t1 to t5, which is a section except the section X in a section for which monaural encoding/decoding is performed in the processing of the current frame, will be called “section Y”. That is, the section Y is, on the encoding side, a part of the section for which the monaural sound signal is encoded by the monaural encoding unit 120 in the processing of the current frame except the overlap section between the current frame and the immediately following frame, and on the decoding side, a part of the section for which the monaural code CM is decoded by the monaural decoding unit 210 to obtain a decoded sound signal in the processing of the current frame except the overlap section between the current frame and the immediately following frame. Since the section Y is a concatenation of a section where the monaural sound signal is represented by the monaural code CM of the current frame and the monaural code CM of the immediately previous frame and a section where a monaural sound signal is represented only by the monaural code CM of the current frame, the section Y is a section for which the monaural decoded sound signal can be completely obtained in processing up to the processing of the current frame.
An encoding device and a decoding device of the first embodiment will be described.
As shown in
From the two-channel stereo input sound signal inputted to the encoding device 100, the stereo encoding unit 110 obtains and outputs the stereo code CS representing a characteristic parameter, which is a parameter representing a characteristic of difference between the inputted sound signals of the two channels, and a downmix signal which is a signal obtained by mixing the sound signals of the two channels (step S111).
As an example of the stereo encoding unit 110, an operation of the stereo encoding unit 110 for each frame when taking information representing strength difference between the inputted sound signals of the two channels for each frequency band as the characteristic parameter will be described. Note that, though a specific example using a complex DFT (Discrete Fourier Transform) is described below, a well-known method for conversion to the frequency domain other than the complex DFT may be used. Note also that, when a sample sequence whose number of samples is not a power of two is converted into the frequency domain, a well-known technology, such as zero-padding the sample sequence so that the number of samples becomes a power of two, can be used.
First, the stereo encoding unit 110 performs complex DFT for each of the inputted sound signals of the two channels to obtain a complex DFT coefficient sequence (step S111-1). The complex DFT coefficient sequence is obtained by applying a window having overlap between frames and using processing in consideration of symmetry of complex numbers obtained by complex DFT. For example, when the sampling frequency is 32 kHz, the processing is performed each time sound signals of the two channels, each of which has 640 samples corresponding to 20 ms, are inputted. For each channel, it is enough to obtain, as the complex DFT coefficient sequence, a sequence of 372 complex numbers corresponding to the former half of the sequence of 744 complex numbers obtained by performing complex DFT on a digital sound signal sample sequence of 744 successive samples (in the case of the example described above, the sample sequence of the section from t1 to t6). These 744 samples include 104 samples overlapping with the sample group at the end of the immediately previous frame (in the case of the example described above, the samples of the section from t1 to t2) and 104 samples overlapping with the sample group at the beginning of the immediately following frame (in the case of the example described above, the samples of the section from t5 to t6). Hereinafter, “f” indicates each of integers from 1 to 372; each complex DFT coefficient of the complex DFT coefficient sequence of the first channel is indicated by V1(f); and each complex DFT coefficient of the complex DFT coefficient sequence of the second channel is indicated by V2(f). Next, from the complex DFT coefficient sequences of the two channels, the stereo encoding unit 110 obtains sequences of radiuses of the complex DFT coefficients on the complex plane (step S111-2). The radius of each complex DFT coefficient of each channel on the complex plane corresponds to strength of the sound signal of each channel for each frequency bin.
Hereinafter, the radius of the complex DFT coefficient V1(f) of the first channel on the complex plane is indicated by V1r(f), and the radius of the complex DFT coefficient V2(f) of the second channel on the complex plane is indicated by V2r(f). Next, the stereo encoding unit 110 obtains, for each frequency band, an average of the ratios between the radiuses of one channel and the radiuses of the other channel, and obtains the sequence of averages as the characteristic parameter (step S111-3). The sequence of averages is the characteristic parameter corresponding to the information representing the strength difference between the inputted sound signals of the two channels for each frequency band. For example, in the case of four bands in which “f” is from 1 to 93, from 94 to 186, from 187 to 279 and from 280 to 372, respectively, the stereo encoding unit 110 obtains 93 values for each of the four bands by dividing the radius V1r(f) of the first channel by the radius V2r(f) of the second channel, obtains the averages thereof as Mr(1), Mr(2), Mr(3) and Mr(4), and obtains the sequence of averages {Mr(1), Mr(2), Mr(3), Mr(4)} as the characteristic parameter.
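Steps S111-2 and S111-3 of the four-band example can be sketched as follows. This is only an illustrative sketch: the function name `band_ratio_averages`, the constant `BANDS` and the toy coefficient values are assumptions introduced here, and the sketch assumes that no radius V2r(f) is zero.

```python
def band_ratio_averages(v1, v2, band_edges):
    """Average the per-bin radius ratios V1r(f) / V2r(f) within each band.

    v1, v2     : complex DFT coefficients V1(f), V2(f) for f = 1..372
                 (stored 0-based, so V1(f) is v1[f - 1]).
    band_edges : list of (first_f, last_f) pairs, inclusive and 1-based.
    Assumes every V2r(f) is nonzero (hypothetical simplification).
    """
    params = []
    for first_f, last_f in band_edges:
        # abs() of a complex number is its radius on the complex plane.
        ratios = [abs(v1[f - 1]) / abs(v2[f - 1])
                  for f in range(first_f, last_f + 1)]
        params.append(sum(ratios) / len(ratios))
    return params

# Four bands of 93 bins each, as in the example above.
BANDS = [(1, 93), (94, 186), (187, 279), (280, 372)]

# Toy input: the first channel is twice as strong as the second in every bin.
v2 = [complex(1.0, 1.0)] * 372
v1 = [2 * c for c in v2]
mr = band_ratio_averages(v1, v2, BANDS)   # plays the role of {Mr(1), ..., Mr(4)}
```

With this toy input every band average comes out at approximately 2, reflecting the constant strength ratio between the channels.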
Note that the number of bands is only required to be a value equal to or smaller than the number of frequency bins, and the same number as the number of frequency bins or 1 may be used as the number of bands. In the case of using the same value as the number of frequency bins as the number of bands, the stereo encoding unit 110 can obtain, for each frequency bin, a value of a ratio between a radius of one channel and a radius of the other channel and obtain a sequence of the obtained values of ratios as the characteristic parameter. In the case of using 1 as the number of bands, the stereo encoding unit 110 can obtain, for each frequency bin, a value of a ratio between a radius of one channel and a radius of the other channel and obtain an average of the obtained ratio values over all the frequency bins as the characteristic parameter. Further, in the case of adopting multiple bands, the number of frequency bins to be included in each frequency band is arbitrary. For example, the number of frequency bins to be included in a low-frequency band may be smaller than the number of frequency bins to be included in a high-frequency band.
Further, the stereo encoding unit 110 may use difference between a radius of one channel and a radius of the other channel, instead of the ratio between a radius of one channel and a radius of the other channel. That is, in the case of the example described above, the stereo encoding unit 110 may use a value obtained by subtracting the radius V2r(f) of the second channel from the radius V1r(f) of the first channel, instead of the value obtained by dividing the radius V1r(f) of the first channel by the radius V2r(f) of the second channel.
Furthermore, the stereo encoding unit 110 obtains the stereo code CS which is a code representing the characteristic parameter (step S111-4). The stereo code CS which is a code representing the characteristic parameter can be obtained by a well-known method. For example, the stereo encoding unit 110 performs vector quantization of the value sequence obtained at step S111-3 to obtain a code, and outputs the obtained code as the stereo code CS. Alternatively, for example, the stereo encoding unit 110 performs scalar quantization of each of the values included in the value sequence obtained at step S111-3 to obtain codes, and outputs the obtained codes together as the stereo code CS. Note that, in a case where what is obtained at step S111-3 is one value, the stereo encoding unit 110 can output a code obtained by scalar quantization of the one value, as the stereo code CS.
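As one possible illustration of the scalar quantization alternative at step S111-4, the sketch below quantizes each ratio value on a uniform decibel grid and shows the matching scalar decoding. The decibel domain, the step size `STEP_DB`, the number of levels `NUM_LEVELS` and the function names are all assumptions made for illustration; the text above only states that a well-known method can be used.

```python
import math

STEP_DB = 1.5      # hypothetical quantizer step in dB
NUM_LEVELS = 32    # hypothetical number of levels (5 bits per band)

def encode_param(ratio):
    """Scalar-quantize one positive radius ratio to an integer index."""
    level_db = 20.0 * math.log10(ratio)
    index = round(level_db / STEP_DB) + NUM_LEVELS // 2
    # Clamp to the index range that the hypothetical code can represent.
    return min(max(index, 0), NUM_LEVELS - 1)

def decode_param(index):
    """Inverse mapping used on the decoding side."""
    level_db = (index - NUM_LEVELS // 2) * STEP_DB
    return 10.0 ** (level_db / 20.0)

# The codes for all bands together play the role of the stereo code CS.
stereo_code = [encode_param(m) for m in [2.0, 1.0, 0.5, 1.2]]
decoded = [decode_param(i) for i in stereo_code]
```

Within the representable range, the decoded ratio differs from the original by at most half a step (0.75 dB with the assumed step size).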
The stereo encoding unit 110 also obtains a downmix signal which is a signal obtained by mixing the sound signals of the two channels, that is, the first channel and the second channel (step S111-5). For example, in the processing of the current frame, the stereo encoding unit 110 obtains a downmix signal, which is a monaural signal obtained by mixing the sound signals of the two channels, for 20 ms from t3 to t7. The stereo encoding unit 110 may mix the sound signals of the two channels in the time domain like step S111-5A described later or may mix the sound signals of the two channels in the frequency domain like step S111-5B described later. In the case of mixing in the time domain, for example, the stereo encoding unit 110 obtains a sequence of averages of corresponding samples between the sample sequence of the sound signal of the first channel and the sample sequence of the sound signal of the second channel, as the downmix signal which is a monaural signal obtained by mixing the sound signals of the two channels (step S111-5A). In the case of mixing in the frequency domain, for example, the stereo encoding unit 110 obtains a complex DFT coefficient sequence by applying complex DFT to the sample sequence of the first channel sound signal, obtains a complex DFT coefficient sequence by applying complex DFT to the sample sequence of the second channel sound signal, obtains a radius average VMr(f) and an angle average VMθ(f) from the corresponding complex DFT coefficients of the two channels, and obtains a sample sequence by applying inverse complex DFT to the sequence of complex values VM(f) whose radius is VMr(f) and whose angle is VMθ(f) on the complex plane, as the downmix signal which is a monaural signal obtained by mixing the sound signals of the two channels (step S111-5B).
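The two mixing alternatives can be sketched as below. The function names are hypothetical, and the frequency-domain function handles a single frequency bin only; the forward and inverse DFT around it are omitted. The angle average is taken as the plain arithmetic mean of the two phases, which is an assumption: the text does not say how phase wrap-around is treated.

```python
import cmath

def downmix_time(ch1, ch2):
    """Step S111-5A: per-sample average of the two channels."""
    return [(a + b) / 2.0 for a, b in zip(ch1, ch2)]

def downmix_bin(v1, v2):
    """Step S111-5B for one bin: build VM(f) from the radius average
    VMr(f) and the angle average VMtheta(f) of V1(f) and V2(f)."""
    radius = (abs(v1) + abs(v2)) / 2.0
    angle = (cmath.phase(v1) + cmath.phase(v2)) / 2.0  # no wrap-around handling
    return cmath.rect(radius, angle)

mono = downmix_time([1.0, 0.0, -1.0], [0.0, 0.0, 1.0])
vm = downmix_bin(complex(2.0, 0.0), complex(0.0, 2.0))
```

In the second call both coefficients have radius 2 and their phases are 0 and π/2, so the mixed coefficient has radius 2 and phase π/4.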
Note that, as indicated by two-dot chain lines in
The downmix signal outputted by the stereo encoding unit 110 is inputted to the monaural encoding unit 120. When the encoding device 100 is provided with the downmix unit 150, the downmix signal outputted by the downmix unit 150 is inputted to the monaural encoding unit 120. The monaural encoding unit 120 encodes the downmix signal by a predetermined encoding scheme to obtain and output the monaural code CM (step S121). As the encoding scheme, an encoding scheme that includes processing of applying a window having overlap between frames, for example, like the 13.2 kbps mode of the 3GPP EVS standard (3GPP TS26.445) of Non-Patent Literature 2 is used. In the case of the example described above, in the processing of the current frame, the monaural encoding unit 120 encodes the signal 1b, which is a signal for the section from t1 to t6 obtained by applying a window in a shape of increasing in the section from t1 to t2 where the current frame and the immediately previous frame overlap, attenuating in the section from t5 to t6 where the current frame and the immediately following frame overlap and being flat in the section from t2 to t5 between the above sections, to the signal 1a which is the downmix signal, using the section from t6 to t7 of the signal 1a which is the “look-ahead section” for analysis processing, to obtain and output the monaural code CM.
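The frame geometry and the window shape described above can be checked with the following sketch. The sample counts follow from the 32 kHz example; the squared-sine ramp is an assumption chosen only so that the attenuating ramp of one frame and the increasing ramp of the next sum to one, and it is not claimed to be the window of the 3GPP EVS standard.

```python
import math

FS = 32000                      # sampling frequency in Hz
FRAME = FS * 20 // 1000         # 640 samples per 20 ms frame
OVERLAP = 104                   # 3.25 ms overlap (t1..t2 and t5..t6)
TOTAL = FRAME + OVERLAP         # 744 samples, section t1..t6 (23.25 ms)
FLAT = TOTAL - 2 * OVERLAP      # 536 samples, section t2..t5 (16.75 ms)

def make_window():
    """Increasing over t1..t2, flat over t2..t5, attenuating over t5..t6."""
    rise = [math.sin(math.pi * (n + 0.5) / (2 * OVERLAP)) ** 2
            for n in range(OVERLAP)]          # hypothetical ramp shape
    return rise + [1.0] * FLAT + rise[::-1]

window = make_window()
durations_ms = [1000.0 * n / FS for n in (OVERLAP, FLAT, TOTAL)]
```

Here `durations_ms` holds the lengths of the overlap, flat and total sections in milliseconds, matching the 3.25 ms, 16.75 ms and 23.25 ms figures of the example.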
Thus, when the encoding scheme used by the monaural encoding unit 120 includes processing of applying a window having overlap and analysis processing using a “look-ahead section”, not only the downmix signal outputted by the stereo encoding unit 110 or the downmix unit 150 in the processing of the current frame but also a downmix signal outputted by the stereo encoding unit 110 or the downmix unit 150 in frame processing in the past is also used in encoding processing. Therefore, the monaural encoding unit 120 can be provided with a storage not shown to store downmix signals inputted in frame processing in the past so that the monaural encoding unit 120 can process encoding of the current frame using a downmix signal stored in the storage, too. Alternatively, the stereo encoding unit 110 or the downmix unit 150 may be provided with a storage not shown so that the stereo encoding unit 110 or the downmix unit 150 may output a downmix signal to be used by the monaural encoding unit 120 in encoding processing of the current frame, including a downmix signal obtained in frame processing in the past, in the processing of the current frame, and the monaural encoding unit 120 may use the downmix signals inputted from the stereo encoding unit 110 or the downmix unit 150 in the processing of the current frame. Note that storing signals obtained in frame processing in the past in a storage not shown and using a signal in the processing of the current frame, like the above processing, are also performed by each unit described later when necessary. Since it is well-known processing in the technological field of encoding, description thereof will be omitted below in order to avoid redundancy.
The downmix signal outputted by the stereo encoding unit 110 is inputted to the additional encoding unit 130. When the encoding device 100 is provided with the downmix unit 150, the downmix signal outputted by the downmix unit 150 is inputted to the additional encoding unit 130. The additional encoding unit 130 encodes a part of the inputted downmix signal for the section X to obtain and output the additional code CA (step S131). In the case of the example described above, the additional encoding unit 130 encodes the signal 5c, which is a downmix signal for the section from t5 to t6, to obtain and output the additional code CA. For the encoding, an encoding scheme such as well-known scalar quantization or vector quantization can be used.
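A minimal sketch of step S131 with scalar quantization, together with the corresponding decoding, is shown below. The uniform step `STEP`, the function names and the test signal are assumptions for illustration; the sketch also assumes that the input list holds the 744 downmix samples of the section from t1 to t6, so that section X is its last 104 samples.

```python
OVERLAP = 104        # samples in section X (t5..t6) at 32 kHz
STEP = 1.0 / 128     # hypothetical uniform quantizer step

def encode_section_x(downmix_t1_t6):
    """Quantize only the section X part of the downmix signal."""
    section_x = downmix_t1_t6[-OVERLAP:]
    return [round(s / STEP) for s in section_x]   # plays the role of the additional code CA

def decode_section_x(additional_code):
    """Corresponding scalar decoding."""
    return [i * STEP for i in additional_code]

# Toy downmix signal: a slow sawtooth within [-0.5, 0.5).
downmix = [((n % 50) - 25) / 50.0 for n in range(744)]
code_ca = encode_section_x(downmix)
section_x_decoded = decode_section_x(code_ca)
```

The decoded samples differ from the original section X samples by at most half the assumed step.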
As shown in
The monaural code CM included among the codes inputted to the decoding device 200 is inputted to the monaural decoding unit 210. The monaural decoding unit 210 obtains and outputs the monaural decoded sound signal for the section Y by decoding the inputted monaural code CM by a predetermined decoding scheme (step S211). As the predetermined decoding scheme, a decoding scheme corresponding to the encoding scheme used by the monaural encoding unit 120 of the encoding device 100 is used. In the case of the example described above, the monaural decoding unit 210 decodes the monaural code CM of the current frame by the predetermined decoding scheme to obtain the signal 2a for the 23.25 ms section from t1 to t6, to which the window that increases in the 3.25 ms section from t1 to t2, is flat in the 16.75 ms section from t2 to t5 and attenuates in the 3.25 ms section from t5 to t6 is applied. By combining, for the section from t1 to t2, the signal 2b obtained from the monaural code CM of the immediately previous frame in the processing of the immediately previous frame and the signal 2a obtained from the monaural code CM of the current frame, and by using the signal 2a obtained from the monaural code CM of the current frame as it is for the section from t2 to t5, the monaural decoding unit 210 obtains and outputs the signal 2d, which is the monaural decoded sound signal for the 20 ms section from t1 to t5. Note that, since the signal 2a for the section from t5 to t6 obtained from the monaural code CM of the current frame is used as “the signal 2b obtained in the processing of the immediately previous frame” in the processing of the immediately following frame, the monaural decoding unit 210 stores the signal 2a for the section from t5 to t6 obtained from the monaural code CM of the current frame into a storage not shown in the monaural decoding unit 210.
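The combining and storing described for the monaural decoding unit 210 amounts to an overlap-add, which can be sketched as follows. The function name and the linear test ramp are assumptions; the windowed frame is assumed to be a list of 744 samples covering t1 to t6 (playing the role of the signal 2a), and `prev_tail` is the stored 104-sample part of the previous frame (playing the role of the signal 2b).

```python
OVERLAP = 104   # samples in t1..t2 and in t5..t6
FRAME = 640     # samples in t1..t5 (section Y)

def overlap_add(prev_tail, windowed_frame):
    """Return (section Y output, tail to store for the next frame)."""
    # t1..t2: add the stored tail of the previous frame to the current head.
    head = [p + c for p, c in zip(prev_tail, windowed_frame[:OVERLAP])]
    # t2..t5: use the current frame as it is.
    middle = windowed_frame[OVERLAP:FRAME]
    # t5..t6: keep for the processing of the immediately following frame.
    tail = windowed_frame[FRAME:]
    return head + middle, tail

# Toy check with a constant signal of 1.0 and a hypothetical linear ramp.
rise = [(n + 0.5) / OVERLAP for n in range(OVERLAP)]
frame = rise + [1.0] * (FRAME - OVERLAP) + rise[::-1]
_, stored = overlap_add([0.0] * OVERLAP, frame)   # first frame, no history
section_y, stored_next = overlap_add(stored, frame)
```

Because the two ramps sum to one, the second call reconstructs the constant signal over the whole of section Y.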
The additional code CA included among the codes inputted to the decoding device 200 is inputted to the additional decoding unit 230. The additional decoding unit 230 decodes the additional code CA to obtain and output an additional decoded signal which is the monaural decoded sound signal for the section X (step S231). For the decoding, a decoding scheme corresponding to the encoding scheme used by the additional encoding unit 130 is used. In the example described above, the additional decoding unit 230 decodes the additional code CA of the current frame to obtain and output the signal 4b which is the monaural decoded sound signal for the 3.25 ms section from t5 to t6.
The monaural decoded sound signal outputted by the monaural decoding unit 210, the additional decoded signal outputted by the additional decoding unit 230 and the stereo code CS included among the codes inputted to the decoding device 200 are inputted to the stereo decoding unit 220. From the inputted monaural decoded sound signal, additional decoded signal and stereo code CS, the stereo decoding unit 220 obtains and outputs a stereo decoded sound signal, which is a decoded sound signal having the two channels (step S221). More specifically, the stereo decoding unit 220 obtains a decoded downmix signal for a section Y+X which is a signal obtained by concatenating the monaural decoded sound signal for the section Y and the additional decoded signal for the section X (that is, a section obtained by concatenating the section Y and the section X) (step S221-1), and obtains and outputs the decoded sound signals of the two channels from the decoded downmix signal obtained at step S221-1 by upmix processing using the characteristic parameter obtained from the stereo code CS (step S221-2). The upmix processing is processing of obtaining the decoded sound signals of the two channels, regarding the decoded downmix signal as a signal obtained by mixing the decoded sound signals of the two channels and regarding the characteristic parameter obtained from the stereo code CS as information representing the characteristic of difference between the decoded sound signals of the two channels. The same goes for each embodiment described later. 
In the case of the example described above, first, the stereo decoding unit 220 obtains the decoded downmix signal for the 23.25 ms section from t1 to t6 (the section from t1 to t6 of the signal 4c) by concatenating the monaural decoded sound signal for the 20 ms section from t1 to t5 outputted by the monaural decoding unit 210 (the section from t1 to t5 of the signals 2d and 3a) and an additional decoded signal (the signal 4b) for the 3.25 ms section from t5 to t6 outputted by the additional decoding unit 230. Next, regarding the decoded downmix signal for the section from t1 to t6 as a signal obtained by mixing the decoded sound signals of the two channels and regarding the characteristic parameter obtained from the stereo code CS as information representing the characteristic of the difference between the decoded sound signals of the two channels, the stereo decoding unit 220 obtains and outputs the decoded sound signals of the two channels for the 20 ms section from t1 to t5 (signals 4h-1 and 4h-2).
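Step S221-1 of the example is a plain concatenation, sketched below with hypothetical names. The point of the sketch is that all 744 samples needed by the upmix processing are available within the processing of the current frame: 640 samples from the monaural decoding unit 210 and 104 samples from the additional decoding unit 230.

```python
SECTION_Y = 640   # samples for t1..t5 from the monaural decoding unit
SECTION_X = 104   # samples for t5..t6 from the additional decoding unit

def decoded_downmix(mono_section_y, additional_section_x):
    """Concatenate section Y and section X into the t1..t6 decoded downmix."""
    if len(mono_section_y) != SECTION_Y or len(additional_section_x) != SECTION_X:
        raise ValueError("unexpected section length")
    return list(mono_section_y) + list(additional_section_x)

# Toy signals standing in for the signals 2d and 4b.
signal_4c = decoded_downmix([0.1] * SECTION_Y, [0.2] * SECTION_X)
```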
Step S221-2 performed by the stereo decoding unit 220 when the characteristic parameter is information representing the strength difference between the sound signals of the two channels for each frequency band will be described as an example of step S221-2 performed by the stereo decoding unit 220. First, the stereo decoding unit 220 decodes the inputted stereo code CS to obtain the information representing the strength difference for each frequency band (S221-21). The stereo decoding unit 220 obtains the characteristic parameter from the stereo code CS by a scheme corresponding to the scheme by which the stereo encoding unit 110 of the encoding device 100 obtained the stereo code CS from the information representing the strength difference for each frequency band. For example, the stereo decoding unit 220 performs vector decoding of the inputted stereo code CS to obtain element values of a vector corresponding to the inputted stereo code CS as information representing strength differences for a plurality of frequency bands, respectively. Alternatively, for example, the stereo decoding unit 220 performs scalar decoding of each of codes included in the inputted stereo code CS to obtain the information representing the strength difference for each frequency band. Note that, in a case where the number of bands is one, the stereo decoding unit 220 performs scalar decoding of the inputted stereo code CS to obtain information representing the strength difference for the one frequency band, that is, for the whole band.
Next, regarding the decoded downmix signal as a signal obtained by mixing the decoded sound signals of the two channels and regarding the characteristic parameter as the information representing strength difference between the decoded sound signals of the two channels for each frequency band, the stereo decoding unit 220 obtains and outputs the decoded sound signals of the two channels from the decoded downmix signal obtained at step S221-1 and the characteristic parameter obtained at step S221-21 (step S221-22). When the stereo encoding unit 110 of the encoding device 100 operates as in the above-stated specific example using complex DFT, the stereo decoding unit 220 operates at step S221-22 as follows.
First, the stereo decoding unit 220 obtains the signal 4d by applying the window that increases in the 3.25 ms section from t1 to t2, is flat in the 16.75 ms section from t2 to t5 and attenuates in the 3.25 ms section from t5 to t6 to the decoded downmix signal with 744 samples for the 23.25 ms section from t1 to t6 (step S221-221). Next, the stereo decoding unit 220 obtains, as a complex DFT coefficient sequence (a monaural complex DFT coefficient sequence), a sequence of 372 complex numbers of the former half of the sequence of 744 complex numbers obtained by performing complex DFT on the signal 4d (step S221-222). Hereinafter, each complex DFT coefficient of the monaural complex DFT coefficient sequence obtained by the stereo decoding unit 220 is indicated by MQ(f). Next, the stereo decoding unit 220 obtains a radius MQr(f) of each complex DFT coefficient on the complex plane and an angle MQθ(f) of each complex DFT coefficient on the complex plane from the monaural complex DFT coefficient sequence (step S221-223). Next, the stereo decoding unit 220 obtains a value by multiplying each radius MQr(f) by a square root of a corresponding value in the characteristic parameter, as each radius VLQr(f) of the first channel, and obtains a value by dividing each radius MQr(f) by a square root of a corresponding value in the characteristic parameter, as each radius VRQr(f) of the second channel (step S221-224). In the case of the example of the four bands described above, the corresponding value in the characteristic parameter for each frequency bin is Mr(1) when “f” is 1 to 93, Mr(2) when “f” is 94 to 186, Mr(3) when “f” is 187 to 279 and Mr(4) when “f” is 280 to 372.
Note that, when the stereo encoding unit 110 of the encoding device 100 uses the difference between the radius of the first channel and the radius of the second channel instead of the ratio between the radius of the first channel and the radius of the second channel, the stereo decoding unit 220 can obtain a value by adding a value obtained by dividing a corresponding value in the characteristic parameter by 2 to each radius MQr(f) as each radius VLQr(f) of the first channel and obtain a value by subtracting the value obtained by dividing a corresponding value in the characteristic parameter by 2 from each radius MQr(f) as each radius VRQr(f) of the second channel. Next, the stereo decoding unit 220 performs inverse complex DFT to the sequence of such complex numbers that the radius and angle on the complex plane are VLQr(f) and MQθ (f), respectively, to obtain a decoded sound signal of the first channel to which a window is applied, with the 744 samples for the 23.25 ms section from t1 to t6 (the signal 4e-1), and performs inverse complex DFT to the sequence of such complex numbers that the radius and angle on the complex plane are VRQr(f) and MQθ (f), respectively, to obtain a decoded sound signal of the second channel to which a window is applied, with the 744 samples for the 23.25 ms section from t1 to t6 (the signal 4e-2) (step S221-225). The decoded sound signals of the channels obtained at step S221-225 (the signals 4e-1 and 4e-2) are signals to which the window in the shape of increasing in the 3.25 ms section from t1 to t2, being flat in the 16.75 ms section from t2 to t5 and attenuating in the 3.25 ms section from t5 to t6 is applied.
Next, for each of the first and second channels, the stereo decoding unit 220 obtains and outputs the decoded sound signals for the 20 ms section from t1 to t5 (the signals 4h-1 and 4h-2) by combining, for the section from t1 to t2, the signals obtained at step S221-225 for the immediately previous frame (the signals 4f-1 and 4f-2) and the signals obtained at step S221-225 for the current frame (the signals 4e-1 and 4e-2), and by using, for the section from t2 to t5, the signals obtained at step S221-225 for the current frame (the signals 4e-1 and 4e-2) as they are (step S221-226).
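The per-bin radius scaling of step S221-224 can be sketched as follows. This is a minimal illustration and not the implementation of the stereo decoding unit 220: the function names `band_index` and `upmix_radii`, the 0-based list layout and the sample parameter values are assumptions introduced here; only the split of the 372 bins into four 93-bin bands and the multiplication and division by the square root are taken from the description above.

```python
import math

NUM_BINS = 372        # former half of the 744-point complex DFT
BINS_PER_BAND = 93    # four equal bands: 1-93, 94-186, 187-279, 280-372


def band_index(f):
    """Map a 1-based frequency bin f (1..372) to a 0-based band index (0..3)."""
    if not 1 <= f <= NUM_BINS:
        raise ValueError("f must be in 1..372")
    return (f - 1) // BINS_PER_BAND


def upmix_radii(mq_r, mr):
    """Return (VLQr, VRQr) from the monaural radii MQr(f) and band values Mr(1..4).

    mq_r: list of 372 non-negative radii, mq_r[f - 1] = MQr(f)
    mr:   list of 4 positive band values, mr[b] = Mr(b + 1)
    """
    vl, vr = [], []
    for f in range(1, NUM_BINS + 1):
        g = math.sqrt(mr[band_index(f)])
        vl.append(mq_r[f - 1] * g)   # first channel: multiply by the square root
        vr.append(mq_r[f - 1] / g)   # second channel: divide by the square root
    return vl, vr


mq_r = [1.0] * NUM_BINS
mr = [4.0, 1.0, 0.25, 9.0]           # hypothetical characteristic parameter values
vl, vr = upmix_radii(mq_r, mr)
```

With this scaling, the ratio VLQr(f)/VRQr(f) in each bin equals the corresponding value in the characteristic parameter, while the angle MQθ (f) is shared by both channels.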
Difference between the downmix signal and a locally decoded signal of monaural encoding for the section Y, which is a time section for which the monaural decoding unit 210 can obtain a complete monaural decoded sound signal from the monaural code CM, may also be an encoding target of the additional encoding unit 130. This embodiment is regarded as a second embodiment, and points different from the first embodiment will be described.
In addition to encoding the downmix signal by the predetermined encoding scheme to obtain and output the monaural code CM, the monaural encoding unit 120 also obtains and outputs a signal obtained by decoding the monaural code CM, that is, a monaural locally decoded signal which is a locally decoded signal of the downmix signal for the section Y (step S122). In the case of the example described above, in addition to obtaining the monaural code CM of the current frame, the monaural encoding unit 120 obtains a locally decoded signal corresponding to the monaural code CM of the current frame, that is, a locally decoded signal to which the window in the shape of increasing in the 3.25 ms section from t1 to t2, being flat in the 16.75 ms section from t2 to t5 and attenuating in the 3.25 ms section from t5 to t6 is applied. By combining a locally decoded signal corresponding to the monaural code CM of the immediately previous frame and the locally decoded signal corresponding to the monaural code CM of the current frame, for the section from t1 to t2, and using the locally decoded signal corresponding to the monaural code CM of the current frame as it is, for the section from t2 to t5, the monaural encoding unit 120 obtains and outputs a locally decoded signal for the 20 ms section from t1 to t5. As the locally decoded signal corresponding to the monaural code CM of the immediately previous frame for the section from t1 to t2, a signal stored in the storage not shown in the monaural encoding unit 120 is used. 
Since a part of the locally decoded signal corresponding to the monaural code CM of the current frame for the section from t5 to t6 is used as “a locally decoded signal corresponding to the monaural code CM of an immediately previous frame” in the processing of the immediately following frame, the monaural encoding unit 120 stores the locally decoded signal for the section from t5 to t6 obtained from the monaural code CM of the current frame into the storage not shown in the monaural encoding unit 120.
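The frame-boundary handling described above can be sketched as follows. This is a hedged illustration under stated assumptions: "combining" is taken to mean sample-wise addition of the two windowed signals, the class name `LocalDecoderState` is hypothetical, and the sample counts follow from 744 samples per 23.25 ms (32 samples per millisecond), which gives 104 samples for each 3.25 ms slope and 536 samples for the 16.75 ms flat part.

```python
RISE = 104   # 3.25 ms at 32 samples per ms (t1 to t2)
FLAT = 536   # 16.75 ms (t2 to t5)
FALL = 104   # 3.25 ms (t5 to t6)
FRAME = RISE + FLAT + FALL   # 744 samples (t1 to t6)


class LocalDecoderState:
    """Keeps the t5-t6 part of the previous frame (the 'storage' in the text)."""

    def __init__(self):
        # Assumption: silence is stored before the first frame is processed.
        self.tail = [0.0] * FALL

    def combine(self, windowed):
        """Return the 640 complete samples for t1..t5 of the current frame.

        windowed: 744 samples decoded from the current frame's code, already
        shaped by the increasing / flat / attenuating window.
        """
        if len(windowed) != FRAME:
            raise ValueError("expected 744 samples")
        # t1..t2: add the stored tail of the previous frame to the current head
        head = [p + c for p, c in zip(self.tail, windowed[:RISE])]
        # t2..t5: use the signal of the current frame as it is
        body = list(windowed[RISE:RISE + FLAT])
        # t5..t6: store for the processing of the immediately following frame
        self.tail = list(windowed[RISE + FLAT:])
        return head + body
```

Each call yields 20 ms of complete signal (640 samples) and leaves the attenuating 3.25 ms part in the storage, which is why the section from t5 to t6 is not yet complete when the current frame is processed.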
In addition to the downmix signal, the monaural locally decoded signal outputted by the monaural encoding unit 120 is also inputted to the additional encoding unit 130 as shown by a dashed line in
First, the additional encoding unit 130 encodes the inputted downmix signal for the section X to obtain the first additional code CA1 (step S132-1; hereinafter also referred to as “first additional encoding”) and obtains a locally decoded signal for the section X corresponding to the first additional code CA1, that is, a locally decoded signal of the first additional encoding for the section X (step S132-2). For the first additional encoding, an encoding scheme such as well-known scalar quantization or vector quantization can be used. Next, the additional encoding unit 130 obtains a difference signal (a signal configured by subtraction between sample values of corresponding samples) between the inputted downmix signal for the section X and the locally decoded signal for the section X obtained at step S132-2 (step S132-3). Further, the additional encoding unit 130 obtains a difference signal (a signal configured by subtraction between sample values of corresponding samples) between the downmix signal and the monaural locally decoded signal for the section Y (step S132-4). Next, the additional encoding unit 130 encodes a signal obtained by concatenating the difference signal for the section Y obtained at step S132-4 and the difference signal for the section X obtained at step S132-3 to obtain the second additional code CA2 (step S132-5; hereinafter also referred to as “second additional encoding”). For the second additional encoding, an encoding scheme of collectively encoding a sample sequence obtained by concatenating a sample sequence of the difference signal for the section Y obtained at step S132-4 and a sample sequence of the difference signal for the section X obtained at step S132-3, for example, an encoding scheme using prediction in the time domain or an encoding scheme adapted to amplitude imbalance in the frequency domain is used.
Next, the additional encoding unit 130 outputs a concatenation of the first additional code CA1 obtained at step S132-1 and the second additional code CA2 obtained at step S132-5 as the additional code CA (step S132-6).
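Steps S132-1 to S132-6 can be sketched as follows. This is only an illustration under assumptions: both encodings are replaced here by plain uniform scalar quantizers with hypothetical step sizes `step1` and `step2`, whereas the text names time-domain prediction or a scheme adapted to amplitude imbalance in the frequency domain for the second additional encoding; the function names and the use of integer index lists as "codes" are also introduced for illustration.

```python
def quantize(x, step):
    """Uniform scalar quantizer: returns integer indices standing in for a code."""
    return [round(v / step) for v in x]


def dequantize(idx, step):
    """Inverse of quantize: reconstructs sample values from the indices."""
    return [i * step for i in idx]


def additional_encode(downmix_y, mono_local_y, downmix_x,
                      step1=0.5, step2=0.05):
    # S132-1: first additional encoding of the downmix signal for the section X
    ca1 = quantize(downmix_x, step1)
    # S132-2: locally decoded signal of the first additional encoding
    local_x = dequantize(ca1, step1)
    # S132-3: difference signal for the section X
    diff_x = [d - l for d, l in zip(downmix_x, local_x)]
    # S132-4: difference signal for the section Y
    diff_y = [d - m for d, m in zip(downmix_y, mono_local_y)]
    # S132-5: second additional encoding of the concatenation (Y, then X)
    ca2 = quantize(diff_y + diff_x, step2)
    # S132-6: the additional code CA is the pair (CA1, CA2)
    return ca1, ca2


ca1, ca2 = additional_encode([0.30, -0.20], [0.25, -0.10], [1.2, -0.7])
```

The coarse first stage captures the bulk of the downmix signal for the section X, so that the residual for the section X and the difference signal for the section Y have comparable magnitudes and can be encoded collectively.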
Note that the additional encoding unit 130 may target a weighted difference signal for encoding instead of the difference signal described above. That is, the additional encoding unit 130 may encode a weighted difference signal (a signal configured by weighted subtraction between sample values of corresponding samples) between the downmix signal and the monaural locally decoded signal for the section Y and the downmix signal for the section X to obtain and output the additional code CA. In the case of [Specific example 1 of additional encoding unit 130], the additional encoding unit 130 can obtain the weighted difference signal (a signal configured by weighted subtraction between sample values of corresponding samples) between the downmix signal and the monaural locally decoded signal for the section Y as the processing of step S132-4. Similarly, the additional encoding unit 130 may obtain the weighted difference signal (a signal configured by weighted subtraction between sample values of corresponding samples) between the inputted downmix signal for the section X and the locally decoded signal for the section X obtained at step S132-2 as the processing of step S132-3 of [Specific example 1 of additional encoding unit 130]. In these cases, the weight used for generation of each weighted difference signal can be encoded by a well-known encoding technology to obtain a code, and the obtained code (a code representing the weight) can be included in the additional code CA. Though the above is the same for each difference signal in each embodiment described later, it is well known in the technological field of encoding that a weighted difference signal is targeted by encoding instead of a difference signal and that, in that case, the weight is also encoded.
Therefore, in the embodiments described later, each individual description will be omitted to avoid redundancy, and only description in which a difference signal and a weighted difference signal are written together using “or” and description in which subtraction and weighted subtraction are written together using “or” will be made.
The additional decoding unit 230 decodes the additional code CA to obtain and output not only the additional decoded signal for the section X, which is an additional decoded signal obtained by the additional decoding unit 230 of the first embodiment but also an additional decoded signal for the section Y (step S232). For the decoding, a decoding scheme corresponding to the encoding scheme used by the additional encoding unit 130 at step S132 is used. That is, if the additional encoding unit 130 uses [Specific example 1 of additional encoding unit 130] at step S132, the additional decoding unit 230 performs the processing of [Specific example 1 of additional decoding unit 230] described below.
First, the additional decoding unit 230 decodes the first additional code CA1 included in the additional code CA to obtain a first decoded signal for the section X (step S232-1; hereinafter also referred to as “first additional decoding”). For the first additional decoding, a decoding scheme corresponding to the encoding scheme used for the first additional encoding by the additional encoding unit 130 is used. Further, the additional decoding unit 230 decodes the second additional code CA2 included in the additional code CA to obtain a second decoded signal for the sections Y and X (step S232-2; hereinafter also referred to as “second additional decoding”). For the second additional decoding, a decoding scheme corresponding to the encoding scheme used by the additional encoding unit 130 for the second additional encoding is used, that is, a decoding scheme that yields, from the code, a collective sample sequence which is a concatenation of a sample sequence of the additional decoded signal for the section Y and a sample sequence of the second decoded signal for the section X; for example, a decoding scheme using prediction in the time domain or a decoding scheme adapted to amplitude imbalance in the frequency domain. Next, the additional decoding unit 230 obtains a part of the second decoded signal obtained at step S232-2 for the section Y as the additional decoded signal for the section Y, obtains an addition signal (a signal configured by addition of sample values of corresponding samples) of the first decoded signal for the section X obtained at step S232-1 and a part of the second decoded signal for the section X obtained at step S232-2 as the additional decoded signal for the section X, and outputs the additional decoded signals for the sections Y and X (step S232-4).
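The decoding steps above can be sketched as follows. As an assumption made only for illustration, both codes are taken to be index lists produced by hypothetical uniform scalar quantizers with step sizes `step1` and `step2` on the encoding side; the actual second additional decoding may instead use time-domain prediction or a frequency-domain scheme, as stated in the text. The function names and the parameter `len_y` (number of samples in the section Y) are also hypothetical.

```python
def dequantize(idx, step):
    """Reconstruct sample values from uniform scalar quantizer indices."""
    return [i * step for i in idx]


def additional_decode(ca1, ca2, len_y, step1=0.5, step2=0.05):
    # S232-1: first additional decoding -> first decoded signal for the section X
    first_x = dequantize(ca1, step1)
    # S232-2: second additional decoding -> second decoded signal for Y and X
    second = dequantize(ca2, step2)
    second_y, second_x = second[:len_y], second[len_y:]
    if len(second_x) != len(first_x):
        raise ValueError("section X lengths of CA1 and CA2 do not match")
    # S232-4: the Y part is used as it is; the X part is added to the
    # first decoded signal for the section X
    add_y = second_y
    add_x = [a + b for a, b in zip(first_x, second_x)]
    return add_y, add_x


add_y, add_x = additional_decode([2, -1], [1, -2, 4, -4], len_y=2)
```

In this sketch the additional decoded signal for the section X approximates the downmix signal itself, whereas the additional decoded signal for the section Y approximates only the difference from the monaural locally decoded signal.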
Note that, when the additional encoding unit 130 targets not a difference signal but a weighted difference signal for encoding, a code representing a weight is also included in the additional code CA, and thus the additional decoding unit 230 can decode a code of the additional code CA except the code representing the weight to obtain and output an additional decoded signal, and decode the code representing the weight, which is included in the additional code CA, to obtain and output the weight at step S232 described above. In the case of [Specific example 1 of additional decoding unit 230], a code representing a weight for the section X, which is included in the additional code CA, is decoded to obtain the weight for the section X; and, at step S232-4, the additional decoding unit 230 can obtain the part of the second decoded signal obtained at step S232-2 for the section Y as the additional decoded signal for the section Y, obtain a weighted addition signal (a signal configured by weighted addition of sample values of corresponding samples) of the first decoded signal for the section X obtained at step S232-1 and the part of the second decoded signal obtained at step S232-2 for the section X as the additional decoded signal for the section X, and output the additional decoded signals for the sections Y and X and a weight for the section Y obtained by decoding a code representing the weight for the section Y, which is included in the additional code CA. Though the above is the same for addition of signals in each embodiment described later, it is well known in the technological field of encoding that weighted addition (generation of a weighted sum signal) is performed instead of addition (generation of a sum signal) and that, in that case, a weight is also obtained from a code. 
Therefore, in the embodiments described later, respective descriptions will be omitted to avoid redundancy, and only description in which addition and weighted addition are written together using “or” and description in which a sum signal and a weighted sum signal are written together using “or” will be made.
The stereo decoding unit 220 performs steps S222-1 and S222-2 below (step S222). Instead of step S221-1 performed by the stereo decoding unit 220 of the first embodiment, the stereo decoding unit 220 obtains, as a decoded downmix signal for the section Y+X, a signal obtained by concatenating a sum signal (a signal configured by addition of sample values of corresponding samples) of the monaural decoded sound signal for the section Y and the additional decoded signal for the section Y with the additional decoded signal for the section X (step S222-1). Then, instead of step S221-2, the stereo decoding unit 220 obtains and outputs the decoded sound signals of the two channels by the upmix processing using the characteristic parameter obtained from the stereo code CS, using the decoded downmix signal obtained at step S222-1 instead of the decoded downmix signal obtained at step S221-1 (step S222-2).
Note that, when the additional encoding unit 130 targets not a difference signal but a weighted difference signal for encoding, the stereo decoding unit 220 can obtain a signal concatenating a weighted sum signal (a signal configured by weighted addition of sample values of corresponding samples) of the monaural decoded sound signal for the section Y and the additional decoded signal for the section Y and the additional decoded signal for the section X, as the decoded downmix signal for the section Y+X at step S222-1. For generation of the weighted sum signal (weighted addition of sample values of corresponding samples) of the monaural decoded sound signal for the section Y and the additional decoded signal for the section Y, the weight for the section Y outputted by the additional decoding unit 230 can be used. Though the above is the same for addition of signals in each embodiment described later, it is well known in the technological field of encoding that weighted addition (generation of a weighted sum signal) is performed instead of addition (generation of a sum signal) and that, in that case, a weight is also obtained from a code as described in the description of the additional decoding unit 230. Therefore, in the embodiments described later, respective descriptions will be omitted to avoid redundancy, and only description in which addition and weighted addition are written together using “or” and description in which a sum signal and a weighted sum signal are written together using “or” will be made.
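The assembly of the decoded downmix signal at step S222-1 can be sketched as follows. This is a minimal sketch under assumptions: the function name `decoded_downmix` is hypothetical, and the weight is applied to the monaural decoded sound signal, which is only one possible form of weighted addition since the text does not specify where the weight enters. With the default weight of 1.0 the sketch reduces to the plain sum signal.

```python
def decoded_downmix(mono_dec_y, add_dec_y, add_dec_x, weight_y=1.0):
    """Build the decoded downmix signal for the section Y+X.

    Section Y: sum (or weighted sum) of the monaural decoded sound signal and
               the additional decoded signal.
    Section X: the additional decoded signal as it is.
    """
    if len(mono_dec_y) != len(add_dec_y):
        raise ValueError("section Y signals must have the same length")
    # Assumption: the weight scales the monaural decoded sound signal.
    sum_y = [weight_y * m + a for m, a in zip(mono_dec_y, add_dec_y)]
    # Concatenate the section Y part with the section X part.
    return sum_y + list(add_dec_x)


downmix = decoded_downmix([1.0, 2.0], [0.1, -0.1], [0.5])
```

In the example of the 23.25 ms section from t1 to t6, the section Y part would hold the 20 ms from t1 to t5 and the section X part the 3.25 ms from t5 to t6.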
According to the second embodiment, in addition to the algorithmic delay of stereo encoding/decoding being no larger than the algorithmic delay of monaural encoding/decoding, it is possible to cause the quality of a decoded downmix signal used for stereo decoding to be higher than in the first embodiment. Therefore, it is possible to cause the sound quality of a decoded sound signal of each channel obtained by stereo decoding to be high. That is, in the second embodiment, the monaural encoding processing by the monaural encoding unit 120 and the additional encoding processing by the additional encoding unit 130 are used as encoding processing for high-quality encoding of a downmix signal; the monaural code CM and the additional code CA are obtained as codes that represent the downmix signal favorably; and the monaural decoding processing by the monaural decoding unit 210 and the additional decoding processing by the additional decoding unit 230 are used as decoding processing for obtaining a high-quality decoded downmix signal. A code amount assigned to each of the monaural code CM and the additional code CA can be arbitrarily determined according to purposes. In the case of realizing higher-quality stereo encoding/decoding in addition to standard-quality monaural encoding/decoding, a larger code amount can be assigned to the additional code CA. That is, “monaural code” and “additional code” are mere convenient names from a viewpoint of stereo encoding/decoding. Since each of the monaural code CM and the additional code CA is a part of a code representing a downmix signal, one of the codes may be called “a first downmix code”, and the other may be called “a second downmix code”. When it is assumed that a larger code amount is assigned to the additional code CA, the additional code CA may be called “a downmix code”, “a downmix signal code” or the like.
The above is the same in a third embodiment and each embodiment based on the second embodiment described after the third embodiment.
There may be a case where it is possible for the stereo decoding unit 220 to obtain the decoded sound signals of the two channels with a higher sound quality by using a decoded downmix signal corresponding to a downmix signal obtained by mixing sound signals of the two channels in the frequency domain, and it is possible for the monaural decoding unit 210 to obtain a monaural decoded sound signal with a higher sound quality when the monaural encoding unit 120 encodes a signal obtained by mixing sound signals of the two channels in the time domain. In such a case, it is preferable that the stereo encoding unit 110 obtains the downmix signal by mixing the sound signals of the two channels inputted to the encoding device 100 in the frequency domain, that the monaural encoding unit 120 encodes a signal obtained by mixing the sound signals of the two channels inputted to the encoding device 100 in the time domain, and that the additional encoding unit 130 also encodes difference between the signal obtained by mixing the sound signals of the two channels in the frequency domain and the signal obtained by mixing the sound signals of the two channels in the time domain. This embodiment is regarded as a third embodiment, and points different from the second embodiment will be mainly described.
The stereo encoding unit 110 operates as described in the first embodiment similarly to the stereo encoding unit 110 of the second embodiment. However, as for the processing of obtaining the downmix signal which is the signal obtained by mixing the sound signals of the two channels, the processing is performed by processing of mixing the sound signals of the two channels in the frequency domain, for example, like step S111-5B (step S113). That is, the stereo encoding unit 110 obtains a downmix signal obtained by mixing the sound signals of the two channels in the frequency domain. For example, in the processing of the current frame, the stereo encoding unit 110 can obtain a downmix signal, which is a monaural signal obtained by mixing the sound signals of the two channels in the frequency domain, for the section from t1 to t6. When the encoding device 100 is also provided with the downmix unit 150, the stereo encoding unit 110 obtains and outputs the stereo code CS representing the characteristic parameter, which is a parameter representing the characteristic of the difference between the sound signals of the two channels inputted from the two-channel stereo input sound signal inputted to the encoding device 100 (step S113), and the downmix unit 150 obtains and outputs the downmix signal, which is a signal obtained by mixing the sound signals of the two channels in the frequency domain, from the two-channel stereo input sound signal inputted to the encoding device 100 (step S153).
The encoding device 100 of the third embodiment also includes a monaural encoding target signal generation unit 140 as shown by a one-dot chain line in
The monaural encoding target signal outputted by the monaural encoding target signal generation unit 140 is inputted to the monaural encoding unit 120 instead of the downmix signal outputted by the stereo encoding unit 110 or the downmix unit 150. The monaural encoding unit 120 encodes the monaural encoding target signal to obtain and output the monaural code CM (step S123). For example, in the processing of the current frame, the monaural encoding unit 120 encodes the signal for the section from t1 to t6 obtained by applying the window in the shape of increasing in the section from t1 to t2 where the current frame and the immediately previous frame overlap, attenuating in the section from t5 to t6 where the current frame and the immediately following frame overlap and being flat in the section from t2 to t5 between the above sections, to the monaural encoding target signal, using the section from t6 to t7 of the monaural encoding target signal which is “a look-ahead section” for analysis processing, to obtain and output the monaural code CM.
Similarly to the additional encoding unit 130 of the second embodiment, the additional encoding unit 130 encodes a difference signal or weighted difference signal (a signal configured by subtraction or weighted subtraction between sample values of corresponding samples) between the downmix signal and the monaural locally decoded signal for the section Y and the downmix signal for the section X to obtain and output the additional code CA (step S133). Here, the downmix signal for the section Y is a signal obtained by mixing the sound signals of the two channels in the frequency domain, and the monaural locally decoded signal for the section Y is a signal obtained by locally decoding a signal obtained by mixing the sound signals of the two channels in the time domain.
Note that, similarly to the additional encoding unit 130 of the second embodiment, the additional encoding unit 130 of the third embodiment can perform the first additional encoding that encodes the downmix signal for the section X to obtain a first additional code CA1 and the second additional encoding that encodes a signal obtained by concatenating the difference signal or weighted difference signal for the section Y and the difference signal or weighted difference signal between the downmix signal for the section X and the locally decoded signal of the first additional encoding to obtain a second additional code CA2, and obtain a concatenation of the first additional code CA1 and the second additional code CA2 as the additional code CA, as described in [Specific example 1 of additional encoding unit 130].
Similarly to the monaural decoding unit 210 of the second embodiment, the monaural decoding unit 210 obtains and outputs the monaural decoded sound signal for the section Y using the monaural code CM (step S213). However, the monaural decoded sound signal obtained by the monaural decoding unit 210 of the third embodiment is a decoded signal of a signal obtained by mixing sound signals of the two channels in the time domain.
Similarly to the additional decoding unit 230 of the second embodiment, the additional decoding unit 230 decodes the additional code CA to obtain and output additional decoded signals for the sections Y and X (step S233). However, the additional decoded signal for the section Y includes difference between a signal obtained by mixing sound signals of the two channels in the time domain and the monaural decoded sound signal, and difference between a signal obtained by mixing the sound signals of the two channels in the frequency domain and the signal obtained by mixing the sound signals of the two channels in the time domain.
The stereo decoding unit 220 performs steps S223-1 and S223-2 below (step S223). Similarly to the stereo decoding unit 220 of the second embodiment, the stereo decoding unit 220 obtains a signal obtained by concatenating a sum signal or weighted sum signal (a signal configured by addition or weighted addition of sample values of corresponding samples) of the monaural decoded sound signal for the section Y and an additional decoded signal for the section Y and an additional decoded signal for the section X, as the decoded downmix signal for the section Y+X (step S223-1), and obtains and outputs the decoded sound signals of the two channels from the decoded downmix signal obtained at step S223-1 by the upmix processing using the characteristic parameter obtained from the stereo code CS (step S223-2). However, the sum signal for the section Y includes the monaural decoded sound signal obtained by performing monaural encoding/decoding of the signal obtained by mixing the sound signals of the two channels in the time domain, difference between the signal obtained by mixing the sound signals of the two channels in the time domain and the monaural decoded sound signal, and difference between the signal obtained by mixing the sound signals of the two channels in the frequency domain and the signal obtained by mixing the sound signals of the two channels in the time domain.
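The three components of the sum signal for the section Y can be written out as the following idealized relation. The symbols are assumptions introduced here for illustration: m_T(n) is the signal obtained by mixing in the time domain, m_F(n) is the signal obtained by mixing in the frequency domain, and \hat{m}_T(n) is the monaural decoded sound signal. The relation holds for unweighted addition and ignores the coding error of the additional code itself.

```latex
\underbrace{\hat{m}_T(n)}_{\text{monaural decoded sound signal}}
+ \underbrace{\bigl(m_T(n) - \hat{m}_T(n)\bigr)}_{\text{time-domain mix minus monaural decoded signal}}
+ \underbrace{\bigl(m_F(n) - m_T(n)\bigr)}_{\text{frequency-domain mix minus time-domain mix}}
= m_F(n)
```

Under these assumptions the sum signal for the section Y approximates the downmix signal obtained by mixing in the frequency domain, which is the signal the upmix processing is assumed to work best with in this embodiment.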
As for the section X, though a correct locally decoded signal or decoded signal cannot be obtained by the monaural encoding unit 120 or the monaural decoding unit 210 without a signal and codes of the immediately following frame, an incomplete locally decoded signal or decoded signal can be obtained only with a signal and codes of frames up to the current frame. Therefore, each of the first to third embodiments may be changed so that, as for the section X, the additional encoding unit 130 encodes not the downmix signal itself but difference between the downmix signal and a monaural locally decoded signal obtained from the signal of the frames up to the current frame. These embodiments will be described as fourth embodiments.
First, for a fourth embodiment A, which is a fourth embodiment obtained by changing the second embodiment, description will be made mainly on points different from the second embodiment.
Similarly to the monaural encoding unit 120 of the second embodiment, the downmix signal outputted by the stereo encoding unit 110 or the downmix unit 150 is inputted to the monaural encoding unit 120. The monaural encoding unit 120 obtains and outputs the monaural code CM obtained by encoding the downmix signal, and a signal obtained by decoding the monaural code CM of the frames up to the current frame, that is, a monaural locally decoded signal, which is a locally decoded signal of the downmix signal for the section Y+X (step S124). More specifically, in addition to obtaining the monaural code CM of the current frame, the monaural encoding unit 120 obtains a locally decoded signal corresponding to the monaural code CM of the current frame, that is, a locally decoded signal to which the window in the shape of increasing in the 3.25 ms section from t1 to t2, being flat in the 16.75 ms section from t2 to t5 and attenuating in the 3.25 ms section from t5 to t6 is applied; and, by combining a locally decoded signal corresponding to the monaural code CM of the immediately previous frame and the locally decoded signal corresponding to the monaural code CM of the current frame, for the section from t1 to t2, and using the locally decoded signal corresponding to the monaural code CM of the current frame as it is, for the section from t2 to t6, the monaural encoding unit 120 obtains and outputs a locally decoded signal for the section of 23.25 ms from t1 to t6. However, a locally decoded signal for the section from t5 to t6 is a locally decoded signal that becomes a complete locally decoded signal by being combined with a locally decoded signal to which the window in the increasing shape is applied, the locally decoded signal being obtained in processing of the immediately following frame, and is an incomplete locally decoded signal to which the window in the attenuating shape is applied.
Similarly to the additional encoding unit 130 of the second embodiment, the downmix signal outputted by the stereo encoding unit 110 or the downmix unit 150, and the monaural locally decoded signal outputted by the monaural encoding unit 120 are inputted to the additional encoding unit 130. The additional encoding unit 130 encodes a difference signal or weighted difference signal (a signal configured by subtraction or weighted subtraction between sample values of corresponding samples) between the downmix signal and the monaural locally decoded signal for the section Y+X to obtain and output the additional code CA (step S134).
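One way to see why this can reduce the signal that the additional code has to represent in the section X is the following idealized relation. It is only a sketch under assumptions that are not stated in the text: the monaural encoding/decoding is treated as lossless, w(n) denotes the value of the attenuating window at sample n with 0 \le w(n) \le 1, m(n) denotes the downmix signal, \hat{m}_X(n) denotes the incomplete monaural locally decoded signal for the section X, and d_X(n) denotes the unweighted difference signal for the section X.

```latex
\hat{m}_X(n) = w(n)\, m(n)
\quad\Longrightarrow\quad
d_X(n) = m(n) - \hat{m}_X(n) = \bigl(1 - w(n)\bigr)\, m(n),
\qquad \lvert d_X(n) \rvert \le \lvert m(n) \rvert .
```

Under these assumptions the difference signal is small near t5, where the window is still close to 1, and approaches the downmix signal itself only near t6.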
Similarly to the monaural decoding unit 210 of the second embodiment, the monaural code CM is inputted to the monaural decoding unit 210. The monaural decoding unit 210 obtains and outputs the monaural decoded sound signal for the section Y+X using the monaural code CM (step S214). However, the decoded signal for the section X, that is, the section from t5 to t6 is a decoded signal that becomes a complete decoded signal by being combined with a decoded signal to which the window in the increasing shape is applied, the decoded signal being obtained in processing of the immediately following frame, and is an incomplete decoded signal to which the window in the attenuating shape is applied.
Similarly to the additional decoding unit 230 of the second embodiment, the additional code CA is inputted to the additional decoding unit 230. The additional decoding unit 230 decodes the additional code CA to obtain and output an additional decoded signal for the section Y+X (step S234).
Similarly to the stereo decoding unit 220 of the second embodiment, the monaural decoded sound signal outputted by the monaural decoding unit 210, the additional decoded signal outputted by the additional decoding unit 230 and the stereo code CS inputted to the decoding device 200 are inputted to the stereo decoding unit 220. The stereo decoding unit 220 obtains a sum signal or weighted sum signal (a signal configured by addition or weighted addition of sample values of corresponding samples) of the monaural decoded sound signal and the additional decoded signal for the section Y+X as the decoded downmix signal, and obtains and outputs the decoded sound signals of the two channels from the decoded downmix signal by the upmix processing using the characteristic parameter obtained from the stereo code CS (step S224).
Note that a fourth embodiment B, which is a fourth embodiment obtained by changing the third embodiment, is described, mainly on points different from the third embodiment, by replacing “the downmix signal outputted by the stereo encoding unit 110 or the downmix unit 150” and “the downmix signal” in the description of the monaural encoding unit 120 of the fourth embodiment A with “the monaural encoding target signal outputted by the monaural encoding target signal generation unit 140” and “the monaural encoding target signal”, respectively.
Further, a fourth embodiment C, which is a fourth embodiment obtained by changing the first embodiment, is obtained by assuming, in the description of the fourth embodiment A, that each of the monaural locally decoded signal obtained by the monaural encoding unit 120, the difference signal or weighted difference signal encoded by the additional encoding unit 130 and the additional decoded signal obtained by the additional decoding unit 230 is the signal for the section X, and that the stereo decoding unit 220 obtains, as the decoded downmix signal, a signal that is a concatenation of the monaural decoded sound signal for the section Y and a sum signal or weighted sum signal of the monaural decoded sound signal and the additional decoded signal for the section X.
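The decoded downmix signal of the fourth embodiment C can be sketched as below. The sketch assumes the signals are lists of sample values already split into the section Y and the section X; the function name and the default weights are hypothetical.

```python
def decoded_downmix_4c(mono_decoded_y, mono_decoded_x, additional_decoded_x,
                       w_mono=1.0, w_add=1.0):
    """Concatenate the monaural decoded signal for the section Y with the
    sum (or weighted sum) of the monaural decoded signal and the
    additional decoded signal for the section X."""
    if len(mono_decoded_x) != len(additional_decoded_x):
        raise ValueError("section X signals must have the same length")
    section_x = [w_mono * m + w_add * a
                 for m, a in zip(mono_decoded_x, additional_decoded_x)]
    return list(mono_decoded_y) + section_x
```

In this variant the additional decoded signal only corrects the section X, so the section Y part of the result is the monaural decoded sound signal unchanged.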
The downmix signal for the section X includes a part that can be predicted from the monaural locally decoded signal for the section Y. Therefore, for the section X, the difference between the downmix signal and a predicted signal from the monaural locally decoded signal for the section Y may be encoded by the additional encoding unit 130 in each of the first to fourth embodiments. These embodiments will be described as fifth embodiments.
First, the fifth embodiments obtained by changing the second embodiment, the third embodiment, the fourth embodiment A and the fourth embodiment B, respectively, are collectively referred to as a fifth embodiment A, and points different from each of the second embodiment, the third embodiment, the fourth embodiment A and the fourth embodiment B will be described.
The additional encoding unit 130 performs steps S135A-1 and S135A-2 below (step S135A). First, the additional encoding unit 130 obtains a predicted signal for the monaural locally decoded signal for the section X from the inputted monaural locally decoded signal for the section Y or the section Y+X (however, for the section X, an incomplete monaural locally decoded signal as described above) using a predetermined well-known prediction technology (step S135A-1). Note that, in the case of the fifth embodiment obtained by changing the fourth embodiment A or the fifth embodiment obtained by changing the fourth embodiment B, the inputted incomplete monaural locally decoded signal for the section X is included in the predicted signal for the section X. Next, the additional encoding unit 130 encodes the difference signal or weighted difference signal (a signal configured by subtraction or weighted subtraction between sample values of corresponding samples) between the downmix signal and the monaural locally decoded signal for the section Y and a difference signal or weighted difference signal (a signal configured by subtraction or weighted subtraction between sample values of corresponding samples) between the downmix signal for the section X and the predicted signal obtained at step S135A-1 to obtain and output the additional code CA (step S135A-2). For example, a signal obtained by concatenating the difference signal for the section Y and the difference signal for the section X may be encoded to obtain the additional code CA. Further, for example, each of the difference signal for the section Y and the difference signal for the section X may be encoded to obtain a code, and a concatenation of the obtained codes may be obtained as the additional code CA. For the encoding, an encoding scheme similar to that of the additional encoding unit 130 of each of the second embodiment, the third embodiment, the fourth embodiment A and the fourth embodiment B can be used.
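Steps S135A-1 and S135A-2 can be sketched as follows for the case in which the difference signals for the section Y and the section X are concatenated before encoding. The prediction technology is not specified in the description, so the predictor below, which simply holds the last sample of the section Y, is a hypothetical stand-in; the function names are also hypothetical, and the actual encoding of the resulting signal into the additional code CA is not shown.

```python
def predict_section_x(local_decoded_y, num_samples_x):
    """Hypothetical predictor: hold the last sample of the section Y."""
    return [local_decoded_y[-1]] * num_samples_x


def residual_5a(downmix_y, downmix_x, local_decoded_y):
    """Build the signal to be encoded by the additional encoding unit."""
    # Step S135A-1: predicted signal for the section X.
    predicted_x = predict_section_x(local_decoded_y, len(downmix_x))
    # Step S135A-2 (before encoding): difference signals per section.
    diff_y = [d - m for d, m in zip(downmix_y, local_decoded_y)]
    diff_x = [d - p for d, p in zip(downmix_x, predicted_x)]
    return diff_y + diff_x
```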
The stereo decoding unit 220 performs steps S225A-0 to S225A-2 below (step S225A). First, the stereo decoding unit 220 obtains the predicted signal for the section X from the monaural decoded sound signal for the section Y or the section Y+X, using the same prediction technology used by the additional encoding unit 130 at step S135A-1 (step S225A-0). Next, the stereo decoding unit 220 obtains a signal obtained by concatenating a sum signal or weighted sum signal (a signal configured by addition or weighted addition of sample values of corresponding samples) of the monaural decoded sound signal and the additional decoded signal for the section Y and a sum signal or weighted sum signal (a signal configured by addition or weighted addition of sample values of corresponding samples) of the additional decoded signal and the predicted signal for the section X as the decoded downmix signal for the section Y+X (step S225A-1). Next, the stereo decoding unit 220 obtains and outputs the decoded sound signals of the two channels from the decoded downmix signal obtained at step S225A-1, by the upmix processing using the characteristic parameter obtained from the stereo code CS (step S225A-2).
Next, the fifth embodiments obtained by changing the first embodiment and the fourth embodiment C, respectively, are collectively referred to as a fifth embodiment B, and points different from each of the first embodiment and the fourth embodiment C will be described.
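Steps S225A-0 and S225A-1 can be sketched as follows. The sketch assumes that the additional decoded signal is a list covering the section Y followed by the section X, and it uses a hypothetical hold-last-sample predictor in place of the unspecified prediction technology; whatever predictor is used must be the same on the encoding side. The upmix processing of step S225A-2 is not shown.

```python
def predict_section_x(mono_decoded_y, num_samples_x):
    """Hypothetical predictor: hold the last sample of the section Y."""
    return [mono_decoded_y[-1]] * num_samples_x


def decoded_downmix_5a(mono_decoded_y, additional_decoded):
    """Decoded downmix signal for the section Y+X."""
    num_y = len(mono_decoded_y)
    add_y = additional_decoded[:num_y]
    add_x = additional_decoded[num_y:]
    # Step S225A-0: predicted signal for the section X.
    predicted_x = predict_section_x(mono_decoded_y, len(add_x))
    # Step S225A-1: sum signals per section, then concatenation.
    section_y = [m + a for m, a in zip(mono_decoded_y, add_y)]
    section_x = [p + a for p, a in zip(predicted_x, add_x)]
    return section_y + section_x
```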
In addition to the monaural code CM obtained by encoding the downmix signal, the monaural encoding unit 120 also obtains and outputs, for the section Y or the section Y+X, a signal obtained by decoding the monaural code CM of the frames up to the current frame, that is, a monaural locally decoded signal which is a locally decoded signal of the inputted downmix signal (step S125B). However, as described above, the monaural locally decoded signal for the section X is an incomplete monaural locally decoded signal.
The additional encoding unit 130 performs steps S135B-1 and S135B-2 below (step S135B). First, the additional encoding unit 130 obtains the predicted signal for the monaural locally decoded signal for the section X from the inputted monaural locally decoded signal for the section Y or the section Y+X (however, for the section X, an incomplete monaural locally decoded signal as described above) using the predetermined well-known prediction technology (step S135B-1). In the case of the fifth embodiment obtained by changing the fourth embodiment C, the inputted monaural locally decoded signal for the section X is included in the predicted signal for the section X. Next, the additional encoding unit 130 encodes the difference signal or weighted difference signal (a signal configured by subtraction or weighted subtraction between sample values of corresponding samples) between the downmix signal for the section X and the predicted signal obtained at step S135B-1 to obtain and output the additional code CA (step S135B-2). For the encoding, for example, an encoding scheme similar to that of the additional encoding unit 130 of each of the first embodiment and the fourth embodiment C can be used.
The stereo decoding unit 220 performs steps S225B-0 to S225B-2 below (step S225B). First, the stereo decoding unit 220 obtains the predicted signal for the section X from the monaural decoded sound signal for the section Y or the section Y+X, using the same prediction technology used by the additional encoding unit 130 (step S225B-0). Next, the stereo decoding unit 220 obtains a signal that is a concatenation of the monaural decoded sound signal for the section Y and a sum signal or weighted sum signal (a signal configured by addition or weighted addition of sample values of corresponding samples) of the additional decoded signal and the predicted signal for the section X as the decoded downmix signal for the section Y+X (step S225B-1). Next, the stereo decoding unit 220 obtains and outputs the decoded sound signals of the two channels from the decoded downmix signal obtained at step S225B-1, by the upmix processing using the characteristic parameter obtained from the stereo code CS (step S225B-2).
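The fifth embodiment B can be sketched end to end as below, under the simplifying assumptions that the additional code CA is decoded without loss and that the monaural decoded sound signal equals the monaural locally decoded signal. The hold-last-sample predictor and all function names are hypothetical; the upmix processing of step S225B-2 is not shown.

```python
def predict_section_x(decoded_y, num_samples_x):
    """Hypothetical predictor: hold the last sample of the section Y."""
    return [decoded_y[-1]] * num_samples_x


def residual_5b(downmix_x, local_decoded_y):
    """Encoder side (steps S135B-1 and S135B-2, before encoding):
    difference between the downmix and the prediction for the section X."""
    predicted_x = predict_section_x(local_decoded_y, len(downmix_x))
    return [d - p for d, p in zip(downmix_x, predicted_x)]


def decoded_downmix_5b(mono_decoded_y, additional_decoded_x):
    """Decoder side (steps S225B-0 and S225B-1): the section Y is taken
    from the monaural decoded signal as is, and only the section X is
    rebuilt from the prediction plus the additional decoded signal."""
    predicted_x = predict_section_x(mono_decoded_y, len(additional_decoded_x))
    section_x = [p + a for p, a in zip(predicted_x, additional_decoded_x)]
    return list(mono_decoded_y) + section_x
```

Unlike the fifth embodiment A, the additional code here carries information only for the section X, so any coding error of the monaural code in the section Y remains in the decoded downmix signal.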
In the first to fifth embodiments, by the decoding device 200 using the additional code CA obtained by the encoding device 100, the decoded downmix signal for the section X used by the stereo decoding unit 220 is obtained at least by decoding the additional code CA. However, without the decoding device 200 using the additional code CA, a predicted signal from the monaural decoded sound signal for the section Y may be used as the decoded downmix signal for the section X used by the stereo decoding unit 220. This embodiment is regarded as a sixth embodiment, and points different from the first embodiment will be described.
The encoding device 100 of the sixth embodiment is different from the encoding device 100 of the first embodiment in that the encoding device 100 of the sixth embodiment does not include the additional encoding unit 130, does not encode the downmix signal for the section X, and does not obtain the additional code CA. That is, the encoding device 100 of the sixth embodiment includes the stereo encoding unit 110 and the monaural encoding unit 120, and the stereo encoding unit 110 and the monaural encoding unit 120 operate as the stereo encoding unit 110 and the monaural encoding unit 120 of the first embodiment, respectively.
The decoding device 200 of the sixth embodiment does not include the additional decoding unit 230 for decoding the additional code CA but includes the monaural decoding unit 210 and the stereo decoding unit 220. Though the monaural decoding unit 210 of the sixth embodiment operates as the monaural decoding unit 210 of the first embodiment, the monaural decoding unit 210 also outputs the monaural decoded sound signal for the section X when the stereo decoding unit 220 uses the monaural decoded sound signal for the section Y+X. Further, the stereo decoding unit 220 of the sixth embodiment operates as described below, differently from the stereo decoding unit 220 of the first embodiment.
The stereo decoding unit 220 performs steps S226-0 to S226-2 below (step S226). First, the stereo decoding unit 220 obtains the predicted signal for the section X from the monaural decoded sound signal for the section Y or the section Y+X, using a predetermined well-known prediction technology similar to that of the fifth embodiment (step S226-0). Next, the stereo decoding unit 220 obtains a signal that is a concatenation of the monaural decoded sound signal for the section Y and the predicted signal for the section X, as the decoded downmix signal for the section Y+X (step S226-1), and obtains and outputs the decoded sound signals of the two channels from the decoded downmix signal obtained at step S226-1 by the upmix processing using the characteristic parameter obtained from the stereo code CS (step S226-2).
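Steps S226-0 and S226-1 can be sketched as follows. Because the prediction technology is not specified, the sketch uses a hypothetical linear extrapolation from the last two samples of the section Y as one possible example; the function names are hypothetical, and the upmix processing of step S226-2 is not shown.

```python
def predict_section_x(mono_decoded_y, num_samples_x):
    """Hypothetical predictor: extend the section Y along the slope of
    its last two samples (falls back to holding the last sample when
    only one sample is available)."""
    last = mono_decoded_y[-1]
    slope = last - mono_decoded_y[-2] if len(mono_decoded_y) > 1 else 0.0
    return [last + slope * (n + 1) for n in range(num_samples_x)]


def decoded_downmix_6(mono_decoded_y, num_samples_x):
    """Decoded downmix signal for the section Y+X without any
    additional code: the section X is filled by prediction alone."""
    predicted_x = predict_section_x(mono_decoded_y, num_samples_x)  # S226-0
    return list(mono_decoded_y) + predicted_x                       # S226-1
```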
In each embodiment described above, description has been made on an example in which sound signals of two channels are handled to simplify the description. However, the number of channels is not limited thereto and is only required to be two or more. When the number of channels is indicated by C (C is an integer of 2 or larger), each embodiment described above can be implemented with “two channels” replaced with “C channels”.
For example, the encoding device 100 of the first to fifth embodiments can obtain the stereo code CS, the monaural code CM and the additional code CA from inputted sound signals of the C channels; the encoding device 100 of the sixth embodiment can obtain the stereo code CS and the monaural code CM from the inputted sound signals of the C channels; the stereo encoding unit 110 can obtain and output a code representing information corresponding to difference between channels of the inputted sound signals of the C channels, as the stereo code CS; the stereo encoding unit 110 or the downmix unit 150 can obtain and output a signal obtained by mixing the inputted sound signals of the C channels as the downmix signal; and the monaural encoding target signal generation unit 140 can obtain and output a signal obtained by mixing the inputted sound signals of the C channels in the time domain, as the monaural encoding target signal. For example, the information corresponding to the difference between channels of the sound signals of the C channels is, for each of C-1 channels other than a reference channel, information corresponding to difference between the sound signal of the channel and the sound signal of the reference channel.
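The C-channel case on the encoding side can be sketched as follows. The sketch assumes, for illustration only, that mixing is a sample-wise average in the time domain and that the information corresponding to the difference between channels is a broadband level difference in decibels relative to a reference channel; both choices and all names are hypothetical, and the description does not limit the mixing or the characteristic parameter to these forms.

```python
import math


def downmix(channels):
    """Mix C channels by averaging corresponding samples."""
    num_channels = len(channels)
    return [sum(samples) / num_channels for samples in zip(*channels)]


def level_differences_db(channels, reference=0, eps=1e-12):
    """For each of the C-1 channels other than the reference channel,
    return its energy relative to the reference channel in dB.
    eps is a small hypothetical constant that avoids division by zero."""
    def energy(signal):
        return sum(s * s for s in signal) + eps

    ref_energy = energy(channels[reference])
    return [10.0 * math.log10(energy(ch) / ref_energy)
            for index, ch in enumerate(channels) if index != reference]
```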
Similarly, the decoding device 200 of the first to fifth embodiments can obtain and output the decoded sound signals of the C channels based on the inputted monaural code CM, additional code CA and stereo code CS; the decoding device 200 of the sixth embodiment can obtain and output the decoded sound signals of the C channels based on the inputted monaural code CM and stereo code CS; and the stereo decoding unit 220 can obtain and output the decoded sound signals of the C channels from the decoded downmix signal by the upmix processing using the characteristic parameter obtained based on the inputted stereo code CS. More specifically, the stereo decoding unit 220 can obtain and output the decoded sound signals of the C channels, regarding the decoded downmix signal as a signal obtained by mixing the decoded sound signals of the C channels, and regarding the characteristic parameter obtained based on the inputted stereo code CS as information representing the characteristic of the difference between channels of the decoded sound signals of the C channels.
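The idea of regarding the decoded downmix signal as a mixture of the C decoded channels can be sketched as below. The sketch assumes a hypothetical characteristic parameter consisting of one broadband amplitude ratio per non-reference channel, and a downmix defined as the sample-wise average; it works in the time domain and omits the windowed frequency-domain processing of the actual upmix processing.

```python
def upmix(decoded_downmix, relative_gains):
    """Return C channels whose sample-wise average equals the decoded
    downmix signal.

    relative_gains holds C-1 hypothetical amplitude ratios of the
    non-reference channels to the reference channel (gain 1.0).
    """
    gains = [1.0] + list(relative_gains)
    # Normalize so that the average of the C gains is 1.
    scale = len(gains) / sum(gains)
    return [[scale * g * m for m in decoded_downmix] for g in gains]
```

With `relative_gains` equal to `[1.0]`, both output channels are identical to the decoded downmix signal, which corresponds to the case of no level difference between channels.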
The processing of each unit of each encoding device and each decoding device described above may be realized by a computer. In this case, the processing content of a function that each device should have is written by a program. Then, by loading the program onto a storage 1020 of a computer shown in
The program in which the processing content is written can be recorded in a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium, specifically, a magnetic recording device, an optical disk or the like.
Further, distribution of this program is performed, for example, by sales, transfer, lending or the like of a portable recording medium such as a DVD or a CD-ROM in which the program is recorded. Furthermore, a configuration is also possible in which this program is stored in a storage of a server computer, and is distributed by transfer from the server computer to other computers via a network.
The computer that executes such a program first stores the program recorded in a portable recording medium or transferred from a server computer in an auxiliary recording unit 1050, which is its own non-transitory storage. Then, at the time of executing processing, the computer loads the program stored in the auxiliary recording unit 1050 into the storage 1020 and executes the processing according to the loaded program. Further, as another execution form of this program, a computer may directly load the program from a portable recording medium into the storage 1020 and execute the processing according to the program. Furthermore, each time a program is transferred to the computer from a server computer, the computer may sequentially execute processing according to the received program. Further, a configuration is also possible in which the above processing is executed by a so-called ASP (application service provider) type service in which, without transferring the program to the computer from the server computer, the processing functions are realized only by an instruction to execute the program and acquisition of a result. Note that it is assumed that the program described herein includes information which is provided for processing by an electronic calculator and is equivalent to a program (data or the like which is not a direct command to the computer but has a nature of specifying processing of the computer).
Though it is assumed in the description above that the device is configured by causing a predetermined program to be executed on a computer, at least a part of the processing content may be realized by hardware.
In addition, it goes without saying that it is possible to appropriately make changes within a range not departing from the spirit of the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/024775 | 6/24/2020 | WO | |