This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2012-147500, filed on Jun. 29, 2012, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to, for example, an audio encoding device, an audio encoding method, a computer-readable recording medium storing an audio encoding computer program, and an audio decoding device.
Currently, methods for encoding audio signals that compress the amount of data of multichannel audio signals of three or more channels are being developed. As one such encoding method, the MPEG Surround method standardized by the Moving Picture Experts Group (MPEG) is known. In the MPEG Surround method, for example, 5.1-channel (5.1ch) audio signals to be encoded are subjected to a time-frequency transform, and the frequency signals obtained by the time-frequency transform are downmixed, thereby generating frequency signals of three channels. The frequency signals of the three channels are further downmixed, and, as a result, frequency signals corresponding to stereophonic signals of two channels are calculated. The frequency signals corresponding to the stereophonic signals are then encoded using an Advanced Audio Coding (AAC) encoding method and a Spectral Band Replication (SBR) encoding method. In addition, in the MPEG Surround method, when the 5.1ch signals are downmixed to generate the signals of the three channels and when the signals of the three channels are downmixed to generate the signals of the two channels, spatial information indicating the diffusion or the location of a sound is calculated and encoded. Thus, in the MPEG Surround method, the stereophonic signals generated by downmixing the multichannel audio signals and the spatial information, whose amount of data is relatively small, are encoded. Therefore, the MPEG Surround method achieves higher compression efficiency than a method in which the signal of each channel included in the multichannel audio signals is encoded separately.
In the MPEG Surround method, in order to reduce the amount of information to be encoded, the frequency signals of the three channels are represented by stereophonic frequency signals and two channel prediction coefficients, which are then encoded. The channel prediction coefficients are coefficients for performing predictive coding on the signal of one of the three channels on the basis of the signals of the other two channels. A plurality of channel prediction coefficients are stored in a table called a “code book”. The code book is used to improve bit efficiency: when an encoder and a decoder have a predetermined common code book (or code books created using a common method), important information may be transmitted with a smaller number of bits. In decoding, the signal of one of the three channels is reproduced on the basis of the channel prediction coefficients. Therefore, in encoding, the channel prediction coefficients are selected from the code book.
As a method for selecting the channel prediction coefficients from the code book, a method has been disclosed in which an error defined by a difference between a channel signal before predictive coding and a channel signal after the predictive coding is calculated using all the channel prediction coefficients stored in the code book, and a channel prediction coefficient with which the error caused by the predictive coding becomes smallest is selected. In Japanese National Publication of International Patent Application No. 2008-517338, a method is disclosed in which a channel prediction coefficient with which an error becomes smallest is calculated using a calculation method adopting a method of least squares.
In accordance with an aspect of the embodiments, an audio encoding device includes a processor; and a memory which stores a plurality of instructions, which when executed by the processor, cause the processor to execute, calculating first phases indicating phases of a first channel signal and a second channel signal included in audio signals of a plurality of channels; and performing, on the basis of the first phases, either first predictive coding in which a third channel signal included in the audio signals of the plurality of channels is predicted using the first channel signal and the second channel signal or second predictive coding in which the second channel signal is predicted using the first channel signal.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
These and/or other aspects and advantages will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawing of which:
An audio encoding device, an audio encoding method, an audio encoding computer program, and an audio decoding device according to embodiments will be described in detail hereinafter with reference to the drawings. These embodiments do not limit the technology disclosed herein.
(First Embodiment)
These components included in the audio encoding device 1 are formed as separate circuits. Alternatively, these components included in the audio encoding device 1 may be mounted on the audio encoding device 1 as a single integrated circuit in which circuits corresponding thereto are integrated with one another. Alternatively, these components included in the audio encoding device 1 may be function modules realized by a computer program executed by a processor included in the audio encoding device 1.
The time-frequency transform unit 11 transforms a signal of each channel of multichannel audio signals in a time domain input to the audio encoding device 1 into a frequency signal of each channel by performing a time-frequency transform for each frame. In the present embodiment, the time-frequency transform unit 11 transforms a signal of each channel into a frequency signal using a quadrature mirror filter (QMF) bank represented by the following expression:
Here, n is a variable denoting time, that is, an n-th time when an audio signal of one frame is divided into 128 pieces in a time direction. A frame length may be, for example, within a range of 10 to 80 ms. k is a variable denoting a frequency band, that is, a k-th frequency band when a frequency band included in a frequency signal is divided into 64 pieces. QMF(k, n) is a QMF for outputting a frequency signal of a time n and a frequency band k. The time-frequency transform unit 11 multiplies an input audio signal of one frame of a channel by QMF(k, n) to generate a frequency signal of the channel. Alternatively, the time-frequency transform unit 11 may transform a signal of each channel into a frequency signal by using another time-frequency transform process such as a fast Fourier transform, a discrete cosine transform, or a modified discrete cosine transform (MDCT).
Each time the time-frequency transform unit 11 has calculated a frequency signal of each channel for each frame, the time-frequency transform unit 11 outputs the frequency signal of each channel to the first downmixing unit 12.
Each time the first downmixing unit 12 has received a frequency signal of each channel, the first downmixing unit 12 downmixes the frequency signal of each channel to generate frequency signals of a left channel, a center channel, and a right channel. For example, the first downmixing unit 12 calculates the frequency signals of the three channels in accordance with the following expressions:
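$$
\begin{aligned}
L_{in}(k,n) &= L_{inRe}(k,n) + j\,L_{inIm}(k,n)\\
L_{inRe}(k,n) &= L_{Re}(k,n) + SL_{Re}(k,n)\\
L_{inIm}(k,n) &= L_{Im}(k,n) + SL_{Im}(k,n)\\
R_{in}(k,n) &= R_{inRe}(k,n) + j\,R_{inIm}(k,n)\\
R_{inRe}(k,n) &= R_{Re}(k,n) + SR_{Re}(k,n)\\
R_{inIm}(k,n) &= R_{Im}(k,n) + SR_{Im}(k,n)\\
C_{in}(k,n) &= C_{inRe}(k,n) + j\,C_{inIm}(k,n)\\
C_{inRe}(k,n) &= C_{Re}(k,n) + LFE_{Re}(k,n)\\
C_{inIm}(k,n) &= C_{Im}(k,n) + LFE_{Im}(k,n)
\end{aligned}
\tag{2}
$$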
Here, LRe(k, n) denotes a real part of a frequency signal L(k, n) of a left front channel, and LIm(k, n) denotes an imaginary part of the frequency signal L(k, n) of the left front channel. SLRe(k, n) denotes a real part of a frequency signal SL(k, n) of a left rear channel, and SLIm(k, n) denotes an imaginary part of the frequency signal SL(k, n) of the left rear channel. Lin(k, n) denotes a frequency signal of the left channel generated by the downmixing. LinRe(k, n) denotes a real part of the frequency signal of the left channel, and LinIm(k, n) denotes an imaginary part of the frequency signal of the left channel.
Similarly, RRe(k, n) denotes a real part of a frequency signal R(k, n) of a right front channel, and RIm(k, n) denotes an imaginary part of the frequency signal R(k, n) of the right front channel. SRRe(k, n) denotes a real part of a frequency signal SR(k, n) of a right rear channel, and SRIm(k, n) denotes an imaginary part of the frequency signal SR(k, n) of the right rear channel. Rin(k, n) denotes a frequency signal of the right channel generated by the downmixing. RinRe(k, n) denotes a real part of the frequency signal of the right channel, and RinIm(k, n) denotes an imaginary part of the frequency signal of the right channel.
Furthermore, CRe(k, n) denotes a real part of a frequency signal C(k, n) of a center channel, and CIm(k, n) denotes an imaginary part of the frequency signal C(k, n) of the center channel. LFERe(k, n) denotes a real part of a frequency signal LFE(k, n) of a low-frequency effects channel, and LFEIm(k, n) denotes an imaginary part of the frequency signal LFE(k, n) of the low-frequency effects channel. Cin(k, n) denotes a frequency signal of the center channel generated by the downmixing. CinRe(k, n) denotes a real part of the frequency signal Cin(k, n) of the center channel, and CinIm(k, n) denotes an imaginary part of the frequency signal Cin(k, n) of the center channel.
The first downmixing unit 12 calculates, for each frequency band, as spatial information between the frequency signals of two channels to be downmixed, a difference in intensity between the frequency signals, which is information indicating the location of a sound, and a degree of similarity between the frequency signals, which is information indicating the diffusion of a sound. These pieces of spatial information calculated by the first downmixing unit 12 are examples of three-channel spatial information. In the present embodiment, the first downmixing unit 12 calculates a difference in intensity CLDL(k) and a degree of similarity ICCL(k) of the frequency band k for the left channel in accordance with the following expressions:
Here, N is the number of samples in the time direction included in one frame, which is 128 in the present embodiment. eL(k) is an autocorrelation value of the frequency signal L(k, n) of the left front channel, and eSL(k) is an autocorrelation value of the frequency signal SL(k, n) of the left rear channel. eLSL(k) is a cross-correlation value of the frequency signal L(k, n) of the left front channel and the frequency signal SL(k, n) of the left rear channel.
Similarly, the first downmixing unit 12 calculates a difference in intensity CLDR(k) and a degree of similarity ICCR(k) of the frequency band k for the right channel in accordance with the following expressions:
Here, eR(k) is an autocorrelation value of the frequency signal R(k, n) of the right front channel, and eSR(k) is an autocorrelation value of the frequency signal SR(k, n) of the right rear channel. eRSR(k) is a cross-correlation value of the frequency signal R(k, n) of the right front channel and the frequency signal SR(k, n) of the right rear channel.
Furthermore, the first downmixing unit 12 calculates a difference in intensity CLDC(k) of the frequency band k for the center channel in accordance with the following expressions:
Here, eC(k) is an autocorrelation value of the frequency signal C(k, n) of the center channel, and eLFE(k) is an autocorrelation value of the frequency signal LFE(k, n) of the low-frequency effects channel.
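The difference in intensity and the degree of similarity can be sketched in Python with NumPy. The expressions themselves are not reproduced in this text, so the sketch assumes the definitions conventionally used in MPEG Surround: CLD(k) = 10 log10(e_a(k)/e_b(k)) and ICC(k) = Re{e_ab(k)}/sqrt(e_a(k) e_b(k)), where e_ab(k) is the sum over n of a(k, n) multiplied by the complex conjugate of b(k, n). The function names and the guard value `eps` are hypothetical.

```python
import numpy as np


def autocorrelation(x):
    """e(k) = sum over n of |x(k, n)|^2 for every band k; x has shape (bands, N)."""
    return np.sum(np.abs(x) ** 2, axis=1)


def spatial_parameters(a, b, eps=1e-12):
    """Difference in intensity CLD(k) in dB and degree of similarity ICC(k)
    between two frequency signals a(k, n) and b(k, n).

    Assumed definitions (not reproduced in the source text):
      CLD(k) = 10 log10(e_a(k) / e_b(k))
      ICC(k) = Re{e_ab(k)} / sqrt(e_a(k) e_b(k)),  e_ab(k) = sum_n a(k, n) conj(b(k, n))
    eps is a hypothetical guard against division by zero in silent bands.
    """
    e_a = autocorrelation(a)
    e_b = autocorrelation(b)
    e_ab = np.sum(a * np.conj(b), axis=1)  # cross-correlation value
    cld = 10.0 * np.log10((e_a + eps) / (e_b + eps))
    icc = np.real(e_ab) / np.sqrt(e_a * e_b + eps)
    return cld, icc


rng = np.random.default_rng(0)
N_BANDS, N = 64, 128  # 64 frequency bands, 128 time samples per frame
L = rng.standard_normal((N_BANDS, N)) + 1j * rng.standard_normal((N_BANDS, N))
SL = 0.5 * L  # left rear channel: here a scaled copy of the left front channel

cld_l, icc_l = spatial_parameters(L, SL)  # CLD_L(k), ICC_L(k)
```

Applying the same function to R(k, n) and SR(k, n) gives CLDR(k) and ICCR(k), and applying it to C(k, n) and LFE(k, n) while keeping only the first return value gives CLDC(k).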
After generating the frequency signals of the three channels, the first downmixing unit 12 further downmixes the frequency signal of the left channel and the frequency signal of the center channel to generate a left frequency signal of stereophonic frequency signals. The first downmixing unit 12 downmixes the frequency signal of the right channel and the frequency signal of the center channel to generate a right frequency signal of the stereophonic frequency signals. The first downmixing unit 12 generates a left frequency signal L0(k, n) and a right frequency signal R0(k, n) of the stereophonic frequency signals in accordance with, for example, the following expression. Furthermore, for example, the first downmixing unit 12 calculates a signal C0(k, n) of the center channel used to select a channel prediction coefficient included in a code book in accordance with the following expression:
Here, Lin(k, n), Rin(k, n), and Cin(k, n) are the frequency signals of the left channel, the right channel, and the center channel, respectively, generated by the first downmixing unit 12. The left frequency signal L0(k, n) is a combination of the frequency signals of the left front channel, the left rear channel, the center channel, and the low-frequency effects channel of the original multichannel audio signals. Similarly, the right frequency signal R0(k, n) is a combination of the frequency signals of the right front channel, the right rear channel, the center channel, and the low-frequency effects channel of the original multichannel audio signals.
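A minimal sketch of the two downmixing stages, assuming complex NumPy arrays indexed as (k, n). The first function restates expression (2). Expression 8 is not reproduced in this text, so the second function assumes the weights of the MPEG Surround two-to-three convention (L0 = Lin + (√2/2)Cin, R0 = Rin + (√2/2)Cin, C0 = Lin + Rin − (√2/2)Cin); these weights and the function names are assumptions, not the embodiment's definition.

```python
import numpy as np

HALF_SQRT2 = np.sqrt(2.0) / 2.0


def downmix_to_three(L, SL, R, SR, C, LFE):
    """Expression (2): adding real and imaginary parts separately is the same
    as adding the complex frequency signals directly."""
    return L + SL, R + SR, C + LFE


def downmix_to_stereo(L_in, R_in, C_in):
    """Assumed form of expression 8 (MPEG Surround style weights)."""
    L0 = L_in + HALF_SQRT2 * C_in
    R0 = R_in + HALF_SQRT2 * C_in
    C0 = L_in + R_in - HALF_SQRT2 * C_in  # center signal used for code book selection
    return L0, R0, C0


L0, R0, C0 = downmix_to_stereo(np.array([1.0]), np.array([2.0]), np.array([np.sqrt(2.0)]))
```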
The first downmixing unit 12 outputs the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel to the calculation unit 13 and the second downmixing unit 14. The first downmixing unit 12 also outputs the differences in intensity CLDL(k), CLDR(k), and CLDC(k) and the degrees of similarity ICCL(k) and ICCR(k), which are the spatial information, to the spatial information encoding unit 20.
The calculation unit 13 receives the frequency signals of the three channels, namely the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel, from the first downmixing unit 12. The calculation unit 13 then calculates first phases, which indicate the phases of the left frequency signal L0(k, n) and the right frequency signal R0(k, n). As needed, the calculation unit 13 also calculates second phases, which indicate the phases of the signal C0(k, n) of the center channel and either the left frequency signal L0(k, n) or the right frequency signal R0(k, n).
The calculation unit 13 outputs the left frequency signal L0(k, n), the right frequency signal R0(k, n), the signal C0(k, n) of the center channel, and the first phases to the predictive coding unit 15. As needed, the calculation unit 13 also outputs the second phases to the predictive coding unit 15. The reason why the calculation unit 13 calculates the first phases and the second phases will be described in detail later; in short, the predictive coding unit 15 uses these phases to determine whether predictive coding of the signal C0(k, n) of the center channel using the left frequency signal L0(k, n) and the right frequency signal R0(k, n) is possible (that is, whether the resulting error would be significantly large).
Here, a specific method by which the calculation unit 13 calculates the first phases and the second phases will be described. First, the calculation of the first phases will be described. Expanding the left frequency signal L0(k, n) and the right frequency signal R0(k, n) given in expression 8 yields the following expressions:
Next, the following substitutions are made in expression 9:
As a result, cosθ1, which corresponds to the first phases, may be calculated by the following expression:
Here, if the value of cosθ1 is −1, the first phases are opposite phases, and if the value of cosθ1 is 1, the first phases are identical phases. Calculation of the second phases may be performed in the same manner as the calculation of the first phases, and therefore detailed description thereof is omitted.
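Because cosθ1 is the cosine of the angle between the vector of L0(k, n) and the vector of R0(k, n) in one frequency band, it can be sketched as a normalized inner product. This sketch assumes the complex samples are treated as a real vector of length 2N, so that the inner product is the real part of the sum over n of L0(k, n) times the conjugate of R0(k, n). It also introduces a hypothetical tolerance `tol`, since floating-point values rarely equal exactly 1 or −1.

```python
import numpy as np


def first_phase_cosine(L0_k, R0_k, eps=1e-12):
    """cos(theta1) between the vectors of L0(k, n) and R0(k, n) over n in one band.

    Complex samples are treated as a real vector of length 2N, so the inner
    product is Re{sum_n L0(k, n) * conj(R0(k, n))}."""
    inner = np.real(np.vdot(R0_k, L0_k))  # np.vdot conjugates its first argument
    norms = np.linalg.norm(L0_k) * np.linalg.norm(R0_k)
    return inner / (norms + eps)


def classify_first_phases(cos_theta1, tol=1e-3):
    """Hypothetical tolerance-based classification of the first phases."""
    if cos_theta1 >= 1.0 - tol:
        return "identical"  # cos(theta1) = 1
    if cos_theta1 <= -1.0 + tol:
        return "opposite"   # cos(theta1) = -1
    return "other"          # first predictive coding is applicable


rng = np.random.default_rng(2)
x = rng.standard_normal(128) + 1j * rng.standard_normal(128)
phase_class = classify_first_phases(first_phase_cosine(x, 2.0 * x))
```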
The second downmixing unit 14 downmixes two of the frequency signals of the three channels received from the first downmixing unit 12, namely the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel, to generate stereophonic frequency signals of two channels. The second downmixing unit 14 then outputs the generated stereophonic frequency signals to the channel signal encoding unit 16. Details of the operation of the second downmixing unit 14 will be described later.
The predictive coding unit 15 selects, from the code book, channel prediction coefficients for the frequency signals of the two channels downmixed by the second downmixing unit 14. For convenience of description, predictive coding of the signal C0(k, n) of the center channel based on the right frequency signal R0(k, n) and the left frequency signal L0(k, n) will be referred to as first predictive coding. When the predictive coding unit 15 performs the first predictive coding, the second downmixing unit 14 downmixes the left frequency signal L0(k, n) and the right frequency signal R0(k, n) to generate the stereophonic frequency signals of the two channels. When the first phases are neither identical phases nor opposite phases, the predictive coding unit 15 performs the first predictive coding, for a reason that will be described later. When performing the first predictive coding, the predictive coding unit 15 selects from the code book, for each frequency band, the channel prediction coefficients c1(k) and c2(k) that minimize the error d(k) between the frequency signals before and after the predictive coding, which is defined by expression 11 on the basis of C0(k, n), L0(k, n), and R0(k, n). By performing this predictive coding, the predictive coding unit 15 generates a signal C′0(k, n) of the center channel after the predictive coding.
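The code book search in the first predictive coding can be sketched as an exhaustive search. The sketch assumes expression 11 has the form d(k) = Σn |C0(k, n) − C′0(k, n)|² with C′0(k, n) = c1(k)L0(k, n) + c2(k)R0(k, n), and it uses a hypothetical code book of real coefficients from −2.0 to 3.0 in steps of 0.1; neither the grid nor the function name comes from the embodiment.

```python
import itertools

import numpy as np

# Hypothetical code book: real coefficients from -2.0 to 3.0 in steps of 0.1.
CODEBOOK = np.round(np.arange(-2.0, 3.0 + 1e-9, 0.1), 1)


def first_predictive_coding(C0_k, L0_k, R0_k, codebook=CODEBOOK):
    """Return (c1, c2, d) minimizing d(k) = sum_n |C0(k, n) - C'0(k, n)|^2,
    with C'0(k, n) = c1 L0(k, n) + c2 R0(k, n) (assumed form of expression 11)."""
    best = (None, None, np.inf)
    for c1, c2 in itertools.product(codebook, repeat=2):
        C0_pred = c1 * L0_k + c2 * R0_k
        d = float(np.sum(np.abs(C0_k - C0_pred) ** 2))
        if d < best[2]:
            best = (float(c1), float(c2), d)
    return best


rng = np.random.default_rng(1)
L0 = rng.standard_normal(128) + 1j * rng.standard_normal(128)
R0 = rng.standard_normal(128) + 1j * rng.standard_normal(128)
c1, c2, d = first_predictive_coding(0.5 * L0 + 1.5 * R0, L0, R0)

# Opposite phases: R0 = -L0, so c1*L0 + c2*R0 = (c1 - c2)*L0 spans one direction only.
C0_other = rng.standard_normal(128) + 1j * rng.standard_normal(128)
_, _, d_collinear = first_predictive_coding(C0_other, L0, -L0)
```

The second call illustrates the difficulty discussed in this embodiment: when R0(k, n) is a scaled copy of L0(k, n), every candidate c1(k)L0(k, n) + c2(k)R0(k, n) lies along L0(k, n), so any component of C0(k, n) outside that direction remains as error whichever coefficient pair is selected.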
The predictive coding unit 15 refers to a quantization table, included in the predictive coding unit 15, that represents correspondences between typical values of the channel prediction coefficients c1(k) and c2(k) included in the code book and index values. By referring to the quantization table, the predictive coding unit 15 determines, for each frequency band, the index values closest to the channel prediction coefficients c1(k) and c2(k). Here, a specific example will be described.
Next, the predictive coding unit 15 calculates, for each frequency band, a difference value between index values in a frequency direction. For example, if the index value for the frequency band k is 2 and the index value for a frequency band (k-1) is 4, the predictive coding unit 15 determines the difference value between the index values for the frequency band k as −2.
Next, the predictive coding unit 15 refers to a coding table representing correspondences between difference values of index values and channel prediction coefficient codes. By referring to the coding table, the predictive coding unit 15 determines, for each frequency band, the channel prediction coefficient code idxcm(k) corresponding to the difference value for the channel prediction coefficient cm(k) (m = 1, 2 for the first predictive coding, or m = 1 for the second predictive coding). As with the similarity code, the channel prediction coefficient code may be, for example, a variable-length code whose code length becomes shorter as the frequency of occurrence of the difference value becomes higher, such as a Huffman code or an arithmetic code. The quantization table and the coding table are stored in advance in a memory (not illustrated) included in the predictive coding unit 15.
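The quantization and frequency-direction differencing steps can be sketched as follows. The quantization table values are hypothetical, and the sketch assumes the lowest band's index is passed through unchanged because the text does not say how the first band is handled; the final lookup in the coding table is omitted.

```python
import numpy as np


def quantize_to_indices(values, table):
    """For each band, the index of the closest typical value in a quantization table."""
    values = np.asarray(values, dtype=float)[:, None]
    table = np.asarray(table, dtype=float)[None, :]
    return np.argmin(np.abs(values - table), axis=1)


def frequency_differences(indices):
    """Difference between index values of band k and band k-1.

    Band 0 has no lower neighbour; passing it through unchanged is an assumption."""
    indices = np.asarray(indices)
    diffs = np.empty_like(indices)
    diffs[0] = indices[0]
    diffs[1:] = indices[1:] - indices[:-1]
    return diffs


table = [-0.5, 0.0, 0.1, 0.2]                        # hypothetical typical values
indices = quantize_to_indices([0.12, -0.49], table)  # nearest entries 0.1 and -0.5
diffs = frequency_differences([4, 2])                # example from the text: band k-1 = 4, band k = 2
```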
Next, the reason, newly found by the present inventors, why the error d(k) given by expression 11 may become significantly large when the predictive coding unit 15 performs the first predictive coding, so that it is difficult to perform the predictive coding properly, will be described.
Here, the predictive coding unit 15 may perform the predictive coding on the signal C0(k, n) of the center channel by selecting, from the code book, the channel prediction coefficients c1(k) and c2(k) that minimize the error d(k) between the signal C0(k, n) of the center channel before the predictive coding and the signal C′0(k, n) of the center channel after the predictive coding. This concept is represented by the expressions given in expression 9. The cosine cosθ1 of the angle between the vector of the left frequency signal L0(k, n) and the vector of the right frequency signal R0(k, n) corresponds to the first phases, which indicate the phases of the left frequency signal L0(k, n) and the right frequency signal R0(k, n). The cosine cosθ2 of the angle between the vector of the left frequency signal L0(k, n) or the right frequency signal R0(k, n) and the vector of the signal C0(k, n) of the center channel corresponds to the second phases, which indicate the phases of the signal C0(k, n) of the center channel and the left frequency signal L0(k, n) or the right frequency signal R0(k, n).
When the first phases are neither identical phases nor opposite phases, the signal C0(k, n) of the center channel may be decomposed into the vector of the left frequency signal L0(k, n) and the vector of the right frequency signal R0(k, n); in that case, the predictive coding unit 15 may perform the first predictive coding in preference to the second predictive coding and the like, which will be described later. This is because the left frequency signal L0(k, n) and the right frequency signal R0(k, n) generally have a high degree of similarity, and therefore the efficiency of the coding performed by the channel signal encoding unit 16 may be improved.
Therefore, even when it is difficult to decompose the signal C0(k, n) of the center channel into the vector of the left frequency signal L0(k, n) and the vector of the right frequency signal R0(k, n) (when it is difficult to properly perform the predictive coding on the signal C0(k, n) of the center channel), the right frequency signal R0(k, n) may be properly subjected to the predictive coding by utilizing the vector of the left frequency signal L0(k, n) in the second predictive coding. By performing the predictive coding not on the signal C0(k, n) of the center channel but on the right frequency signal R0(k, n) in the second predictive coding, the error caused by the predictive coding may be suppressed.
Alternatively, the predictive coding unit 15 may perform the predictive coding on the left frequency signal L0(k, n) by utilizing the vector of the right frequency signal R0(k, n) and by selecting, from the code book, the channel prediction coefficient c1(k) with which the error d(k) caused by the predictive coding becomes smallest. A left frequency signal L′0(k, n) after the predictive coding may be represented by the following expressions:
The predictive coding performed on the left frequency signal L0(k, n) by utilizing the right frequency signal R0(k, n), or on the right frequency signal R0(k, n) by utilizing the left frequency signal L0(k, n), will be referred to herein as the second predictive coding for convenience of description. The predictive coding unit 15 may define the smallest error d(k) calculated from expression 12 as a first error and the smallest error d(k) calculated from expression 13 as a second error, compare the first and second errors, and perform the second predictive coding using expression 12 or expression 13, whichever yields the smaller error d(k).
However, when the first phases are identical phases, the angle θ1 between the vector of the left frequency signal L0(k, n) and the vector of the right frequency signal R0(k, n) is 0° (cosθ1 = 1). In this case as well, the right frequency signal R0(k, n) may be subjected to the predictive coding by, for example, utilizing the vector of the left frequency signal L0(k, n) and selecting, from the code book, the channel prediction coefficient c1(k) that minimizes the error d(k) caused by the predictive coding. The right frequency signal R′0(k, n) after the predictive coding may be represented by expression 12.
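Both directions of the second predictive coding can be sketched as two one-coefficient searches followed by a comparison of the first and second errors. The sketch assumes expression 12 is R′0(k, n) = c1(k)L0(k, n) and expression 13 is L′0(k, n) = c1(k)R0(k, n), each with d(k) taken as the sum over n of the squared magnitude of the difference; the code book and function names are hypothetical.

```python
import numpy as np

# Hypothetical code book of real channel prediction coefficients.
CODEBOOK = [-1.0, -0.5, 0.5, 1.0]


def best_single_coefficient(target, reference, codebook=CODEBOOK):
    """Pick c minimizing d = sum_n |target(n) - c * reference(n)|^2."""
    best_c, best_d = None, np.inf
    for c in codebook:
        d = float(np.sum(np.abs(target - c * reference) ** 2))
        if d < best_d:
            best_c, best_d = c, d
    return best_c, best_d


def second_predictive_coding(L0_k, R0_k, codebook=CODEBOOK):
    """Compare both directions and keep the one with the smaller error."""
    c_r, first_error = best_single_coefficient(R0_k, L0_k, codebook)   # assumed expression 12
    c_l, second_error = best_single_coefficient(L0_k, R0_k, codebook)  # assumed expression 13
    if first_error <= second_error:
        return "predict R0 from L0", c_r, first_error
    return "predict L0 from R0", c_l, second_error


rng = np.random.default_rng(4)
x = rng.standard_normal(128) + 1j * rng.standard_normal(128)
opposite = second_predictive_coding(x, -x)  # opposite phases: R0 = -L0
scaled = second_predictive_coding(0.5 * x, x)  # L0 = 0.5 * R0
```

Only one coefficient is searched per direction, which is why the second predictive coding needs fewer candidate evaluations than the two-coefficient search of the first predictive coding.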
Alternatively, the predictive coding unit 15 may perform the predictive coding on the left frequency signal L0(k, n) by utilizing the vector of the right frequency signal R0(k, n) and by selecting, from the code book, the channel prediction coefficient c1(k) with which the error d(k) caused by the predictive coding becomes smallest. The left frequency signal L′0(k, n) after the predictive coding may be represented by the expression 13.
Here, when the predictive coding unit 15 performs the second predictive coding, the second downmixing unit 14 downmixes the signal C0(k, n) of the center channel and either the right frequency signal R0(k, n) or the left frequency signal L0(k, n) to generate the stereophonic frequency signals of the two channels.
The predictive coding unit 15 generates selection information indicating whether the first predictive coding or the second predictive coding has been performed, and outputs the selection information to the second downmixing unit 14 and the multiplexing unit 21.
As described above, by performing the predictive coding on the basis of the first phases received from the calculation unit 13, the predictive coding unit 15 may suppress the error caused by the predictive coding. Furthermore, since only one channel prediction coefficient needs to be selected when the second predictive coding is performed, the processing load of the coding may also be reduced.
The second downmixing unit 14 receives the selection information from the predictive coding unit 15 and downmixes two of the frequency signals of the three channels, namely the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel, on the basis of the selection information, in order to generate the stereophonic frequency signals of the two channels. More specifically, when the selection information includes the information indicating that the first predictive coding has been performed, the second downmixing unit 14 outputs, for example, the left frequency signal L0(k, n) and the right frequency signal R0(k, n) to the channel signal encoding unit 16 as first stereophonic frequency signals. On the other hand, when the selection information includes the information indicating that the second predictive coding has been performed, the second downmixing unit 14 outputs, for example, the signal C0(k, n) of the center channel and either the left frequency signal L0(k, n) or the right frequency signal R0(k, n) to the channel signal encoding unit 16 as second stereophonic frequency signals.
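How the second downmixing unit 14 might route signals according to the selection information can be sketched as follows. The text states only that C0(k, n) is paired with either L0(k, n) or R0(k, n) in the second case; this sketch assumes the signal used as the prediction reference is the one kept, so that the predicted signal can be recovered from the reference and its channel prediction coefficient. The enumeration and function names are hypothetical.

```python
from enum import Enum


class Selection(Enum):
    """Hypothetical encoding of the selection information."""
    FIRST = "C0 predicted from L0 and R0"
    SECOND_R_FROM_L = "R0 predicted from L0"
    SECOND_L_FROM_R = "L0 predicted from R0"


def second_downmix(selection, L0, R0, C0):
    """Return the pair of signals passed to the channel signal encoding unit 16."""
    if selection is Selection.FIRST:
        return L0, R0  # first stereophonic frequency signals
    if selection is Selection.SECOND_R_FROM_L:
        return L0, C0  # assumption: the reference L0 is kept together with C0
    return R0, C0      # assumption: the reference R0 is kept together with C0


pair = second_downmix(Selection.SECOND_R_FROM_L, "L0", "R0", "C0")
```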
The channel signal encoding unit 16 encodes the stereophonic frequency signals received from the second downmixing unit 14. The channel signal encoding unit 16 includes the SBR encoding section 17, the frequency-time transform section 18, and the AAC encoding section 19.
Upon receiving each stereophonic frequency signal, the SBR encoding section 17 encodes, for each channel, the high-frequency component, which is the component included in a high-frequency band of the stereophonic frequency signal, in accordance with an SBR encoding method, thereby generating an SBR code. For example, as disclosed in Japanese Laid-open Patent Publication No. 2008-224902, the SBR encoding section 17 replicates the low-frequency component of the frequency signal of each channel that has a strong correlation with the high-frequency component to be subjected to the SBR encoding. The low-frequency component is the component of the frequency signal of each channel included in a low-frequency band, which is lower than the high-frequency band including the high-frequency component to be encoded by the SBR encoding section 17, and is encoded by the AAC encoding section 19, which will be described later. The SBR encoding section 17 then adjusts the power of the high-frequency component obtained by the replication so as to match the power of the original high-frequency component. The SBR encoding section 17 also determines, as auxiliary information, any component of the original high-frequency component that differs so much from the low-frequency component that it is difficult to approximate even by replicating the low-frequency component. The SBR encoding section 17 then performs the encoding by quantizing information indicating the positional relationship between the low-frequency component used for the replication and the corresponding high-frequency component, the amount of power adjustment, and the auxiliary information. The SBR encoding section 17 outputs the SBR code, which is the encoded information, to the multiplexing unit 21.
Upon receiving each stereophonic frequency signal, the frequency-time transform section 18 transforms the stereophonic frequency signal of each channel into a stereophonic signal in the time domain. For example, when the time-frequency transform unit 11 uses a QMF bank, the frequency-time transform section 18 performs a frequency-time transform on the stereophonic frequency signal of each channel using a complex QMF bank, which is represented by the following expression:
Here, IQMF(k, n) is a complex QMF having the time n and the frequency k as variables. When the time-frequency transform unit 11 uses another time-frequency transform process such as a fast Fourier transform, a discrete cosine transform, or an MDCT, the frequency-time transform section 18 uses the inverse of that transform process. The frequency-time transform section 18 outputs the stereophonic signal of each channel, obtained by performing the frequency-time transform on the frequency signal of each channel, to the AAC encoding section 19.
Upon receiving the stereophonic signal of each channel, the AAC encoding section 19 encodes the low-frequency component of the signal of each channel in accordance with an AAC encoding method to generate an AAC code. For example, the AAC encoding section 19 may use the technology disclosed in Japanese Laid-open Patent Publication No. 2007-183528. More specifically, the AAC encoding section 19 regenerates the stereophonic frequency signal by performing a discrete cosine transform on the received stereophonic signal of each channel. The AAC encoding section 19 then calculates perceptual entropy (PE) from the regenerated stereophonic frequency signal. The PE indicates the amount of information needed to quantize a block such that a listener does not perceive noise.
The PE has the characteristic that its value becomes large for a sound whose signal level changes within a short period of time, such as an attack sound generated by a percussion instrument. Therefore, the AAC encoding section 19 shortens the window for a block for which the value of the PE becomes relatively large, and lengthens the window for a block for which the value of the PE becomes relatively small. For example, a short window includes 256 samples, and a long window includes 2,048 samples. The AAC encoding section 19 performs an MDCT on the stereophonic signal of each channel using a window of the determined length, thereby transforming the stereophonic signal of each channel into a set of MDCT coefficients. The AAC encoding section 19 then quantizes the set of MDCT coefficients and performs variable-length coding on the quantized MDCT coefficients. The AAC encoding section 19 outputs the variable-length-coded MDCT coefficients and related information such as quantization coefficients to the multiplexing unit 21 as an AAC code.
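When a fast Fourier transform is used instead of the QMF bank, the frequency-time transform is simply its inverse. A minimal round-trip sketch with NumPy's real FFT, assuming one real-valued frame at a time and no windowing or overlap (a practical codec would add both); the function names are hypothetical.

```python
import numpy as np


def time_to_frequency(frame):
    """Forward transform of one real-valued frame (FFT variant of unit 11)."""
    return np.fft.rfft(frame)


def frequency_to_time(spectrum, frame_length):
    """Inverse transform, as used by the frequency-time transform section 18."""
    return np.fft.irfft(spectrum, n=frame_length)


rng = np.random.default_rng(3)
frame = rng.standard_normal(1024)
restored = frequency_to_time(time_to_frequency(frame), len(frame))
```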
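The window-length decision and one window shape can be sketched as follows. The PE threshold is a hypothetical tuning parameter, and the sine window is one of the two window shapes AAC permits (the other is the Kaiser-Bessel-derived window). The check in the usage lines confirms the Princen-Bradley condition, which lets overlapping MDCT blocks cancel time-domain aliasing.

```python
import numpy as np

SHORT_WINDOW = 256   # samples, from the text
LONG_WINDOW = 2048   # samples, from the text


def select_window_length(pe, pe_threshold):
    """Short window for blocks with large PE (attacks), long window otherwise.

    pe_threshold is a hypothetical tuning parameter."""
    return SHORT_WINDOW if pe > pe_threshold else LONG_WINDOW


def sine_window(length):
    """w(n) = sin(pi * (n + 0.5) / length), n = 0 .. length-1."""
    n = np.arange(length)
    return np.sin(np.pi * (n + 0.5) / length)


w = sine_window(select_window_length(pe=100.0, pe_threshold=500.0))
half = len(w) // 2
princen_bradley = w[:half] ** 2 + w[half:] ** 2  # should be 1 for every n
```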
The spatial information encoding unit 20 generates an MPEG Surround code (hereinafter referred to as an MPS code) from the spatial information received from the first downmixing unit 12 and the channel prediction coefficient code received from the predictive coding unit 15.
The spatial information encoding unit 20 refers to a quantization table representing correspondences between values of the degree of similarity included in the spatial information and index values. By referring to the quantization table, the spatial information encoding unit 20 determines, for each frequency band, the index value closest to the degree of similarity ICCi(k) (i=L, R, 0). The quantization table is stored in advance in a memory (not illustrated) included in the spatial information encoding unit 20.
Next, the spatial information encoding unit 20 calculates, for each frequency band, a difference value between index values in the frequency direction. For example, if the index value for the frequency band k is 3 and the index value for the frequency band (k-1) is 0, the spatial information encoding unit 20 determines the difference value between index values for the frequency band k as 3.
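The two steps above (nearest-index quantization, then differencing along the frequency axis) can be sketched as follows. The quantization table values are hypothetical and chosen only for illustration. The text does not say how the lowest band, which has no predecessor, is handled, so keeping its index unchanged is an assumption of this sketch.

```python
# Hypothetical quantization table for the degree of similarity (illustrative values only).
ICC_TABLE = [1.0, 0.9, 0.8, 0.6, 0.35, 0.0, -0.6, -1.0]

def nearest_index(value, table):
    """Return the index of the quantization-table entry closest to value."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))

def frequency_differences(indices):
    """Difference idx[k] - idx[k-1] for each band k along the frequency axis.

    The lowest band is kept as-is (an assumption; the text does not cover it).
    """
    if not indices:
        return []
    return [indices[0]] + [indices[k] - indices[k - 1] for k in range(1, len(indices))]
```

With similarity values 0.98 and 0.62 in bands (k-1) and k, the indices are 0 and 3, and the difference for band k is 3, matching the example in the text.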
The spatial information encoding unit 20 refers to a coding table representing correspondences between difference values between index values and similarity codes. By referring to the coding table, the spatial information encoding unit 20 determines, for the degree of similarity ICCi(k) (i=L, R, 0), the similarity code idxicci(k) corresponding to the difference value between index values for each frequency band. The coding table is stored in advance in the memory or the like included in the spatial information encoding unit 20. The similarity code may be, for example, a variable-length code, such as a Huffman code or an arithmetic code, whose code length becomes shorter as the frequency of occurrence of the difference value becomes higher.
The spatial information encoding unit 20 refers to a quantization table representing correspondences between values of the difference in intensity and index values. The spatial information encoding unit 20 determines an index value closest to the difference in intensity CLDj(k) (j=L, R, C, 1, 2) for each frequency band by referring to the quantization table. The spatial information encoding unit 20 calculates, for each frequency band, a difference value between index values in the frequency direction. For example, if the index value for the frequency band k is 2 and the index value for the frequency band (k-1) is 4, the spatial information encoding unit 20 determines the difference value between index values for the frequency band k as −2.
The spatial information encoding unit 20 refers to a coding table representing correspondences between difference values between index values and intensity difference codes. The spatial information encoding unit 20 determines an intensity difference code idxcldj(k) (j=L, R, C) for the difference value of the frequency band k in the case of the difference in intensity CLDj(k) by referring to the coding table. The intensity difference code may be, for example, as with the similarity code, a variable-length code whose code length becomes short as the frequency of occurrence of the difference value becomes high, such as a Huffman code or an arithmetic code. The quantization table and the coding table are stored in advance in the memory included in the spatial information encoding unit 20.
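The coding tables in the text are fixed and stored in advance; how they were designed is not stated. To illustrate why a variable-length code shortens the output when some difference values are common, the sketch below builds a Huffman table from observed difference values using Python's `heapq`. Building the table from data is an assumption for demonstration only; an actual encoder and decoder would share the same predefined table.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_table(symbols):
    """Map each difference value to a prefix-free bit string (illustrative only)."""
    freq = Counter(symbols)
    if len(freq) == 1:
        (only,) = freq
        return {only: "0"}
    tie = count()  # tie-breaker so heap entries never compare dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing 0 and 1.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encode_differences(diffs, table):
    """Concatenate the codes for a sequence of difference values."""
    return "".join(table[d] for d in diffs)
```

For the differences `[0, 0, 0, 0, 1, 1, -1]`, the most frequent value 0 receives a 1-bit code and the whole sequence encodes to 10 bits, compared with 14 bits for a fixed 2-bit code.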
The spatial information encoding unit 20 generates an MPS code using the similarity code idxicci(k), the intensity difference code idxcldj(k), and the channel prediction coefficient code idxcm(k). For example, the spatial information encoding unit 20 generates the MPS code by arranging the similarity code idxicci(k), the intensity difference code idxcldj(k), and the channel prediction coefficient code idxcm(k) in a certain order. The certain order is described, for example, in ISO/IEC 23003-1:2007. The spatial information encoding unit 20 outputs the generated MPS code to the multiplexing unit 21.
The multiplexing unit 21 multiplexes the AAC code, the SBR code, the MPS code, and the selection information by arranging them in a certain order. The multiplexing unit 21 then outputs the encoded audio signal generated by the multiplexing.
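The text leaves the multiplexing order unspecified ("a certain order"). The sketch below assumes a hypothetical length-prefixed layout with a fixed field order, so that a separation step can reverse it exactly. An actual bitstream would follow the syntax of the standards cited in this description (for example, ISO/IEC 14496-3) rather than this format; the field names and the 4-byte length prefix are illustrative assumptions.

```python
import struct

# Hypothetical field order; the text says only "a certain order".
FIELD_ORDER = ("aac", "sbr", "mps", "selection")

def multiplex(fields):
    """Concatenate each payload, preceded by its 4-byte big-endian length."""
    out = bytearray()
    for name in FIELD_ORDER:
        payload = fields[name]
        out += struct.pack(">I", len(payload))
        out += payload
    return bytes(out)

def demultiplex(stream):
    """Inverse of multiplex(): recover each payload in FIELD_ORDER."""
    fields = {}
    offset = 0
    for name in FIELD_ORDER:
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        fields[name] = stream[offset:offset + length]
        offset += length
    return fields
```

Fixing both the order and the field lengths in the stream is what lets the decoder-side separation unit split the codes without any side channel.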
The time-frequency transform unit 11 transforms the signal of each channel into a frequency signal (step S801). The time-frequency transform unit 11 then outputs the frequency signal of each channel to the first downmixing unit 12.
Next, the first downmixing unit 12 downmixes the frequency signal of each channel to generate frequency signals L0(k, n), R0(k, n), and C0(k, n) of three channels, namely the left, right, and center channels. The first downmixing unit 12 also calculates spatial information regarding the left, right, and center channels (step S802). The first downmixing unit 12 outputs the frequency signals of the three channels to the calculation unit 13 and the second downmixing unit 14.
The calculation unit 13 receives the frequency signals of the three channels, namely the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel, from the first downmixing unit 12. The calculation unit 13 then calculates the first phases from the left frequency signal L0(k, n) and the right frequency signal R0(k, n) using expression (10) (step S803), and outputs the first phases to the predictive coding unit 15. In step S803, the calculation unit 13 also calculates the second phases when needed and outputs them to the predictive coding unit 15.
The predictive coding unit 15 receives the first phases from the calculation unit 13 and, when needed, also receives the second phases. The predictive coding unit 15 performs the first predictive coding or the second predictive coding on the basis of the first phases (step S804). More specifically, when the first phases are neither identical phases nor opposite phases, the predictive coding unit 15 performs the first predictive coding. When the first phases are identical phases or opposite phases, the predictive coding unit 15 performs the second predictive coding. When the second phases have been received from the calculation unit 13, the predictive coding unit 15 compares the first phases with the second phases. When both the first phases and the second phases are identical phases or opposite phases, the predictive coding unit 15 may perform the predictive coding on the signal C0(k, n) of the center channel on the basis of the right frequency signal R0(k, n) or the left frequency signal L0(k, n) using expression (14) or (15).
Next, the predictive coding unit 15 generates selection information indicating whether the first predictive coding or the second predictive coding has been performed, and outputs the selection information to the second downmixing unit 14 and the multiplexing unit 21 (step S805). In step S805, when the selection information indicates that the second predictive coding has been performed, the predictive coding unit 15 adds to the selection information an indication of which of the left frequency signal L0(k, n) and the right frequency signal R0(k, n) has been used in the predictive coding. When the predictive coding unit 15 has performed the predictive coding using expression (14) or (15), it may add to the selection information an indication that the first predictive coding has been performed. In addition, in step S805, the predictive coding unit 15 outputs the channel prediction coefficient code obtained in the first predictive coding or the second predictive coding to the spatial information encoding unit 20.
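Expression (10) is not reproduced in this section. As an illustration only, the sketch below assumes that the first phase for a band is classified from the cosine of the angle between the vectors of L0(k, n) and R0(k, n) taken over the time slots n (the third embodiment refers to such a cosine). The tolerance `tol` used to decide "identical" or "opposite" is hypothetical. The sketch then applies the selection rule of step S804.

```python
import math

def first_phase(left, right, tol=1e-6):
    """Classify the phase relation of two complex vectors for one band.

    cos(theta) = Re(sum L * conj(R)) / (||L|| * ||R||); tol is hypothetical.
    """
    dot = sum((l * r.conjugate()).real for l, r in zip(left, right))
    norm_l = math.sqrt(sum(abs(l) ** 2 for l in left))
    norm_r = math.sqrt(sum(abs(r) ** 2 for r in right))
    if norm_l == 0.0 or norm_r == 0.0:
        raise ValueError("phase is undefined for an all-zero vector")
    cos_theta = dot / (norm_l * norm_r)
    if cos_theta >= 1.0 - tol:
        return "identical"
    if cos_theta <= -1.0 + tol:
        return "opposite"
    return "other"

def choose_predictive_coding(left, right):
    """Step S804 rule: second predictive coding only for identical/opposite phases."""
    if first_phase(left, right) in ("identical", "opposite"):
        return "second"
    return "first"
```

When L0 and R0 are in identical or opposite phase, one of them is a scaled copy of the other, which is why the second predictive coding can drop one of the pair and keep C0(k, n) instead.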
The second downmixing unit 14 receives the selection information from the predictive coding unit 15. The second downmixing unit 14 downmixes the frequency signals of the three channels on the basis of the selection information to generate stereophonic frequency signals. The second downmixing unit 14 then outputs the stereophonic frequency signals to the channel signal encoding unit 16 (step S806). More specifically, when the selection information includes the information indicating that the first predictive coding has been performed, the second downmixing unit 14 outputs the left frequency signal L0(k, n) and the right frequency signal R0(k, n) to the channel signal encoding unit 16. When the selection information includes the information indicating that the second predictive coding has been performed, the second downmixing unit 14 outputs the signal C0(k, n) of the center channel and either the left frequency signal L0(k, n) or the right frequency signal R0(k, n) to the channel signal encoding unit 16.
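The branching in step S806 can be sketched as follows. `SelectionInfo` and its fields are hypothetical stand-ins for the selection information. The sketch assumes that, for the second predictive coding, the signal named in the selection information (L0 or R0) is the one output together with C0(k, n).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SelectionInfo:
    """Hypothetical in-memory form of the selection information."""
    mode: str                        # "first" or "second" predictive coding
    reference: Optional[str] = None  # "L" or "R"; used only with "second"

def second_downmix_output(l0, r0, c0, info):
    """Return the two signals passed to the channel signal encoding unit."""
    if info.mode == "first":
        return l0, r0
    if info.mode == "second":
        if info.reference == "L":
            return c0, l0
        if info.reference == "R":
            return c0, r0
        raise ValueError("second predictive coding needs reference 'L' or 'R'")
    raise ValueError(f"unknown mode: {info.mode!r}")
```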
The spatial information encoding unit 20 generates an MPS code from the spatial information to be encoded received from the first downmixing unit 12 and the channel prediction coefficient code received from the predictive coding unit 15 (step S807). The spatial information encoding unit 20 then outputs the MPS code to the multiplexing unit 21.
The channel signal encoding unit 16 performs SBR encoding on the high-frequency component of the received stereophonic frequency signal of each channel. In addition, the channel signal encoding unit 16 performs AAC encoding on the low-frequency component, which is not subjected to the SBR encoding, of the received stereophonic frequency signal of each channel (step S808). The channel signal encoding unit 16 outputs, to the multiplexing unit 21, the AAC code and an SBR code that includes information indicating the positional relationship between the low-frequency component used for the replication and the corresponding high-frequency component.
Finally, the multiplexing unit 21 multiplexes the generated SBR code, AAC code, MPS code, and selection information to generate an encoded audio signal (step S809). The multiplexing unit 21 outputs the encoded audio signal, and the audio encoding device 1 then ends the encoding process.
The audio encoding device 1 may perform the processing in step S807 and the processing in step S808 in parallel with each other. Alternatively, the audio encoding device 1 may perform the processing in step S808 before performing the processing in step S807.
The control unit 901 is a central processing unit (CPU) that controls other components and that calculates and processes data in a computer. The control unit 901 is an arithmetic device that executes programs stored in the main storage unit 902 and the auxiliary storage unit 903. The control unit 901 receives data from the input unit 907 or a storage device, calculates or processes the data, and outputs the data to the display unit 908 or the storage device.
The main storage unit 902 is a read-only memory (ROM), a random-access memory (RAM), or the like, and is a storage device that stores or temporarily saves programs and data such as an operating system (OS), which is basic software, and application software executed by the control unit 901.
The auxiliary storage unit 903 is a hard disk drive (HDD), and is a storage device that stores data relating to the application software and the like.
The drive unit 904 reads a program from a recording medium 905, such as a flexible disk, and installs the program in the auxiliary storage unit 903.
The recording medium 905 stores a certain program, and the certain program stored in the recording medium 905 is installed in the audio encoding device 1 through the drive unit 904. The installed certain program may be executed by the audio encoding device 1.
The network interface unit 906 is an interface between the audio encoding device 1 and a peripheral device that has a communication function and is connected through a network, such as a local area network (LAN) or a wide area network (WAN), built on wired and/or wireless data transmission paths.
The input unit 907 includes a cursor key, a keyboard including numeric keys and various function keys, and a mouse, a touchpad, or the like for selecting a key on a display screen of the display unit 908. The input unit 907 is a user interface for the user to provide an operation instruction and input data to the control unit 901.
The display unit 908 includes a cathode ray tube (CRT) or a liquid crystal display (LCD), and displays display data input from the control unit 901.
The above-described audio encoding process may be realized as a program to be executed by a computer. The process may be realized by installing this program from a server or the like and causing the computer to execute the program.
The program may be recorded on the recording medium 905, and the recording medium 905 on which the program is recorded may be read by a computer or a mobile terminal in order to realize the above-described audio encoding process. The recording medium 905 may be one of various types of recording media, including recording media that optically, electrically, or magnetically record information, such as a compact disc read-only memory (CD-ROM), a flexible disk, and a magneto-optical disk, and semiconductor memories that electrically record information, such as a ROM and a flash memory.
As may be seen from
(Second Embodiment) When the second predictive coding is to be performed, the predictive coding unit 15 illustrated in
where c2(k)=0.
In this case, the predictive coding unit 15 selects the value of the channel prediction coefficient c1(k) that minimizes the error d(k), and sets the channel prediction coefficient c2(k) to 0. The same method may be used when the predictive coding is performed on the left frequency signal L0(k, n), or when the first phases and the second phases are identical phases or opposite phases and the predictive coding is performed on the signal C0(k, n) of the center channel, so a detailed description of these cases is omitted.
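A minimal sketch of the coefficient search described above. It assumes that the error d(k) is the summed squared magnitude of the prediction residual over the time slots of band k, and that c1(k) is chosen from a hypothetical grid of candidates (here, −2.0 to 3.0 in steps of 0.1); the actual error expression and candidate table are in the expressions omitted from this section. The `target` and `reference` arguments stand for whichever signals are being predicted and used for prediction.

```python
def search_c1(target, reference, candidates=None):
    """Choose c1(k) minimizing d(k) with c2(k) fixed to 0.

    d(k) = sum_n |target(k, n) - c1 * reference(k, n)|^2 (assumed form).
    """
    if candidates is None:
        # Hypothetical candidate grid: -2.0 to 3.0 in steps of 0.1.
        candidates = [i / 10 for i in range(-20, 31)]

    def error(c1):
        return sum(abs(t - c1 * r) ** 2 for t, r in zip(target, reference))

    best_c1 = min(candidates, key=error)
    return best_c1, 0.0, error(best_c1)  # (c1(k), c2(k), d(k))
```

Because only c1(k) is searched, the search cost is linear in the number of candidates rather than quadratic, as it would be for a joint search over c1(k) and c2(k).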
(Third Embodiment) Although the angle θ1 between the vector of the left frequency signal L0(k, n) and the vector of the right frequency signal R0(k, n) is 180° (that is, cosθ1 = −1) and the first phases are opposite phases in
This is because, as illustrated in
According to yet another embodiment, the channel signal encoding unit of the audio encoding device may instead encode the stereophonic frequency signals using another encoding method. For example, the channel signal encoding unit may encode all the frequency signals using the AAC encoding method. In this case, the SBR encoding section 17 is omitted in the audio encoding device 1 illustrated in
Multichannel audio signals to be encoded are not limited to 5.1ch audio signals. For example, the audio signals to be encoded may be audio signals of a plurality of channels, such as 3ch, 3.1ch, or 7.1ch. In this case, too, the audio encoding device calculates a frequency signal of each channel by performing the time-frequency transform on the audio signal of each channel. The audio encoding device then downmixes the frequency signals of the channels to generate frequency signals of fewer channels than the original audio signals.
A computer program for causing a computer to realize the function of each component included in the audio encoding device according to each of the above embodiments may be provided stored on a recording medium such as a semiconductor memory, a magnetic recording medium, or an optical recording medium.
The audio encoding device according to each of the above embodiments may be mounted on various apparatuses used to transmit or record audio signals, such as a computer, a video signal recording apparatus, and a video transmission apparatus.
(Fourth Embodiment)
These components included in the audio decoding device 100 are formed as separate circuits. Alternatively, these components included in the audio decoding device 100 may be mounted on the audio decoding device 100 as a single integrated circuit in which circuits corresponding thereto are integrated with one another. Alternatively, these components included in the audio decoding device 100 may be function modules realized by a computer program executed by a processor included in the audio decoding device 100.
The separation unit 101 receives a multiplexed encoded audio signal from an external source. The separation unit 101 separates the selection information, the AAC code, the SBR code, and the MPS code contained in the encoded audio signal from one another. The AAC code and the SBR code may be referred to as channel encoded signals, and the MPS code may be referred to as encoded spatial information. As a separation method, the method described in ISO/IEC 14496-3 may be used. The separation unit 101 outputs the separated MPS code to the spatial information decoding unit 106, the AAC code to the AAC decoding section 103, the SBR code to the SBR decoding section 105, and the selection information to the determination section 109.
The spatial information decoding unit 106 receives the MPS code from the separation unit 101. The spatial information decoding unit 106 decodes the MPS code using an example of the quantization table for the degrees of similarity illustrated in
The AAC decoding section 103 receives the AAC code from the separation unit 101, decodes the low-frequency component of the signal of each channel using an AAC decoding method, and outputs the resultant signals to the time-frequency transform section 104. The AAC decoding method may be, for example, the method described in ISO/IEC 13818-7.
The time-frequency transform section 104 transforms the signal of each channel, which is a time signal decoded by the MC decoding section 103, into a frequency signal using, for example, the QMF bank described in ISO/IEC 14496-3, and outputs the frequency signal to the SBR decoding section 105. Alternatively, the time-frequency transform section 104 may perform the time-frequency transform using a complex QMF bank represented by the following expression:
Here, QMF(k, n) is a complex QMF having the time n and the frequency k as variables.
The SBR decoding section 105 decodes the high-frequency component of the signal of each channel using an SBR decoding method. The SBR decoding method may be, for example, the method described in ISO/IEC 14496-3.
The channel signal decoding unit 102 outputs the stereophonic frequency signal of each channel decoded by the AAC decoding section 103 and the SBR decoding section 105 to the predictive decoding unit 107.
The predictive decoding unit 107 performs predictive decoding on the left frequency signal L0(k, n), the right frequency signal R0(k, n), or the signal C0(k, n) of the center channel that has been subjected to the predictive coding, on the basis of the channel prediction coefficients received from the spatial information decoding unit 106 and the stereophonic frequency signals received from the channel signal decoding unit 102. For example, when the predictive decoding unit 107 is to perform the predictive decoding on the signal C0(k, n) of the center channel using the stereophonic frequency signals, namely the left frequency signal L0(k, n) and the right frequency signal R0(k, n), and the channel prediction coefficients c1(k) and c2(k), the predictive decoding may be performed using the following expression:
$$C_0(k, n) = c_1(k) \cdot L_0(k, n) + c_2(k) \cdot R_0(k, n) \tag{20}$$
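Expression (20) is applied independently in each frequency band k. A minimal sketch, assuming the frequency signals are stored as nested lists indexed `[k][n]` and the channel prediction coefficients as per-band lists (a layout chosen here for illustration):

```python
def predictive_decode(l0, r0, c1, c2):
    """Expression (20): C0(k, n) = c1(k) * L0(k, n) + c2(k) * R0(k, n).

    l0[k][n], r0[k][n]: decoded frequency signals; c1[k], c2[k]: per-band coefficients.
    """
    if not (len(l0) == len(r0) == len(c1) == len(c2)):
        raise ValueError("signals and coefficients must cover the same bands")
    return [
        [c1[k] * l + c2[k] * r for l, r in zip(l0[k], r0[k])]
        for k in range(len(l0))
    ]
```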
The predictive decoding unit 107 may simply perform the predictive decoding using the channel prediction coefficients received from the spatial information decoding unit 106 and the stereophonic frequency signals received from the channel signal decoding unit 102, without having to recognize which of the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel has been predictively decoded. This is because the determination section 109, described later, can identify that signal on the basis of the selection information.
The determination section 109 determines, on the basis of the selection information received from the separation unit 101, which of the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel are the stereophonic frequency signals and which is the signal that has been subjected to the predictive decoding. The determination section 109 then outputs the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel to the transform section 110 in a certain arrangement. The certain arrangement is an arrangement in which, for example, the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel are arranged in this order from the top as illustrated in
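The rearrangement performed by the determination section can be sketched as the inverse of the encoder-side selection. `SelectionInfo` is a hypothetical stand-in for the selection information, and the sketch assumes that, for the second predictive coding, the decoded stereophonic pair is (C0, signal named in the selection information) and the predictively decoded signal is the other of L0 and R0.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SelectionInfo:
    """Hypothetical in-memory form of the selection information."""
    mode: str                        # "first" or "second" predictive coding
    reference: Optional[str] = None  # "L" or "R"; used only with "second"

def arrange_lrc(stereo, predicted, info):
    """Return (L0, R0, C0) in the fixed arrangement expected by the transform section."""
    first, second = stereo
    if info.mode == "first":
        return first, second, predicted  # stereo = (L0, R0), predicted = C0
    if info.mode == "second":
        c0, ref = first, second          # stereo = (C0, L0 or R0)
        if info.reference == "L":
            return ref, predicted, c0
        if info.reference == "R":
            return predicted, ref, c0
        raise ValueError("second predictive coding needs reference 'L' or 'R'")
    raise ValueError(f"unknown mode: {info.mode!r}")
```

Because the arrangement is always (L0, R0, C0), the subsequent matrix transform does not need to know which predictive coding was used.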
The transform section 110 performs a matrix transform on the left frequency signal L0(k, n), the right frequency signal R0(k, n), and the signal C0(k, n) of the center channel received from the determination section 109 in the certain arrangement using the following expression:
Here, Lout(k, n), Rout(k, n), and Cout(k, n) denote the frequency signals of the left channel, the right channel, and the center channel, respectively. The matrix transform unit 108 outputs the frequency signal Lout(k, n) of the left channel, the frequency signal Rout(k, n) of the right channel, and the frequency signal Cout(k, n) of the center channel subjected to the matrix transform in the transform section 110 to the upmixing unit 111.
The upmixing unit 111 upmixes the frequency signal Lout(k, n) of the left channel, the frequency signal Rout(k, n) of the right channel, and the frequency signal Cout(k, n) of the center channel received from the matrix transform unit 108 on the basis of the spatial information received from the spatial information decoding unit 106, in order to generate, for example, 5.1ch audio signals. The upmixing method may be, for example, the method described in ISO/IEC 23003-1.
The frequency-time transform unit 112 transforms each signal received from the upmixing unit 111 from the frequency signal to a time signal using a QMF bank represented by the following expression:
Thus, the audio decoding device disclosed in the fourth embodiment can accurately decode an audio signal that has been subjected to predictive coding in which the prediction error has been suppressed.
(Fifth Embodiment)
In the above embodiments, the components of each device illustrated in the drawings do not have to be physically configured as illustrated. That is, specific modes of separating and integrating each device are not limited to those illustrated in the drawings, and the entirety or a part of each device may be functionally or physically separated or integrated in arbitrary units in accordance with various loads and usage conditions.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority or inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2012-147500 | Jun 2012 | JP | national |