The present invention relates to an audio signal processing method and apparatus for encoding or decoding an audio signal.
In general, an audio signal includes signals having various frequencies. The audible frequency range of the human ear is 20 Hz to 20 kHz and human voice is generally in a range of about 200 Hz to 3 kHz.
In encoding of an audio signal having a high frequency band of 7 kHz or more in which human voice is not present, one of a plurality of coding modes or coding schemes is applicable according to audio properties.
If a coding mode or coding scheme which is not suitable for audio properties is applied, sound quality may be deteriorated.
An object of the present invention is to provide an audio signal processing method and apparatus for separately encoding pulses of a signal having high energy in a specific frequency band, such as percussion sound.
Another object of the present invention is to provide an audio signal processing method and apparatus for separately encoding harmonic tracks of a signal having harmonics, such as a string sound.
Another object of the present invention is to provide an audio signal processing method and apparatus for applying a coding mode suitable for audio properties based on a pulse ratio and/or a harmonic ratio.
The present invention provides the following effects and advantages.
First, in the signal having high energy in the specific frequency band, only pulses of the specific frequency band of the signal are separately encoded. Thus, a restoration ratio is higher than that of an encoding mode (generic mode) using only a low frequency band and thus sound quality can be remarkably improved.
Second, in a signal including harmonics, pulses corresponding to harmonics are not respectively encoded, but an overall harmonic track is encoded. Thus, it is possible to increase a restoration ratio without increasing the number of bits.
Third, by adaptively applying one of encoding and decoding schemes corresponding to a total of four modes according to audio properties of frames, it is possible to improve sound quality.
Fourth, in case of applying modified discrete cosine transform (MDCT), since a main pulse and sub pulse adjacent thereto are extracted in the light of the MDCT properties so as to accurately extract a pulse mapped to a specific frequency band, it is possible to increase performance of a non-generic-mode encoding scheme.
Fifth, by extracting and separately quantizing only a best pulse and pulses adjacent thereto from a plurality of harmonic tracks in a harmonic mode, it is possible to reduce the number of bits.
Sixth, in a harmonic mode, since a start position is set to one of a predetermined position with respect to a harmonic track belonging to one group having the same pitch, it is possible to reduce the number of bits in display of start positions of a plurality of harmonic tracks.
According to an aspect of the present invention, there is provided an audio signal processing method including performing frequency conversion with respect to an audio signal so as to acquire a plurality of frequency-converted coefficients, selecting one of a generic mode and a non-generic mode based on a pulse ratio with respect to frequency-converted coefficients of a high frequency band among the plurality of frequency-converted coefficients, and, if the non-generic mode is selected, performing the following steps of extracting a predetermined number of pulses from the frequency-converted coefficients of the high frequency band and generating pulse information, generating an original noise signal excluding the pulses from the frequency-converted coefficients of the high frequency band, generating a reference noise signal using frequency-converted coefficients of a low frequency band among the plurality of frequency-converted coefficients, and generating noise position information and noise energy information using the original noise signal and the reference noise signal.
The pulse ratio may be a ratio of energy of a plurality of pulses to total energy of a current frame.
The extracting the predetermined number of pulses may include extracting a main pulse highest energy, extracting sub pulse adjacent to the main pulse, and excluding the main pulse and the sub pulse from the frequency-converted coefficients of the high frequency band so as to generate a target noise signal, and the extraction of the main pulse and the sub pulse is repeated predetermined times in order to generate the target noise signal.
The pulse information may include at least one of pulse position information, pulse sign information, pulse amplitude information and pulse subband information.
The generating the reference noise signal may include setting a threshold based on total energy of a low frequency band, and excluding pulses exceeding the threshold so as to generate the reference noise signal.
The generating the noise energy information may include generating energy of the predetermined number of pulses, generating energy of the original noise signal, acquiring a pulse ratio using the energy of the pulses and the energy of the original noise signal, and generating the pulse ratio as the noise energy information.
According to another aspect of the present invention, there is provided an audio signal processing apparatus including a frequency conversion unit configured to perform frequency conversion with respect to an audio signal so as to acquire a plurality of frequency-converted coefficients, a pulse ratio determination unit configured to select one of a generic mode and a non-generic mode based on a pulse ratio with respect to frequency-converted coefficients of a high frequency band among the plurality of frequency-converted coefficients, and a non-generic-mode encoding unit configured to operate in the non-generic mode and including a pulse extractor configured to extract a predetermined number of pulses from the frequency-converted coefficients of the high frequency band and to generate pulse information, a reference noise generator configured to generate a reference noise signal using frequency-converted coefficients of a low frequency band among the plurality of frequency-converted coefficients, and a noise search unit configured to generate noise position information and noise energy information using an original noise signal and the reference noise signal, wherein the original noise signal is generated by excluding the pulses from the frequency-converted coefficients of the high frequency band.
According to another aspect of the present invention, there is provided an audio signal processing method including receiving second mode information indicating whether a current frame is in a generic mode or a non-generic mode, receiving pulse information, noise position information and noise energy information if the second mode information indicates that the current frame is in the non-generic mode, generating a predetermined number of pulses with respect to frequency-converted coefficients using the pulse information, generating a reference noise signal using frequency-converted coefficients of a low frequency band corresponding to the noise position information, adjusting energy of the reference noise signal using the noise energy information, and generating frequency-converted coefficients corresponding to a high frequency band using the reference noise signal, the energy of which is adjusted, and the plurality of pulses.
According to another aspect of the present invention, there is provided an audio signal processing method including receiving an audio signal, performing frequency conversion with respect to the audio signal so as to acquire a plurality of frequency-converted coefficients, selecting one of a non-harmonic mode and a harmonic mode based on a harmonic ratio with respect to the frequency-converted coefficients, and, if the harmonic mode is selected, performing the following steps of deciding harmonic tracks of a first group corresponding to a first pitch, deciding harmonic tracks of a second group corresponding to a second pitch, and generating start position information of the plurality of harmonic tracks, wherein the harmonic tracks of the first group include a first harmonic track and a second harmonic track, wherein the harmonic tracks of the second group include a third harmonic track and a fourth harmonic track, wherein start position information of the first harmonic track and the third harmonic track corresponds to one of a first position set, and wherein start position information of the second harmonic track and the fourth harmonic track corresponds to one of a second position set.
The harmonic ratio may be generated based on energy of the plurality of harmonic tracks and energy of the plurality of pulses.
The first position set may correspond to even number positions and the second position set may correspond to odd number positions.
The audio signal processing method may further include generating a first target vector including a best pulse and pulses adjacent thereto in the first harmonic track and a best pulse and pulses adjacent thereto in the second harmonic track, generating a second target vector including a best pulse and pulses adjacent thereto in the third harmonic track and a best pulse and pulses adjacent thereto in the fourth harmonic track, vector-quantizing the first target vector and the second target vector, and performing frequency conversion with respect to a residual part excluding the first target vector and the second target vector from the harmonic tracks.
The first harmonic track may be a set of a plurality of pulses having a first pitch, the second harmonic track may be a set of a plurality of pulses having a first pitch, the third harmonic track may be a set of a plurality of pulses having a second pitch, and the fourth harmonic track may be a set of a plurality of pulses having a second pitch.
The audio signal processing method may further include generating pitch information indicating the first pitch and the second pitch.
According to another aspect of the present invention, there is provided an audio signal processing method including receiving start position information of a plurality of harmonic tracks including harmonic tracks of a first group corresponding to a first pitch and harmonic tracks of a second group corresponding to a second pitch, generating a plurality of harmonic tracks corresponding to the start position information, and generating an audio signal corresponding to a current frame using the plurality of harmonic tracks, wherein the harmonic tracks of the first group include a first harmonic track and a second harmonic track, wherein the harmonic tracks of the second group include a third harmonic track and a fourth harmonic track, wherein start position information of the first harmonic track and the third harmonic track corresponds to one of a first position set, and wherein start position information of the second harmonic track and the fourth harmonic track corresponds to one of a second position set.
According to an aspect of the present invention, there is provided an audio signal processing method including performing frequency conversion with respect to an audio signal so as to acquire a plurality of frequency-converted coefficients, selecting a non-tonal mode and a tonal mode based on inter-frame similarity with respect to the frequency-converted coefficients, selecting one of a generic mode and a non-generic mode based on a pulse ratio if the non-tonal mode is selected, selecting one of a non-harmonic mode and a harmonic mode based on a harmonic ratio if the tonal mode is selected, and encoding the audio signal according to the selected mode so as to generate a parameter, wherein the parameter includes envelope position information and scaling information in the generic mode, wherein the parameter includes pulse information and noise energy information in the non-generic mode, wherein the parameter includes fixed pulse information which is information about fixed pulses, the number of which is predetermined per subband, in the non-harmonic mode, and wherein the parameter includes position information of harmonic tracks of a first group and position information of harmonic tracks of a second group in the harmonic mode.
The audio signal processing method may further include generating first mode information and second mode information according to the selected mode, the first mode information may indicate one of the non-tonal mode and the tonal mode, and the second mode information may indicate one of the generic mode or the non-generic mode if the first mode information indicates the non-tonal mode and indicate one of the non-harmonic mode and the harmonic mode if the first mode information indicates the tonal mode.
According to another aspect of the present invention, there is provided an audio signal processing method including extracting first mode information and second mode information through a bitstream, deciding a current mode corresponding to a current frame based on the first mode information and the second mode information, restoring an audio signal of the current frame using envelope position information and scaling information if the current mode is a generic mode, restoring the audio signal of the current frame using pulse information and noise energy information if the current mode is a non-generic mode, restoring the audio signal of the current frame using fixed pulse information which is information about fixed pulses, the number of which is predetermined per subband, if the current mode is a non-harmonic mode, and restoring the audio signal of the current frame using position information of harmonic tracks of a first group and position information of harmonic tracks of a second group if the current mode is a harmonic mode.
Hereinafter, the exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. The terms used in the present specification and claims are not limited to general meanings thereof and are construed as meanings and concepts suiting the technical spirit of the present invention based on the rule of appropriately defining the concepts of the terms in order to illustrate the invention in the best way possible. The embodiments described in the present specification and the configurations shown in the drawings are merely exemplary and various modifications and equivalents thereof may be made.
In the present invention, the following terms may be construed based on the following criteria and the terms which are not used herein may be construed based on the following criteria. The term coding may be construed as encoding or decoding and the term information includes values, parameters, coefficients, elements, etc. and the meanings thereof may be differently construed according to circumstances and the present invention is not limited thereto.
The term audio signal is differentiated from the term video signal in a broad sense and refers to a signal which is audibly identified upon playback and is differentiated from a speech signal in a narrow sense and refers to a signal in which a speech property is not present or is few. In the present invention, the audio signal is construed in a broad sense and is construed as an audio signal having a narrow sense when used to be differentiated from the speech signal.
The term coding may refer to only encoding or may include encoding and decoding.
In summary, there is a total of four coding modes: 1) a generic mode, 2) a non-generic mode, 3) a non-harmonic mode and 4) a harmonic mode. 1) The generic mode and 2) the non-generic mode correspond to a non-tonal mode and 3) the non-harmonic mode and 4) the harmonic mode correspond to a tonal mode.
A determination as to whether the non-tonal mode or the tonal mode is applied is made by the similarity determination unit 120 according to inter-frame similarity. That is, if similarity is not high, the non-tonal mode is applied and, if similarity is high, the tonal mode is applied. In case of the non-tonal mode, the pulse ratio determination unit 130 determines that 1) the generic mode is applied if a pulse ratio (a ratio of energy of a pulse to total energy) is high and determines that 2) the non-generic mode is applied if the pulse ratio is low.
In addition, in the tonal mode, the harmonic ratio determination unit 160 determines that 3) the non-harmonic mode is applied if a harmonic ratio (a ratio of energy of a harmonic track to energy of a pulse) is not high and that 4) the harmonic mode is applied if the harmonic ratio is high.
The frequency conversion unit 110 performs frequency conversion with respect to an input audio signal so as to acquire a plurality of frequency-converted coefficients. A Modified Discrete Cosine Transform (MDCT) method, a Fast Fourier Transform (FFT) method, etc. may be applied for frequency conversion, but the present invention is not limited thereto.
The frequency-converted coefficients include frequency-converted coefficients corresponding to a relatively low frequency band and frequency-converted coefficients corresponding to a high frequency band. The frequency-converted coefficient of the low frequency band is referred to as a wide band signal, a WB signal or a WB coefficient and the frequency-converted coefficient of the high frequency band is referred to as a super wide band signal, a SWB signal or a WB coefficient. A criterion for dividing the low frequency band and the high frequency band may be about 7 kHz, but the present invention is not limited to a specific frequency.
If the MDCT method is used as the frequency conversion method, a total of 640 frequency-converted coefficients may be generated with respect to an entire audio signal. At this time, about 280 coefficients corresponding to a lowest band may be referred to as a WB signal and about 280 coefficients corresponding to a next band may be referred to as an SWB signal. However, the present invention is not limited thereto.
The similarity determination unit 120 determines inter-frame similarity with respect to an input audio signal. Inter-frame similarity relates to how much the spectrum of the frequency-converted coefficients of a current frame is similar to that of the frequency-converted coefficients of a previous frame. Inter-frame similarity may be referred to as tonality. The description of an equation for inter-frame similarity will be omitted.
As the result of determining inter-frame similarity via the similarity determination unit 120, a low-similarity signal is similar to noise and corresponds to a non-tonal mode and a high-similarity signal is different from noise and corresponds to a tonal mode. First mode information indicating whether a frame corresponds to a non-tonal mode or a tonal mode is generated and sent to a decoder.
If it is determined that the frame corresponds to the non-tonal mode (e.g., if the first mode information is 0), the frequency-converted coefficients of the high frequency band are sent to the pulse ratio determination unit 130 and, if it is determined that the frame corresponds to the tonal mode (e.g., if the first mode information is 1), the coefficients are sent to the harmonic ratio determination unit 160.
Referring to
The pulse ratio determination unit 130 determines a generic mode or a non-generic mode based on a ratio of energy of a plurality of pulses to total energy of a current frame. The term pulse refers to a coefficient having relatively high energy in a domain (e.g., an MDCT domain) of a frequency-converted coefficient.
Since a process of extracting pulses having high energy from a domain of a frequency-converted coefficient by the pulse ratio determination unit 130 may be equal to a pulse extraction process performed when a coding method of a non-generic mode is applied, the detailed configuration of the non-generic-mode encoding unit 150 will be described below.
If a total of eight pulses is extracted, this may be expressed as follows.
P(j)=max({M32(k+280)}2),j=0, . . . ,7k=280, . . . ,560 [Equation 1]
where, M32(k) are an SWB coefficient (a frequency-converted coefficient of a high frequency band), k is an index of a frequency-converted coefficient, P(j) is a pulse (or a peak), and j is a pulse index.
The pulse ratio may be expressed by the following equation.
where, RpeakS is a pulse ratio, Epeak is the total energy of a pulse, and Etotal is total energy.
If the pulse ratio does not exceed a specific reference value (e.g., 0.6) after the pulse ratio RpeakS is estimated, the signal is determined as the generic mode and, if the pulse ratio exceeds the reference value, the signal is determined as the non-generic mode.
Referring to
The detailed configurations of the harmonic ratio determination unit 160, the non-harmonic-mode encoding unit 170 and the harmonic-mode encoding unit 180 will be described with reference to other drawings.
First, referring to
The normalization unit 142 normalizes the envelope of the WB signal in a logarithmic domain. Since the WB signal should be confirmed even by a decoder, the WB signal is preferably a signal restored using the encoded WB signal. Since the envelope of the WB signal is rapidly changed, quantization of two scaling factors cannot be accurately performed and thus a normalization process in the logarithmic domain may be necessary.
The subband generator 144 divides the SWB signal into a plurality (e.g., four) of subbands. For example, if the total number of frequency-converted coefficients of the SWB signal is 280, the subbands may have 40, 70, 70 and 100 coefficients, respectively.
The search unit 146 searches the normalized envelope of the WB signal so as to calculate similarity with each subband of the SWB signal and determines a best similar WB signal having an envelope section similar to each subband based on the similarity. A start position of the best similar WB signal is generated as envelope position information.
Then, the search unit 146 may determine two pieces of scaling information in order to make the best similar WB signal audibly similar to an original SWB signal. At this time, first scaling information may be determined per subband in a linear domain and may be determined per subband in the logarithmic domain.
The generic-mode encoding unit 140 encodes the SWB signal using the envelope of the WB signal and generates envelope position information and scaling information.
Referring to
As the scaling information, per-subband scaling sign information of a total of 4 bits, (a total of four pieces of) first per-subband scaling information of a total of 16 bits may be allocated and a total of four pieces of second per-subband scaling information are vector-quantized based on an 8-bit codebook and second per-subband scaling information of a total of 8 bits may be allocated. However, the present invention is not limited thereto.
Hereinafter, the encoding process in the non-generic mode will be described with reference to
The pulse extractor 152 extracts a predetermined number of pulses from the frequency-converted coefficients (SWB signal) of the high frequency band and generates pulse information (e.g., pulse position information, pulse sign information, pulse amplitude information, etc.). This pulse is similar to the pulse defined in the above-described pulse ratio determination unit 130. Hereinafter, an embodiment of a pulse extraction process will be described in detail with reference to
First, the pulse extractor 152 divides the SWB signal into a plurality of subband signals as follows. At this time, each subband may correspond to a total of 64 frequency-converted coefficients.
M
32
0(k)=M32(k+280),k=0, . . . ,63
M
32
1(k)=M32(k+344),k=0, . . . ,63
M
32
2(k)=M32(k+408),k=0, . . . ,63
M
32
3(k)=M32(k+472),k=0, . . . ,63 [Equation 3]
M320(k) is a first subband of the SWB signal.
Then, per-subband energy is calculated as follows.
E0 is energy of the first subband.
Then, any one of subbands (j is any one of 0, 1, 2 and 3) respectively having highest energy E0, E1, E2 and E3 is selected. Referring to
Then, a pulse having highest energy in the subband is set as a main pulse. Then, between two pulses adjacent to the main pulse, that is, between left and right pulses of the main pulse, a pulse having high energy is set as a sub pulse. Referring to
In particular, a process of extracting the main pulse and the sub pulse adjacent thereto is preferable when the frequency-converted coefficients are generated through MDCT. This is because MDCT is sensitive to time shift and has phase-variant. Accordingly, since frequency resolution is not accurate, one specific frequency may not correspond to one MDCT coefficient and may correspond to two or more MDCT coefficients. Accordingly, in order to more accurately extract a pulse from an MDCT domain, only the main pulse of the MDCT is not extracted, but the sub pulse adjacent thereto is additionally extracted.
Since the sub pulse is adjacent to the left side or the right side of the main pulse, the position information of the sub pulse can be encoded using only 1 bit indicating the left side or the right side of the main pulse and the pulse can be more accurately estimated using a relatively small number of bits.
The process of extracting the main pulse and the sub pulse is logically summarized as follows. The present invention is not limited to the following expression.
The pulse extractor 152 excludes the main pulse and the sub pulse of the first set extracted from the SWB signal so as to generate a target noise signal.
Referring to
The pulse extractor 152 extracts the predetermined number of pulses as described above and then generates information about the pulses. Although the total number of pulses may be for example eight (a total of three sets of main pulses and sub pulses and a total of three separate pulses), the present invention is not limited thereto. The information about the pulses may include at least one of pulse position information, pulse sign information, pulse amplitude information and pulse subband information. The pulse subband information indicates to which subband the pulse belongs.
Accordingly, in order to encode the pulse subband information, 2 bits are necessary to express a first set, 2 bits are necessary to express a second set, 2 bits are necessary to express a third set, 2 bits are necessary to express a first separate pulse and 2 bits are necessary to express a second separate pulse. That is, a total of 10 bits is necessary.
In addition, since the pulse position information indicates in which coefficient a pulse is present in a specific subband, 6 bits are consumed for each of the first to third sets, 6 bits are consumed for the first separate pulse and 6 bits are consumed for the second separate pulse. That is, a total of 30 bits is consumed.
In the pulse sign information, 1 bit is consumed for each pulse, that is, a total of 8 bits is consumed. A total of 16 bits is allocated to the pulse amplitude information by vector-quantizing the amplitude information of four pulses using an 8-bit codebook.
Referring to
The reference noise generator 154 of
The reference noise generator 154 generates a reference noise signal {tilde over (M)}16 using the WB signal through the above process.
The noise search unit 156 of
First, the original noise signal (the signal obtained by excluding the pulses from the SWB signal) is divided into a plurality of subband signals as follows.
{tilde over (M)}
32
0(k)={tilde over (M)}32(k+280),k=0, . . . ,39
{tilde over (M)}
32
1(k)={tilde over (M)}32(k+320),k=0, . . . ,69
{tilde over (M)}
32
2(k)={tilde over (M)}32(k+390),k=0, . . . ,69
{tilde over (M)}
32
3(k)={tilde over (M)}32(k+460),k=0, . . . ,99 [Equation 5]
The size of each subband may be the same as the above-described subband in the generic mode. The length dj(k) j=0, . . . , 3 of the subband may correspond to 40, 70, 70 and 100 frequency-converted coefficients. All subbands have different search start positions kj and different search ranges wj and similarity with the reference noise signal {tilde over (M)}16 is is detected. The search start position kj is fixed to 0 in case of j=0, 2 and depends on the start position of a subband having best similarity of a previous subband in case of J=1, 3. The search start position kj and search range wj of a j-th subband may be expressed as follows.
kj is a search start position, BestIdxj is a best similarity start position, dj is the length of a subband, and wj is a search range.
If kj becomes a negative number, kj is corrected to 0 and, if kj becomes greater than 280−dj−wj, kj is corrected to 280−dj−wj. The best similarity start position BestIndxj is estimated per subband through the following process.
First, similarity corr(k′) corresponding to a similarity index k′ is calculated by the following equation. Encoding is performed using a method similar to that of the generic mode, but searching is performed in units of four samples, not in units of one sample (one coefficient).
corr(k′) is similarity, M32j(k) is original noise (see Equation 5), {tilde over (M)}16 is reference noise, kj is a search start position, k′ is a similarity index and wj is a search range.
Energy corresponding to the similarity index k′ is calculated by the following equation.
Substantial similarity S(k′) is expressed by the following equation.
The start position BestIdxj of a subband in which the substantial similarity S(k′) has a best value is calculated as follows. BestIdxj is converted into a parameter LagIndexj and is included in a bitstream as noise position information.
Up to now, the process of generating the noise position information by the noise search unit 156 was described. Hereinafter, a process of generating noise energy information will be described. The reference noise signal may have a waveform similar to that of the original noise signal, but may have energy different from that of the original noise signal. It is necessary to generate and transmit noise energy information which is information about the energy of the original noise signal to the decoder such that the decoder has a noise signal having energy similar to that of the original noise signal.
The value of the noise energy may be converted into a pulse ratio value and may be transmitted, since dynamic range is large. Since the pulse ratio is a percentage of 0% to 100%, dynamic range is small and thus the number of bits may be reduced. This conversion process will be described.
The energy of the noise signal is equal to a value obtained by excluding pulse energy from the total energy of the SWB signal as shown in the following equation.
Noiseenergy is noise energy, M32 is an SWB signal, and {circumflex over (P)}energy is pulse energy
The above equation is expressed by a pulse ratio {circumflex over (R)}percent which is a percentage as follows.
{circumflex over (R)}percent is a pulse ratio, {circumflex over (P)}energy is pulse energy, and Noiseenergy is noise energy.
That is, the encoder transmits the pulse ratio {circumflex over (R)}percent shown in Equation 11, instead of the noise energy Noiseenergy shown in Equation 10. Noise energy information corresponding to this pulse ratio may be encoded using 4 bits as shown in
Then, first, the decoder generates pulse energy
based on the pulse information generated by the pulse extractor 152. Then, the pulse energy {circumflex over (P)}energy and the transmitted pulse ratio {circumflex over (R)}percent are substituted into the following equation so as to generate noise energy Noiseenergy.
Equation 12 is obtained by rearranging Equation 11.
The decoder may convert the transmitted pulse ratio into the noise energy as described above and multiply the noise energy and each coefficient of the reference noise signal so as to acquire a noise signal having an energy distribution similar to the original noise signal using the reference noise signal.
The noise search unit 156 generates noise position information through the above process, converts a noise energy value into a pulse ratio, and transmits the pulse ratio to the decoder as the noise energy information.
That is, if the energy of a predetermined pulse is high according to the property of an audio signal, it is possible to increase sound quality without substantially increasing the number of bits by performing encoding in the non-generic mode according to the embodiment of the present invention.
Hereinafter, the harmonic ratio determination unit 150, the non-harmonic-mode encoding unit 170 and the harmonic-mode encoding unit 180 shown in
First,
The harmonic track extractor 162 extracts a harmonic track from frequency-converted coefficients corresponding to a high frequency band. This process performs the same process as the harmonic track extractor 182 of the harmonic-mode encoding unit 180 and thus will be described in detail below.
The fixed pulse extractor 164 extracts a predetermined number of pulses decided in a predetermined region (164). This process performs the same process as the fixed pulse extractor 172 of the non-harmonic-mode encoding unit 170 and thus will be described in detail below.
The harmonic ratio decision unit 166 decides a non-harmonic mode if a harmonic ratio which is a ratio of fixed pulse energy to the energy sum of the extracted tracks is low and decides a harmonic mode if the harmonic ratio is high. As described above, the non-harmonic-mode encoding unit 170 is activated in the non-harmonic mode and the harmonic-mode encoding unit 180 is activated in the harmonic mode.
First, referring to
The fixed pulse extractor 172 extracts a fixed number of fixed pulses from a fixed region as shown in
D(k)=|{umlaut over (M)}32(k)−M32(k)|,k=280, . . . ,560 [Equation 14]
where, M32(k) is an SWB signal and {umlaut over (M)}32(k) is an HF synthesis signal.
The HF synthesis signal {umlaut over (M)}32(k) is not present and thus is set to 0. In addition, a process of finding a maximum value of M32(k) is performed. D(k) is divided into 5 subbands so as to make Dj and the number of pulses of each subband has a predetermined value Nj. A process of finding Nj largest values per subband is performed as follows. The following algorithm is an alignment algorithm for finding and storing a maximum value N in a sequence input_data.
Referring to
The reason for extracting the fixed pulse, that is, the reason for extracting the predetermined number of pulses at a predetermined position, is because the number of bits corresponding to the position information of the fixed pulse is saved.
Referring to
Hereinafter, a harmonic mode encoding process will be described with reference to
Referring to
The harmonic track extractor 182 extracts a plurality of harmonic tracks from the frequency-converted coefficients corresponding to a high frequency band. More specifically, harmonic tracks (a first harmonic track and a second harmonic track) of a first group corresponding to a first pitch are extracted and harmonic tracks (a third harmonic track and a fourth harmonic track) of a second group corresponding to a second pitch are extracted. Start position information of the first harmonic track and the third harmonic track may correspond to one of the first position set (e.g., an odd number) and start position information of the second harmonic track and the fourth harmonic track may correspond to one of the second position set (e.g., an even number).
Referring to
The above-described plurality of harmonic tracks may be obtained through the following equation.
D(k)=|{umlaut over (M)}32(k)−M32(k)|,k=280, . . . ,560 [Equation 14]
where, M32 (k) is an SWB signal and {umlaut over (M)}32(k) is an HF synthesis signal.
Since the HF synthesis signal is not present, if an initial value is set to 0, a process of finding a maximum value of M32(k) is performed.
D(k) is expressed by a sum of a predetermined number (e.g., a total of four) of harmonic tracks. Each harmonic track Dj may include two or more pitch components as a maximum and two harmonic tracks Dj may be extracted from one pitch component. A process of finding the harmonic track Dj having two largest values per pitch component is as follows.
The following equation finds a pitch Pi of a harmonic track Dj including highest energy using an autocorrelation function. A pitch range may be restricted to coefficients of 20 to 27 of the frequency-converted coefficients so as to restrict the number of extracted harmonics.
The following equation is a process of calculating a start position PSi of a total of two harmonic tracks Dj including highest energy per pitch Pi so as to extract the harmonic track Dj. The range of the start positions PSi of the harmonic tracks Dj is calculated by including the number of extracted harmonics and a total of two harmonic tracks Dj is extracted by two start positions PSi per the pitch Pi according to the property of an MDCT domain signal.
The pitch Pi of the four extracted harmonic tracks Dj and the range and number of start positions PSi are shown in
The harmonic information encoding unit 184 encodes and vector-quantizes the above-described information about the harmonic tracks.
The harmonic tracks extracted in the above process have pitch Pi and the position information of the start positions PSi. The extracted pitch Pi and the start positions PSi are encoded as follows. The pitch Pi is quantized using 3 bits by restricting the number of harmonics which may be present in HF and the start positions PSi are respectively quantized using four bits. Although a total of 22 bits may be used as position information for extracting a total of four harmonic tracks by using start positions PSi of two pitches Pi, the present invention is not limited thereto.
The four harmonic tracks extracted by the above process include a maximum of 44 pulses. In order to quantize the amplitude values and sign information of the 44 pulses, many bits are necessary. Accordingly, pulses including high energy are extracted from the pulses of each harmonic track using a pulse peak extraction algorithm and the amplitude values and sign information are separately encoded as shown in the following equation.
The following algorithm is an algorithm for extracting pulse peak PPi from each harmonic track, which finds contiguous pulses including high energy, quantizes the amplitude values, and separately encodes the sign information as shown in the following equation. 3 bits are used to extract a pulse peak from each harmonic track, the amplitude values of four pulses extracted from two harmonic tracks are quantized using 8 bits, and 1 bit is allocated to sign information. The pulses extracted through the pulse peak extraction algorithm are quantized to a total of 24 bits.
The harmonic tracks excluding the 8 pulses extracted by the above process are combined to one track and the amplitude value and sign information thereof are simultaneously quantized using DCT. For DCT quantization, 19 bits are used.
A process of encoding the pulses extracted through the pulse peak extraction algorithm of the four extracted harmonic tracks and the harmonic tracks excluding the pulses is shown in
An example of information about the above-described harmonic track is shown in
The mode decision unit 210 decides a mode corresponding to a current frame, that is, a current mode, based on first mode information and second mode information received through a bitstream. The first mode information indicates one of the non-tonal mode and the tonal mode and the second mode information indicates one of a generic mode or a non-generic mode if the first mode information indicates the non-tonal mode, similarly to the above-described encoder 100.
One of four decoding units 220, 230, 240 and 250 is activated in a current frame according to the decided current mode and a parameter corresponding to each mode is extracted by the demultiplxer (not shown) according to the current mode.
If the current mode is a generic mode, envelope position information, scaling information, etc. are extracted. Then, the generic-mode decoding unit 220 extracts a section corresponding to the envelope position information, that is, an envelope of a best similar band, from frequency-converted coefficients (WB signal) of a restored low frequency band. Then, the envelope is scaled using the scaling information so as to restore a high frequency band (SWB signal) of the current frame.
If the current mode is a non-generic mode, pulse information, noise position information, noise energy information, etc. are extracted. Then, the non-generic-mode decoding unit 230 generates a plurality of pulses (e.g., a total of three sets of main pulses and sub pulses and two separate pulses) based on the pulse information. The pulse information may include pulse position information, pulse sign information and pulse amplitude information. The sign of each pulse is decided according to the pulse sign information. The amplitude and position of each pulse is decided according to the pulse amplitude information and the pulse position information. Then, a section to be used as noise in the restored WB signal is decided using the noise position information, noise energy is adjusted using the noise energy information, and the pulses are summed, thereby restoring the SWB signal of the current frame.
If the current mode is a non-harmonic mode, fixed pulse information is extracted. The non-harmonic-mode decoding unit 240 acquires a position set per subband and predetermined number of fixed pulses using the fixed pulse information. The SWB signal of the current frame is generated using the fixed pulses.
If the current mode is a harmonic mode, position information of the harmonic track, etc. is extracted. The position information of the harmonic track includes start position information of harmonic tracks of a first group having a first pitch and start position information of harmonic tracks of a second group having a second pitch. The harmonic tracks of the first group may include a first harmonic track and a second harmonic track and the harmonic tracks of the second group may include a third harmonic track and a fourth harmonic track. The start position information of the first harmonic track and the third harmonic track may correspond to one of a first position set and the start position information of the second harmonic track and the fourth harmonic track may correspond to one of a second position set.
Pitch information indicating the first pitch and the second pitch may be further received. The harmonic-mode decoding unit 250 generates a plurality of harmonic tracks corresponding to the start position information using the pitch information and the start position information and generates an audio signal corresponding to the current frame, that is, an SWB signal, using the plurality of harmonic tracks.
The audio signal processing apparatus according to the present invention may be included in various products. Such products may be largely divided into a stand-alone group and a portable group. The stand-alone group may include a TV, a monitor, a set top box, etc. and the portable group may include a PMP, a mobile phone, a navigation system, etc.
A user authenticating unit 520 receives user information and performs user authentication and may include a fingerprint recognizing unit 520A, an iris recognizing unit 520B, a face recognizing unit 520C and a voice recognizing unit 520D, all of which respectively receive and convert fingerprint information, iris information, face contour information and voice information into user information and determine whether the user information matches previously registered user data so as to perform user authentication.
An input unit 530 enables a user to input various types of commands and may include at least one of a keypad unit 530A, a touch pad unit 530B and a remote controller unit 530C, to which the present invention is not limited.
A signal coding unit 540 encodes and decodes an audio signal and/or a video signal received through the wired/wireless communication unit 510 and outputs an audio signal of a time domain. The signal coding unit includes an audio signal processing apparatus 545 corresponding to the above-described embodiment of the present invention (the encoder 100 and/or the decoder 200 according to the first embodiment or the encoder 300 and/or the decoder 400 according to the second embodiment). The audio signal processing apparatus 545 and the signal coding unit including the same may be implemented by one or more processors.
A control unit 550 receives input signals from input devices and controls all processes of the signal decoding unit 540 and the output unit 560. The output unit 560 is a component for outputting an output signal generated by the signal decoding unit 540 and includes a speaker unit 560A and a display unit 560B. When the output signal is an audio signal, the output signal is output through a speaker and, if the output signal is a video signal, the output signal is output through the display.
The audio signal processing apparatus according to the present invention may be made as a computer-executable program and stored in a computer-readable recording medium, and multimedia data having a data structure according to the present invention may be stored in a computer-readable recording medium. Examples of the computer-readable recording medium include a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disc, optical data storage, and a carrier wave (e.g., data transmission over the Internet). A bitstream generated by the encoding method may be stored in a computer-readable recording medium or transmitted over a wired/wireless communication network.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
The present invention is applicable to encoding and decoding of an audio signal.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/KR2011/000324 | 1/17/2011 | WO | 00 | 11/7/2012 |
Number | Date | Country | |
---|---|---|---|
61295170 | Jan 2010 | US | |
61349192 | May 2010 | US | |
61377448 | Aug 2010 | US | |
61426502 | Dec 2010 | US |