1. Field
Apparatuses and methods consistent with exemplary embodiments relate to audio encoding and decoding, and more particularly, to a method and an apparatus for determining an encoding mode for improving the quality of a reconstructed audio signal, by determining an encoding mode appropriate to characteristics of an audio signal and preventing frequent encoding mode switching, a method and an apparatus for encoding an audio signal, and a method and an apparatus for decoding an audio signal.
2. Description of the Related Art
It is widely known that it is efficient to encode a music signal in the frequency domain and it is efficient to encode a speech signal in the time domain. Therefore, various techniques for classifying the type of an audio signal, in which the music signal and the speech signal are mixed, and determining an encoding mode in correspondence to the classified type have been suggested.
However, due to frequent encoding mode switching, not only do delays occur, but decoded sound quality also deteriorates. Furthermore, since there is no technique for modifying a primarily determined encoding mode, if an error occurs during determination of an encoding mode, the quality of a reconstructed audio signal is deteriorated.
Aspects of one or more exemplary embodiments provide a method and an apparatus for determining an encoding mode for improving the quality of a reconstructed audio signal, by determining an encoding mode appropriate to characteristics of an audio signal, a method and an apparatus for encoding an audio signal, and a method and an apparatus for decoding an audio signal.
Aspects of one or more exemplary embodiments provide a method and an apparatus for determining an encoding mode appropriate to characteristics of an audio signal and reducing delays due to frequent encoding mode switching, a method and an apparatus for encoding an audio signal, and a method and an apparatus for decoding an audio signal.
Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments.
According to an aspect of one or more exemplary embodiments, there is a method of determining an encoding mode, the method including determining one from among a plurality of encoding modes including a first encoding mode and a second encoding mode as an initial encoding mode in correspondence to characteristics of an audio signal, and if there is an error in the determination of the initial encoding mode, generating a modified encoding mode by modifying the initial encoding mode to a third encoding mode.
According to an aspect of one or more exemplary embodiments, there is a method of encoding an audio signal, the method including determining one from among a plurality of encoding modes including a first encoding mode and a second encoding mode as an initial encoding mode in correspondence to characteristics of an audio signal, if there is an error in the determination of the initial encoding mode, generating a modified encoding mode by modifying the initial encoding mode to a third encoding mode, and performing different encoding processes on the audio signal based on either the initial encoding mode or the modified encoding mode.
According to an aspect of one or more exemplary embodiments, there is a method of decoding an audio signal, the method including parsing a bitstream comprising one of an initial encoding mode obtained by determining one from among a plurality of encoding modes including a first encoding mode and a second encoding mode in correspondence to characteristics of an audio signal and a third encoding mode modified from the initial encoding mode if there is an error in the determination of the initial encoding mode, and performing different decoding processes on the bitstream based on either the initial encoding mode or the third encoding mode.
These and/or other aspects will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings in which:
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, the present embodiments may have different forms and should not be construed as being limited to the descriptions set forth herein. Accordingly, the embodiments are merely described below, by referring to the figures, to explain aspects of the present description.
Terms such as “connected” and “linked” may be used to indicate a directly connected or linked state, but it shall be understood that another component may be interposed therebetween.
Terms such as “first” and “second” may be used to describe various components, but the components shall not be limited to the terms. The terms may be used only to distinguish one component from another component.
The units described in exemplary embodiments are independently illustrated to indicate different characteristic functions, and it does not mean that each unit is formed of one separate hardware or software component. Each unit is illustrated for the convenience of explanation, and a plurality of units may form one unit, and one unit may be divided into a plurality of units.
The audio encoding apparatus 100 shown in
Referring to
As described above, by determining the final encoding mode of a current frame based on modification of the initial encoding mode and encoding modes of frames corresponding to a hangover length, an encoding mode adaptive to characteristics of an audio signal may be selected while preventing frequent encoding mode switching between frames.
Generally, the time domain encoding, that is, the time domain excitation encoding may be efficient for a speech signal, the spectrum domain encoding may be efficient for a music signal, and the frequency domain excitation encoding may be efficient for a vocal and/or harmonic signal.
In correspondence to an encoding mode determined by the encoding mode determining unit 110, the switching unit 120 may provide an audio signal to either the spectrum domain encoding unit 130 or the linear prediction domain encoding unit 140. If the linear prediction domain encoding unit 140 is embodied as the time domain excitation encoding unit 141, the switching unit 120 may include a total of two branches. If the linear prediction domain encoding unit 140 is embodied as the time domain excitation encoding unit 141 and the frequency domain excitation encoding unit 143, the switching unit 120 may include a total of three branches.
The spectrum domain encoding unit 130 may encode an audio signal in the spectrum domain. The spectrum domain may refer to the frequency domain or a transform domain. Examples of coding methods applicable to the spectrum domain encoding unit 130 may include advanced audio coding (AAC), or a combination of a modified discrete cosine transform (MDCT) and factorial pulse coding (FPC), but are not limited thereto. In detail, other quantizing techniques and entropy coding techniques may be used instead of the FPC. It may be efficient to encode a music signal in the spectrum domain encoding unit 130.
The linear prediction domain encoding unit 140 may encode an audio signal in a linear prediction domain. The linear prediction domain may refer to an excitation domain or a time domain. The linear prediction domain encoding unit 140 may be embodied as the time domain excitation encoding unit 141 or may be embodied to include the time domain excitation encoding unit 141 and the frequency domain excitation encoding unit 143. Examples of coding methods applicable to the time domain excitation encoding unit 141 may include code excited linear prediction (CELP) or algebraic CELP (ACELP), but are not limited thereto. Examples of coding methods applicable to the frequency domain excitation encoding unit 143 may include general signal coding (GSC) or transform coded excitation (TCX), but are not limited thereto. It may be efficient to encode a speech signal in the time domain excitation encoding unit 141, whereas it may be efficient to encode a vocal and/or harmonic signal in the frequency domain excitation encoding unit 143.
The bitstream generating unit 150 may generate a bitstream to include the encoding mode provided by the encoding mode determining unit 110, a result of encoding provided by the spectrum domain encoding unit 130, and a result of encoding provided by the linear prediction domain encoding unit 140.
The audio encoding apparatus 200 shown in
Referring to
According to an exemplary embodiment, at the common pre-processing module 205, the bandwidth extension processing may be differently performed based on encoding domains. The audio signal in a core band may be processed by using the time domain excitation encoding mode or the frequency domain excitation encoding mode, whereas an audio signal in a bandwidth extended band may be processed in the time domain. The bandwidth extension processing in the time domain may include a plurality of modes including a voiced mode or an unvoiced mode. Alternatively, an audio signal in the core band may be processed by using the spectrum domain encoding mode, whereas an audio signal in the bandwidth extended band may be processed in the frequency domain. The bandwidth extension processing in the frequency domain may include a plurality of modes including a transient mode, a normal mode, or a harmonic mode. To perform bandwidth extension processing in different domains, an encoding mode determined by the encoding mode determining unit 110 may be provided to the common pre-processing module 205 as signaling information. According to an exemplary embodiment, the last portion of the core band and the beginning portion of the bandwidth extended band may overlap each other to some extent. The location and size of the overlapped portions may be set in advance.
The encoding mode determining unit 300 shown in
Referring to
The encoding mode modifying unit 330 may determine a modified encoding mode by modifying the initial encoding mode determined by the initial encoding mode determining unit 310 by using modification parameters. According to an exemplary embodiment, if the spectrum domain encoding mode is determined as the initial encoding mode, the initial encoding mode may be modified to the frequency domain excitation encoding mode based on modification parameters. If the time domain encoding mode is determined as the initial encoding mode, the initial encoding mode may be modified to the frequency domain excitation encoding mode based on modification parameters. In other words, it is determined whether there is an error in determination of the initial encoding mode by using modification parameters. If it is determined that there is no error in the determination of the initial encoding mode, the initial encoding mode may be maintained. On the contrary, if it is determined that there is an error in the determination of the initial encoding mode, the initial encoding mode may be modified. The initial encoding mode may be modified from the spectrum domain encoding mode to the frequency domain excitation encoding mode, or from the time domain excitation encoding mode to the frequency domain excitation encoding mode.
Meanwhile, the initial encoding mode or the modified encoding mode may be a temporary encoding mode for a current frame, where the temporary encoding mode for the current frame may be compared to encoding modes for previous frames within a preset hangover length and the final encoding mode for the current frame may be determined.
The initial encoding mode determining unit 400 shown in
Referring to
First, a first feature parameter F1 relates to a pitch parameter, where a behavior of pitch may be determined by using N pitch values detected in a current frame and at least one previous frame. To prevent an effect from a random deviation or a wrong pitch value, M pitch values significantly different from the average of the N pitch values may be removed. Here, N and M may be values obtained via experiments or simulations in advance. Furthermore, N may be set in advance, and a difference between a pitch value to be removed and the average of the N pitch values may be determined via experiments or simulations in advance. The first feature parameter F1 may be expressed as shown in Equation 1 below by using the average mp and the variance σp′ with respect to (N−M) pitch values.
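The outlier removal described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment itself: the function name `robust_pitch_stats` is hypothetical, a fixed count M of values is removed (rather than applying an experimentally determined distance threshold), and the population variance is used because the text does not state which variance estimator applies. The sketch stops at the average and the variance of the remaining (N−M) pitch values and does not attempt to reproduce Equation 1.

```python
from statistics import fmean, pvariance


def robust_pitch_stats(pitches, m_remove):
    """Return (mean, variance) of the pitch values left after outlier removal.

    pitches  : the N pitch values from the current and previous frames
    m_remove : M, the number of values farthest from the average to discard
               (hypothetical parameter; the text leaves N and M to experiment)
    """
    if not 0 <= m_remove < len(pitches):
        raise ValueError("m_remove must satisfy 0 <= M < N")
    avg = fmean(pitches)
    # Order by distance from the average of all N values and keep the
    # N - M closest ones, so a random or wrong pitch value is dropped.
    kept = sorted(pitches, key=lambda p: abs(p - avg))[: len(pitches) - m_remove]
    # Population variance is an assumption of this sketch.
    return fmean(kept), pvariance(kept)
```

For example, with pitch values 100, 102, 98, 101, and 200 and M set to 1, the value 200 is discarded and the statistics are computed over the remaining four values.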
A second feature parameter F2 also relates to a pitch parameter and may indicate reliability of a pitch value detected in a current frame. The second feature parameter F2 may be expressed as shown in Equation 2 below by using variances σSF1 and σSF2 of pitch values respectively detected in two sub-frames SF1 and SF2 of a current frame.
Here, cov(SF1,SF2) denotes the covariance between the sub-frames SF1 and SF2. In other words, the second feature parameter F2 indicates correlation between two sub-frames as a pitch distance. According to an exemplary embodiment, a current frame may include two or more sub-frames, and Equation 2 may be modified based on the number of sub-frames.
A third feature parameter F3 may be expressed as shown in Equation 3 below based on a voicing parameter Voicing and a correlation parameter Corr.
Here, the voicing parameter Voicing relates to vocal features of sound and may be obtained by any of various methods known in the art, whereas the correlation parameter Corr may be obtained by summing correlations between frames for each band.
A fourth feature parameter F4 relates to a linear prediction error ELPC and may be expressed as shown in Equation 4 below.
Here, M(ELPC) denotes the average of N linear prediction errors.
The determining unit 430 may determine the type of an audio signal by using at least one feature parameter provided by the feature parameter extracting unit 410 and may determine the initial encoding mode based on the determined type. The determining unit 430 may employ a soft decision mechanism, where at least one mixture may be formed per feature parameter. According to an exemplary embodiment, the type of an audio signal may be determined by using the Gaussian mixture model (GMM) based on mixture probabilities. A probability f(x) regarding one mixture may be calculated according to Equation 5 below.
Here, x denotes an input vector of a feature parameter, m denotes a mixture, and c denotes a covariance matrix.
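As an illustration of how a probability regarding one mixture might be evaluated, the following sketch computes a Gaussian density for an input vector x. It is only a sketch under assumptions: it uses the standard Gaussian density, which may differ in detail from Equation 5, and it simplifies the covariance matrix c to a diagonal matrix passed as a list of per-dimension variances. The function name `gaussian_density` and its parameters are hypothetical.

```python
import math


def gaussian_density(x, mean, var_diag):
    """Density at x of a Gaussian with mean `mean` and diagonal covariance.

    var_diag holds the diagonal entries of the covariance matrix; treating
    the covariance as diagonal is a simplification made by this sketch.
    """
    if not (len(x) == len(mean) == len(var_diag)):
        raise ValueError("x, mean and var_diag must have the same length")
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var_diag):
        if vi <= 0.0:
            raise ValueError("variances must be positive")
        # With a diagonal covariance the log-density is a sum of
        # independent one-dimensional Gaussian terms.
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)
```

Accumulating in the log domain is a design choice of the sketch that avoids underflow when the feature vector has many dimensions.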
The determining unit 430 may calculate a music probability Pm and a speech probability Ps by using Equation 6 below.
Here, the music probability Pm may be calculated by adding probabilities Pi of M mixtures related to feature parameters superior for music determination, whereas the speech probability Ps may be calculated by adding probabilities Pi of S mixtures related to feature parameters superior for speech determination.
Meanwhile, for improved precision, the music probability Pm and the speech probability Ps may be calculated according to Equation 7 below.
Here, Pierr denotes the error probability of each mixture. The error probability may be obtained by classifying training data including clean speech signals and clean music signals by using each of the mixtures and counting the number of wrong classifications.
Next, the probability PM that all frames include music signals only and the speech probability PS that all frames include speech signals only with respect to a plurality of frames as many as a constant hangover length may be calculated according to Equation 8 below. The hangover length may be set to 8, but is not limited thereto. Eight frames may include a current frame and 7 previous frames.
Next, a plurality of condition sets {DiM} and {DiS} may be calculated by using the music probability Pm or the speech probability Ps obtained using Equation 5 or Equation 6. Detailed descriptions thereof will be given below with reference to
Referring to
In an operation 630, the sum of music conditions M is compared to a designated threshold value Tm. If the sum of music conditions M is greater than the threshold value Tm, an encoding mode of a current frame is switched to a music mode, that is, the spectrum domain encoding mode. If the sum of music conditions M is smaller than or equal to the threshold value Tm, the encoding mode of the current frame is not changed.
In an operation 640, the sum of speech conditions S is compared to a designated threshold value Ts. If the sum of speech conditions S is greater than the threshold value Ts, an encoding mode of a current frame is switched to a speech mode, that is, the linear prediction domain encoding mode. If the sum of speech conditions S is smaller than or equal to the threshold value Ts, the encoding mode of the current frame is not changed.
The threshold value Tm and the threshold value Ts may be set to values obtained via experiments or simulations in advance.
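The threshold comparisons of the operations 630 and 640 may be sketched as follows. This is a hedged illustration: the function name `update_mode`, the mode labels, and the threshold values in the usage example are hypothetical, and the text does not state what happens when both sums exceed their thresholds, so the sketch simply applies the two checks in the order in which the operations are numbered.

```python
MUSIC = "spectrum_domain"            # music mode
SPEECH = "linear_prediction_domain"  # speech mode


def update_mode(current_mode, music_sum, speech_sum, t_m, t_s):
    """Apply the comparisons of operations 630 and 640 to one frame.

    music_sum / speech_sum : the sums M and S of music and speech conditions
    t_m / t_s              : the threshold values Tm and Ts
    """
    mode = current_mode
    # Operation 630: switch to the music mode only if M is strictly greater
    # than Tm; otherwise the encoding mode is not changed.
    if music_sum > t_m:
        mode = MUSIC
    # Operation 640: switch to the speech mode only if S is strictly greater
    # than Ts. Running this check second is an assumption of the sketch.
    if speech_sum > t_s:
        mode = SPEECH
    return mode
```

For instance, `update_mode(SPEECH, 5, 0, 3, 3)` switches to the music mode, whereas `update_mode(SPEECH, 3, 0, 3, 3)` leaves the mode unchanged because the sum is only equal to the threshold.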
An initial encoding mode determining unit 500 shown in
In
The spectral parameter extracting unit 520 may extract at least one spectral parameter from a frequency domain audio signal provided by the transform unit 510. Spectral parameters may be categorized into short-term feature parameters and long-term feature parameters. The short-term feature parameters may be obtained from a current frame, whereas the long-term feature parameters may be obtained from a plurality of frames including the current frame and at least one previous frame.
The temporal parameter extracting unit 530 may extract at least one temporal parameter from a time domain audio signal. Temporal parameters may also be categorized into short-term feature parameters and long-term feature parameters. The short-term feature parameters may be obtained from a current frame, whereas the long-term feature parameters may be obtained from a plurality of frames including the current frame and at least one previous frame.
A determining unit (430 of
Referring to
In an operation 701, if it is determined in the operation 700 that the initial encoding mode is the spectrum domain mode (stateTS==1), an index stateTTSS indicating whether the frequency domain excitation encoding is more appropriate may be checked. The index stateTTSS indicating whether the frequency domain excitation encoding (e.g., GSC) is more appropriate may be obtained by using tonalities of different frequency bands. Detailed descriptions thereof will be given below.
Tonality of a low band signal may be obtained as a ratio between a sum of a plurality of spectrum coefficients having small values including the smallest value and the spectrum coefficient having the largest value with respect to a given band. If given bands are 0 to 1 kHz, 1 to 2 kHz, and 2 to 4 kHz, tonalities t01, t12, and t24 of the respective bands and tonality tL of a low band signal, that is, the core band may be expressed as shown in Equation 10 below.
Meanwhile, the linear prediction error err may be obtained by using a linear prediction coding (LPC) filter and may be used to remove strong tonal components. In other words, the spectrum domain encoding mode may be more efficient with respect to strong tonal components than the frequency domain excitation encoding mode.
A front condition condfront for switching to the frequency domain excitation encoding mode by using the tonalities and the linear prediction error obtained as described above may be expressed as shown in Equation 11 below.
condfront=t12>t12front and t24>t24front and tL>tLfront and err>errfront [Equation 11]
Here, t12front, t24front, tLfront, and errfront are threshold values and may have values obtained via experiments or simulations in advance.
Meanwhile, a back condition condback for finishing the frequency domain excitation encoding mode by using the tonalities and the linear prediction error obtained as described above may be expressed as shown in Equation 12 below.
condback=t12<t12back and t24<t24back and tL<tLback [Equation 12]
Here, t12back, t24back, tLback are threshold values and may have values obtained via experiments or simulations in advance.
In other words, it may be determined whether the index stateTTSS indicating whether the frequency domain excitation encoding (e.g., GSC) is more appropriate than the spectrum domain encoding is 1 by determining whether the front condition shown in Equation 11 is satisfied or the back condition shown in Equation 12 is not satisfied. Here, the determination of the back condition shown in Equation 12 may be optional.
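The front and back conditions of Equations 11 and 12 may be read as a hysteresis on the index stateTTSS: the front condition enters the state, and the state is then held until the back condition is satisfied. The following sketch shows that reading. It is one possible interpretation rather than the definitive logic of the embodiment, and the names `Thresholds` and `update_state_ttss` as well as all threshold values used with it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    t12: float
    t24: float
    tL: float
    err: float = 0.0  # used by the front condition only


def cond_front(t12, t24, tL, err, th):
    # Equation 11: every tonality and the prediction error exceed thresholds.
    return t12 > th.t12 and t24 > th.t24 and tL > th.tL and err > th.err


def cond_back(t12, t24, tL, th):
    # Equation 12: every tonality falls below its back threshold.
    return t12 < th.t12 and t24 < th.t24 and tL < th.tL


def update_state_ttss(prev_state, t12, t24, tL, err, front_th, back_th=None):
    """Return 1 if frequency domain excitation encoding is judged more appropriate."""
    if cond_front(t12, t24, tL, err, front_th):
        return 1
    # Hold a previously set state while the back condition is not satisfied.
    # The back check is optional; without it only the front condition counts.
    if prev_state == 1 and back_th is not None and not cond_back(t12, t24, tL, back_th):
        return 1
    return 0
```

Keeping the front thresholds above the back thresholds leaves a band in which the state does not toggle, which is in line with the stated aim of preventing frequent encoding mode switching.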
In an operation 702, if the index stateTTSS is 1, the frequency domain excitation encoding mode may be determined as the final encoding mode. In this case, the spectrum domain encoding mode, which is the initial encoding mode, is modified to the frequency domain excitation encoding mode, which is the final encoding mode.
In an operation 705, if it is determined in the operation 701 that the index stateTTSS is 0, an index stateSS for determining whether an audio signal includes a strong speech characteristic may be checked. If there is an error in the determination of the spectrum domain encoding mode, the frequency domain excitation encoding mode may be more efficient than the spectrum domain encoding mode. The index stateSS for determining whether an audio signal includes a strong speech characteristic may be obtained by using a difference vc between a voicing parameter and a correlation parameter.
A front condition condfront for switching to a strong speech mode by using the difference vc between a voicing parameter and a correlation parameter may be expressed as shown in Equation 13 below.
condfront=vc>vcfront [Equation 13]
Here, vcfront is a threshold value and may have a value obtained via experiments or simulations in advance.
Meanwhile, a back condition condback for finishing the strong speech mode by using the difference vc between a voicing parameter and a correlation parameter may be expressed as shown in Equation 14 below.
condback=vc<vcback [Equation 14]
Here, vcback is a threshold value and may have a value obtained via experiments or simulations in advance.
In other words, in an operation 705, it may be determined whether the index stateSS indicating whether the frequency domain excitation encoding (e.g. GSC) is more appropriate than the spectrum domain encoding is 1 by determining whether the front condition shown in Equation 13 is satisfied or the back condition shown in Equation 14 is not satisfied. Here, the determination of the back condition shown in Equation 14 may be optional.
In an operation 706, if it is determined in the operation 705 that the index stateSS is 0, i.e. the audio signal does not include a strong speech characteristic, the spectrum domain encoding mode may be determined as the final encoding mode. In this case, the spectrum domain encoding mode, which is the initial encoding mode, is maintained as the final encoding mode.
In an operation 707, if it is determined in the operation 705 that the index stateSS is 1, i.e. the audio signal includes a strong speech characteristic, the frequency domain excitation encoding mode may be determined as the final encoding mode. In this case, the spectrum domain encoding mode, which is the initial encoding mode, is modified to the frequency domain excitation encoding mode, which is the final encoding mode.
By performing the operations 700, 701, and 705, an error in the determination of the spectrum domain encoding mode as the initial encoding mode may be corrected. In detail, the spectrum domain encoding mode, which is the initial encoding mode, may be maintained or switched to the frequency domain excitation encoding mode as the final encoding mode.
Meanwhile, if it is determined in the operation 700 that the initial encoding mode is the linear prediction domain encoding mode (stateTS==0), an index stateSM for determining whether an audio signal includes a strong music characteristic may be checked. If there is an error in the determination of the linear prediction domain encoding mode, that is, the time domain excitation encoding mode, the frequency domain excitation encoding mode may be more efficient than the time domain excitation encoding mode. The stateSM for determining whether an audio signal includes a strong music characteristic may be obtained by using a value 1−vc obtained by subtracting the difference vc between a voicing parameter and a correlation parameter from 1.
A front condition condfront for switching to a strong music mode by using the value 1−vc obtained by subtracting the difference vc between a voicing parameter and a correlation parameter from 1 may be expressed as shown in Equation 15 below.
condfront=1−vc>vcmfront [Equation 15]
Here, vcmfront is a threshold value and may have a value obtained via experiments or simulations in advance.
Meanwhile, a back condition condback for finishing the strong music mode by using the value 1−vc obtained by subtracting the difference vc between a voicing parameter and a correlation parameter from 1 may be expressed as shown in Equation 16 below.
condback=1−vc<vcmback [Equation 16]
Here, vcmback is a threshold value and may have a value obtained via experiments or simulations in advance.
In other words, in an operation 709, it may be determined whether the index stateSM indicating whether the frequency domain excitation encoding (e.g. GSC) is more appropriate than the time domain excitation encoding is 1 by determining whether the front condition shown in Equation 15 is satisfied or the back condition shown in Equation 16 is not satisfied. Here, the determination of the back condition shown in Equation 16 may be optional.
In an operation 710, if it is determined in the operation 709 that the index stateSM is 0, i.e., the audio signal does not include a strong music characteristic, the time domain excitation encoding mode may be determined as the final encoding mode. In this case, the linear prediction domain encoding mode, which is the initial encoding mode, is switched to the time domain excitation encoding mode as the final encoding mode. According to an exemplary embodiment, it may be considered that the initial encoding mode is maintained without modification, if the linear prediction domain encoding mode corresponds to the time domain excitation encoding mode.
In an operation 707, if it is determined in the operation 709 that the index stateSM is 1, i.e., the audio signal includes a strong music characteristic, the frequency domain excitation encoding mode may be determined as the final encoding mode. In this case, the linear prediction domain encoding mode, which is the initial encoding mode, is modified to the frequency domain excitation encoding mode, which is the final encoding mode.
By performing the operations 700 and 709, an error in the determination of the initial encoding mode may be corrected. In detail, the linear prediction domain encoding mode (e.g., the time domain excitation encoding mode), which is the initial encoding mode, may be maintained or switched to the frequency domain excitation encoding mode as the final encoding mode.
According to an exemplary embodiment, the operation 709 for determining whether the audio signal includes a strong music characteristic for correcting an error in the determination of the linear prediction domain encoding mode may be optional.
According to another exemplary embodiment, a sequence of performing the operation 705 for determining whether the audio signal includes a strong speech characteristic and the operation 701 for determining whether the frequency domain excitation encoding mode is appropriate may be reversed. In other words, after the operation 700, the operation 705 may be performed first, and then the operation 701 may be performed. In this case, parameters used for the determinations may be changed as occasions demand.
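The decision flow of the operations 700 through 710 described above may be summarized by the following sketch. The indices stateTS, stateTTSS, stateSS, and stateSM are taken as already computed inputs; the enumeration `Mode` and the function `modify_mode` are hypothetical names introduced only for illustration, and the default of 0 for `state_sm` reflects the statement that the operation 709 may be optional.

```python
from enum import Enum


class Mode(Enum):
    SPECTRUM = "spectrum domain encoding"
    TIME_EXCITATION = "time domain excitation encoding"
    FREQ_EXCITATION = "frequency domain excitation encoding"


def modify_mode(state_ts, state_ttss, state_ss, state_sm=0):
    """Map the four indices to the encoding mode after modification."""
    if state_ts == 1:
        # Operation 700: the initial mode is the spectrum domain mode.
        if state_ttss == 1:
            return Mode.FREQ_EXCITATION  # operations 701 and 702
        if state_ss == 1:
            return Mode.FREQ_EXCITATION  # operations 705 and 707
        return Mode.SPECTRUM             # operation 706: initial mode kept
    # Operation 700: the initial mode is the linear prediction domain mode.
    if state_sm == 1:
        return Mode.FREQ_EXCITATION      # operations 709 and 707
    return Mode.TIME_EXCITATION          # operation 710
```

Because the operations 701 and 705 both lead to the same modified mode, reversing their order as in the alternative embodiment would not change the result of this sketch for given index values.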
The audio decoding apparatus 800 shown in
Referring to
The spectrum domain decoding unit 820 may decode data encoded in the spectrum domain from the separated encoded data.
The linear prediction domain decoding unit 830 may decode data encoded in the linear prediction domain from the separated encoded data. If the linear prediction domain decoding unit 830 includes the time domain excitation decoding unit 831 and the frequency domain excitation decoding unit 833, the linear prediction domain decoding unit 830 may perform time domain excitation decoding or frequency domain excitation decoding with respect to the separated encoded data.
The switching unit 840 may switch either a signal reconstructed by the spectrum domain decoding unit 820 or a signal reconstructed by the linear prediction domain decoding unit 830 and may provide the switched signal as a final reconstructed signal.
The audio decoding apparatus 900 may include a bitstream parsing unit 910, a spectrum domain decoding unit 920, a linear prediction domain decoding unit 930, a switching unit 940, and a common post-processing module 950. The linear prediction domain decoding unit 930 may include a time domain excitation decoding unit 931 and a frequency domain excitation decoding unit 933, where the linear prediction domain decoding unit 930 may be embodied as at least one of the time domain excitation decoding unit 931 and the frequency domain excitation decoding unit 933. Unless it is necessary to be embodied as separate hardware, the above-stated components may be integrated into at least one module and may be implemented as at least one processor (not shown). Compared to the audio decoding apparatus 800 shown in
Referring to
The methods according to the exemplary embodiments can be written as computer-executable programs and can be implemented in general-use digital computers that execute the programs by using a non-transitory computer-readable recording medium. In addition, data structures, program instructions, or data files, which can be used in the embodiments, can be recorded on a non-transitory computer-readable recording medium in various ways. The non-transitory computer-readable recording medium is any data storage device that can store data which can be thereafter read by a computer system. Examples of the non-transitory computer-readable recording medium include magnetic storage media, such as hard disks, floppy disks, and magnetic tapes, optical recording media, such as CD-ROMs and DVDs, magneto-optical media, such as optical disks, and hardware devices, such as ROM, RAM, and flash memory, specially configured to store and execute program instructions. In addition, the non-transitory computer-readable recording medium may be a transmission medium for transmitting signals designating program instructions, data structures, or the like. Examples of the program instructions may include not only machine language codes created by a compiler but also high-level language codes executable by a computer using an interpreter or the like.
While exemplary embodiments have been particularly shown and described above, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the inventive concept as defined by the appended claims. The exemplary embodiments should be considered in descriptive sense only and not for purposes of limitation. Therefore, the scope of the inventive concept is defined not by the detailed description of the exemplary embodiments but by the appended claims, and all differences within the scope will be construed as being included in the present inventive concept.
This application claims the benefit of U.S. Provisional Application No. 61/725,694, filed on Nov. 13, 2012, in the United States Patent and Trademark Office, the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
61725694 | Nov 2012 | US