One or more exemplary embodiments relate to audio encoding, and more particularly, to a signal classification method and apparatus capable of improving the quality of a restored sound and reducing a delay due to encoding mode switching and an audio encoding method and apparatus employing the same.
It is well known that a music signal is efficiently encoded in a frequency domain and a speech signal is efficiently encoded in a time domain. Therefore, various techniques of classifying whether an audio signal in which a music signal and a speech signal are mixed corresponds to the music signal or the speech signal and determining a coding mode in response to a result of the classification have been proposed.
However, frequent switching of coding modes induces the occurrence of a delay and deterioration of the quality of a restored sound, and a technique of correcting an initial classification result has not been proposed, and thus when there is an error in an initial signal classification, the deterioration of restored sound quality occurs.
One or more exemplary embodiments include a signal classification method and apparatus capable of improving restored sound quality by determining a coding mode so as to be suitable for characteristics of an audio signal and an audio encoding method and apparatus employing the same.
One or more exemplary embodiments include a signal classification method and apparatus capable of reducing a delay due to coding mode switching while determining a coding mode so as to be suitable for characteristics of an audio signal and an audio encoding method and apparatus employing the same.
According to one or more exemplary embodiments, a signal classification method includes: classifying a current frame as one of a speech signal and a music signal; determining whether there is an error in a classification result of the current frame, based on feature parameters obtained from a plurality of frames; and correcting the classification result of the current frame in response to a result of the determination.
According to one or more exemplary embodiments, a signal classification apparatus includes at least one processor configured to classify a current frame as one of a speech signal and a music signal, determine whether there is an error in a classification result of the current frame, based on feature parameters obtained from a plurality of frames, and correct the classification result of the current frame in response to a result of the determination.
According to one or more exemplary embodiments, an audio encoding method includes: classifying a current frame as one of a speech signal and a music signal; determining whether there is an error in a classification result of the current frame, based on feature parameters obtained from a plurality of frames; correcting the classification result of the current frame in response to a result of the determination; and encoding the current frame based on the classification result of the current frame or the corrected classification result.
According to one or more exemplary embodiments, an audio encoding apparatus includes at least one processor configured to classify a current frame as one of a speech signal and a music signal, determine whether there is an error in a classification result of the current frame, based on feature parameters obtained from a plurality of frames, correct the classification result of the current frame in response to a result of the determination, and encode the current frame based on the classification result of the current frame or the corrected classification result.
By correcting an initial classification result of an audio signal based on a correction parameter, frequent switching of coding modes may be prevented while determining a coding mode optimized to characteristics of the audio signal.
Hereinafter, an aspect of the present invention is described in detail with respect to the drawings. In the following description, when it is determined that a detailed description of relevant well-known functions or functions may obscure the essentials, the detailed description is omitted.
When it is described that a certain element is ‘connected’ or ‘linked’ to another element, it should be understood that the certain element may be connected or linked to another element directly or via another element in the middle.
Although terms, such as ‘first’ and ‘second’, can be used to describe various elements, the elements cannot be limited by the terms. The terms can be used to classify a certain element from another element.
Components appearing in the embodiments are independently shown to represent different characterized functions, and it is not indicated that each component is formed in separated hardware or a single software configuration unit. The components are shown as individual components for convenience of description, and one component may be formed by combining two of the components, or one component may be separated into a plurality of components to perform functions.
An audio signal classification apparatus 100 shown in
Referring to
According to another exemplary embodiment, an audio signal classification process may include a first operation of classifying an audio signal as a speech signal and a generic audio signal, i.e., a music signal, according to whether the audio signal has a speech characteristic and a second operation of determining whether the generic audio signal is suitable for a generic signal audio coder (GSC). Whether the audio signal can be classified as a speech signal or a music signal may be determined by combining a classification result of the first operation and a classification result of the second operation. When the audio signal is classified as a speech signal, the audio signal may be encoded by a CELP-type coder. The CELP-type coder may include a plurality of modes among an unvoiced coding (UC) mode, a voiced coding (VC) mode, a transient coding (TC) mode, and a generic coding (GC) mode according to a bit rate or a signal characteristic. A generic signal audio coding (GSC) mode may be implemented by a separate coder or included as one mode of the CELP-type coder. When the audio signal is classified as a music signal, the audio signal may be encoded using the transform coder or a CELP/transform hybrid coder. In detail, the transform coder may be applied to a music signal, and the CELP/transform hybrid coder may be applied to a non-music signal, which is not a speech signal, or a signal in which music and speech are mixed. According to an embodiment, according to bandwidths, all of the CELP-type coder, the CELP/transform hybrid coder, and the transform coder may be used, or the CELP-type coder and the transform coder may be used. For example, the CELP-type coder and the transform coder may be used for a narrowband (NB), and the CELP-type coder, the CELP/transform hybrid coder, and the transform coder may be used for a wideband (WB), a super-wideband (SWB), and a full band (FB). The CELP/transform hybrid coder is obtained by combining an LP-based coder which operates in a time domain and a transform domain coder, and may be also referred to as a generic signal audio coder (GSC).
The signal classification of the first operation may be based on a Gaussian mixture model (GMM). Various signal characteristics may be used for the GMM. Examples of the signal characteristics may include open-loop pitch, normalized correlation, spectral envelope, tonal stability, signal's non-stationarity, LP residual error, spectral difference value, and spectral stationarity but are not limited thereto. Examples of signal characteristics used for the signal classification of the second operation may include spectral energy variation characteristic, tilt characteristic of LP analysis residual energy, high-band spectral peakiness characteristic, correlation characteristic, voicing characteristic, and tonal characteristic but are not limited thereto. The characteristics used for the first operation may be used to determine whether the audio signal has a speech characteristic or a non-speech characteristic in order to determine whether the CELP-type coder is suitable for encoding, and the characteristics used for the second operation may be used to determine whether the audio signal has a music characteristic or a non-music characteristic in order to determine whether the GSC is suitable for encoding. For example, one set of frames classified as a music signal in the first operation may be changed to a speech signal in the second operation and then encoded by one of the CELP modes. That is, when the audio signal is a signal of large correlation or an attack signal while having a large pitch period and high stability, the audio signal may be changed from a music signal to a speech signal in the second operation. A coding mode may be changed according to a result of the signal classification described above.
The corrector 130 may correct or maintain the classification result of the signal classifier 110 based on at least one correction parameter. The corrector 130 may correct or maintain the classification result of the signal classifier 110 based on context. For example, when a current frame is classified as a speech signal, the current frame may be corrected to a music signal or maintained as the speech signal, and when the current frame is classified as a music signal, the current frame may be corrected to a speech signal or maintained as the music signal. To determine whether there is an error in a classification result of the current frame, characteristics of a plurality of frames including the current frame may be used. For example, eight frames may be used, but the embodiment is not limited thereto.
The correction parameter may include a combination of at least one of characteristics such as tonality, linear prediction error, voicing, and correlation. Herein, the tonality may include tonality ton2 of a range of 1-2 KHz and tonality ton3 of a range of 2-4 KHz, which may be defined by Equations 1 and 2, respectively.
where a superscript [−j] denotes a previous frame. For example, tonality2[−1] denotes tonality of a range of 1-2 KHz of a one-frame previous frame.
Low-band long-term tonality tonLT may be defined as tonLT=0.2* log10[It_tonality]. Herein, It_tonality may denote full-band long-term tonality.
A difference dft between tonality ton2 of a range of 1-2 KHz and tonality ton3 of a range of 2-4 KHz in an nth frame may be defined as dft=0.2* {log10(tonality2(n))−log10(tonality3(n))).
Next, a linear prediction error LPerr may be defined by Equation 3.
where FVs(9) is defined as FVs(i)=sfaiFVi+sfbi (i=0, . . . , 11) and corresponds to a value obtained by scaling an LP residual log-energy ratio feature parameter defined by Equation 4 among feature parameters used for the signal classifier 110 or 210. In addition, sfai and sfbi may vary according to types of feature parameters and bandwidths and are used to approximate each feature parameter to a range of [0;1].
where E(1) denotes energy of a first LP coefficient, and E(13) denotes energy of a 13th LP coefficient.
Next, a difference dvcor between a value FVs(1) obtained by scaling a normalized correlation feature or a voicing feature FV1, which is defined by Equation 5 among the feature parameters used for the signal classifier 110 or 210, based on FVs(i)=sfaiFVi+sfbi (i=0, . . . , 11) and a value FVs(7) obtained by scaling a correlation map feature FV(7), which is defined by Equation 6, based on FVs(i)=sfaiFVi+sfbi (i=0, . . . , 11) may be defined as dvcor=max(FVs(1)-FVs(7),0).
FV1=Cnorm[.] (5)
where Cnorm[.] denotes a normalized correlation in a first or second half frame.
where Mcor denotes a correlation map of a frame.
A correction parameter including at least one of conditions 1 through 4 may be generated using the plurality of feature parameters, taken alone or in combination. Herein, the conditions 1 and 2 may indicate conditions by which a speech state SPEECH_STATE can be changed, and the conditions 3 and 4 may indicate conditions by which a music state MUSIC_STATE can be changed. In detail, the condition 1 enables the speech state SPEECH_STATE to be changed from 0 to 1, and the condition 2 enables the speech state SPEECH_STATE to be changed from 1 to 0. In addition, the condition 3 enables the music state MUSIC_STATE to be changed from 0 to 1, and the condition 4 enables the music state MUSIC_STATE to be changed from 1 to 0. The speech state SPEECH_STATE of 1 may indicate that a speech probability is high, that is, CELP-type coding is suitable, and the speech state SPEECH_STATE of 0 may indicate that non-speech probability is high. The music state MUSIC_STATE of 1 may indicate that transform coding is suitable, and the music state MUSIC_STATE of 0 may indicate that CELP/transform hybrid coding, i.e., GSC, is suitable. As another example, the music state MUSIC_STATE of 1 may indicate that transform coding is suitable, and the music state MUSIC_STATE of 0 may indicate that CELP-type coding is suitable.
The condition 1 (fA) may be defined, for example, as follows. That is, when dvcor>0.4 AND dft<0.1 AND FVs(1)>(2*FVs(7)+0.12) AND ton2<dvcor AND ton3<dvcor AND tonLT<dvcor AND FVs(7)<dvcor AND FVs(1)>dvcor AND FVs(1)>0.76, fA may be set to 1.
The condition 2 (fB) may be defined, for example, as follows. That is, when dvcor<0.4, fB may be set to 1.
The condition 3 (fc) may be defined, for example, as follows. That is, when 0.26<ton2<0.54 AND ton3>0.22 AND 0.26<tonLT<0.54 AND LPerr>0.5, fC may be set to 1.
The condition 4 (fD) may be defined, for example, as follows. That is, when ton2<0.34 AND ton3<0.26 AND 0.26<tonLT<0.45, fD may be set to 1.
A feature or a set of features used to generate each condition is not limited thereto. In addition, each constant value is only illustrative and may be set to an optimal value according to an implementation method.
In detail, the corrector 130 may correct errors in the initial classification result by using two independent state machines, for example, a speech state machine and a music state machine. Each state machine has two states, and hangover may be used in each state to prevent frequent transitions. The hangover may include, for example, six frames. When a hangover variable in the speech state machine is indicated by hangsp, and a hangover variable in the music state machine is indicated by hangmus, if a classification result is changed in a given state, each variable is initialized to 6, and thereafter, hangover decreases by 1 for each subsequent frame. A state change may occur only when hangover decreases to zero. In each state machine, a correction parameter generated by combining at least one feature extracted from the audio signal may be used.
An audio signal classification apparatus 200 shown in
Referring to
An audio encoding apparatus 300 shown in
Referring to
In the encoding module 330, the first coder 331 may operate when the classification result of the corrector 130 or 230 corresponds to a speech signal. The second coder 333 may operate when the classification result of the corrector 130 corresponds to a music signal, or when the classification result of the fine classifier 350 corresponds to a speech signal. The third coder 335 may operate when the classification result of the corrector 130 corresponds to a music signal, or when the classification result of the fine classifier 350 corresponds to a music signal.
Referring to
In operation 420, it may be determined whether the initial classification result, i.e., the speech state, is 0, the condition 1 (fA) is 1, and the hangover hangsp of the speech state machine is 0. If it is determined in operation 420 that the initial classification result, i.e., the speech state, is 0, the condition 1 is 1, and the hangover hangsp of the speech state machine is 0, in operation 430, the speech state may be changed to 1, and the hangover may be initialized to 6. The initialized hangover value may be provided to operation 460. Otherwise, if the speech state is not 0, the condition 1 is not 1, or the hangover hangsp of the speech state machine is not 0 in operation 420, the method may proceed to operation 440.
In operation 440, it may be determined whether the initial classification result, i.e., the speech state, is 1, the condition 2 (fB) is 1, and the hangover hangsp of the speech state machine is 0. If it is determined in operation 440 that the speech state is 1, the condition 2 is 1, and the hangover hangsp of the speech state machine is 0, in operation 450, the speech state may be changed to 0, and the hangoversp may be initialized to 6. The initialized hangover value may be provided to operation 460. Otherwise, if the speech state is not 1, the condition 2 is not 1, or the hangover hangsp of the speech state machine is not 0 in operation 440, the method may proceed to operation 460 to perform a hangover update for decreasing the hangover by 1.
Referring to
In operation 520, it may be determined whether the initial classification result, i.e., the music state, is 1, the condition 3(fc) is 1, and the hangover hangmus of the music state machine is 0. If it is determined in operation 520 that the initial classification result, i.e., the music state, is 1, the condition 3 is 1, and the hangover hangmus of the music state machine is 0, in operation 530, the music state may be changed to 0, and the hangover may be initialized to 6. The initialized hangover value may be provided to operation 560. Otherwise, if the music state is not 1, the condition 3 is not 1, or the hangover hangmus of the music state machine is not 0 in operation 520, the method may proceed to operation 540.
In operation 540, it may be determined whether the initial classification result, i.e., the music state, is 0, the condition 4 (fD) is 1, and the hangover hangmus of the music state machine is 0. If it is determined in operation 540 that the music state is 0, the condition 4 is 1, and the hangover hangmus of the music state machine is 0, in operation 550, the music state may be changed to 1, and the hangover hangmus may be initialized to 6. The initialized hangover value may be provided to operation 560. Otherwise, if the music state is not 0, the condition 4 is not 1, or the hangover hangmus of the music state machine is not 0 in operation 540, the method may proceed to operation 560 to perform a hangover update for decreasing the hangover by 1.
Referring to
Referring to
The coding mode determination apparatus shown in
Referring to
When the initial coding mode is determined as the first coding mode, the corrector 830 may correct the initial coding mode to the second coding mode based on correction parameters. For example, when an initial classification result indicates a speech signal but has a music characteristic, the initial classification result may be corrected to a music signal. When the initial coding mode is determined as the second coding mode, the corrector 830 may correct the initial coding mode to the first coding mode or the third coding mode based on correction parameters. For example, when an initial classification result indicates a music signal but has a speech characteristic, the initial classification result may be corrected to a speech signal.
Referring to
In operation 930, it may be determined based on correction parameters whether there is an error in the classification result of operation 910. If it is determined in operation 930 that there is an error in the classification result, the classification result may be corrected in operation 950. If it is determined in operation 930 that there is no error in the classification result, the classification result may be maintained as it is in operation 970. Operations 930 through 970 may be performed by the corrector 130 or 230 of
A multimedia device 1000 shown in
Referring to
The communication unit 1010 is configured to enable transmission and reception of data to and from an external multimedia device or server through a wireless network such as wireless Internet, a wireless intranet, a wireless telephone network, a wireless local area network (LAN), a Wi-Fi network, a Wi-Fi Direct (WFD) network, a third generation (3G) network, a 4G network, a Bluetooth network, an infrared data association (IrDA) network, a radio frequency identification (RFID) network, an ultra wideband (UWB) network, a ZigBee network, and a near field communication (NFC) network or a wired network such as a wired telephone network or wired Internet.
The encoding module 1030 may encode an audio signal of the time domain, which is provided through the communication unit 1010 or the microphone 1070, according to an embodiment. The encoding process may be implemented using the apparatus or method shown in
The storage unit 1050 may store various programs required to operate the multimedia device 1000.
The microphone 1070 may provide an audio signal of a user or the outside to the encoding module 1030.
A multimedia device 1100 shown in
A detailed description of the same components as those in the multimedia device 1000 shown in
The decoding module 1130 may receive a bitstream provided through the communication unit 1110 and decode an audio spectrum included in the bitstream. The decoding module 1130 may be implemented in correspondence to the encoding module 330 of
The speaker 1170 may output a reconstructed audio signal generated by the decoding module 1130 to the outside.
The multimedia devices 1000 and 1100 shown in
When the multimedia device 1000 or 1100 is, for example, a mobile phone, although not shown, a user input unit such as a keypad, a display unit for displaying a user interface or information processed by the mobile phone, and a processor for controlling a general function of the mobile phone may be further included. In addition, the mobile phone may further include a camera unit having an image pickup function and at least one component for performing functions required by the mobile phone.
When the multimedia device 1000 or 1100 is, for example, a TV, although not shown, a user input unit such as a keypad, a display unit for displaying received broadcast information, and a processor for controlling a general function of the TV may be further included. In addition, the TV may further include at least one component for performing functions required by the TV.
The methods according to the embodiments may be edited by computer-executable programs and implemented in a general-use digital computer for executing the programs by using a computer-readable recording medium. In addition, data structures, program commands, or data files usable in the embodiments of the present invention may be recorded in the computer-readable recording medium through various means. The computer-readable recording medium may include all types of storage devices for storing data readable by a computer system. Examples of the computer-readable recording medium include magnetic media such as hard discs, floppy discs, or magnetic tapes, optical media such as compact disc-read only memories (CD-ROMs), or digital versatile discs (DVDs), magneto-optical media such as floptical discs, and hardware devices that are specially configured to store and carry out program commands, such as ROMs, RAMs, or flash memories. In addition, the computer-readable recording medium may be a transmission medium for transmitting a signal for designating program commands, data structures, or the like. Examples of the program commands include a high-level language code that may be executed by a computer using an interpreter as well as a machine language code made by a compiler.
Although the embodiments of the present invention have been described with reference to the limited embodiments and drawings, the embodiments of the present invention are not limited to the embodiments described above, and their updates and modifications could be variously carried out by those of ordinary skill in the art from the disclosure. Therefore, the scope of the present invention is defined not by the above description but by the claims, and all their uniform or equivalent modifications would belong to the scope of the technical idea of the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/KR2015/001783 | 2/24/2015 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62029672 | Jul 2014 | US | |
61943638 | Feb 2014 | US |