The present disclosure relates to a method and apparatus for enhancing speech quality based on bandwidth extension, a speech decoding method and apparatus, and a multimedia device employing the same.
Various techniques for increasing speech call quality in a terminal such as a mobile phone or a tablet PC have been developed. For example, the quality of a speech signal to be provided from a transmission end may be enhanced through pre-processing. Specifically, speech quality may be enhanced by detecting the characteristics of ambient noise and removing the noise from the speech signal to be provided from the transmission end. As another example, speech quality may be enhanced by equalizing a speech signal restored by a reception end in consideration of the characteristics of the ears of a terminal user. As another example, the speech quality of the restored speech signal may be enhanced by providing the reception end with a plurality of presets prepared in consideration of the general characteristics of the ears and allowing the terminal user to select and use one of them.
In addition, speech quality may be enhanced by extending the frequency bandwidth of a codec used for a call in the terminal. In particular, a technique for extending the bandwidth without changing the configuration of a standardized codec is required.
Provided are a method and apparatus for enhancing speech quality based on bandwidth extension.
Provided are a speech decoding method and apparatus for enhancing speech quality based on bandwidth extension.
Provided is a multimedia device employing a function of enhancing speech quality based on bandwidth extension.
According to a first aspect of the present disclosure, a method of enhancing speech quality includes: generating a high-frequency signal by using a low-frequency signal in a time domain; combining the low-frequency signal with the high-frequency signal; transforming the combined signal into a spectrum in a frequency domain; determining a class of a decoded speech signal; predicting an envelope from a low-frequency spectrum obtained in the transforming; and generating a final high-frequency spectrum by applying the predicted envelope to a high-frequency spectrum obtained in the transforming.
The predicting of the envelope may include: predicting energy from the low-frequency spectrum of the speech signal; predicting a shape from the low-frequency spectrum of the speech signal; and calculating the envelope by using the predicted energy and the predicted shape.
The predicting of the energy may include applying a limiter to the predicted energy.
The predicting of the shape may include predicting each of a voiced shape and an unvoiced shape and predicting the shape from the voiced shape and the unvoiced shape based on the class and a voicing level.
The predicting of the shape may include: configuring an initial shape for the high-frequency spectrum from the low-frequency spectrum of the speech signal; and shape-rotating the initial shape.
The predicting of the shape may further include adjusting dynamics of the rotated initial shape.
The method may further include equalizing at least one of the low-frequency spectrum and the high-frequency spectrum.
The method may further include: equalizing at least one of the low-frequency spectrum and the high-frequency spectrum; inverse-transforming the equalized spectrum into a signal in the time domain; and post-processing the signal transformed into the time domain.
The equalizing and the inverse-transforming into the time domain may be performed on a sub-frame basis, and the post-processing may be performed on a sub-sub-frame basis.
The post-processing may include: calculating low-frequency energy and high-frequency energy; estimating a gain for matching the low-frequency energy and the high-frequency energy; and applying the estimated gain to a high-frequency time-domain signal.
The estimating of the gain may include limiting the estimated gain to a predetermined threshold if the estimated gain is greater than the threshold.
According to a second aspect of the present disclosure, a method of enhancing speech quality includes: determining a class of a decoded speech signal from a feature of the speech signal; generating a modified low-frequency spectrum by mixing a low-frequency spectrum and random noise based on the class; predicting an envelope of a high-frequency band from the low-frequency spectrum based on the class; applying the predicted envelope to a high-frequency spectrum generated from the modified low-frequency spectrum; and generating a bandwidth-extended speech signal by using the decoded speech signal and the envelope-applied high-frequency spectrum.
The generating of the modified low-frequency spectrum may include: determining a first weighting based on a prediction error; predicting a second weighting based on the first weighting and the class; whitening the low-frequency spectrum based on the second weighting; and generating the modified low-frequency spectrum by mixing the whitened low-frequency spectrum and random noise based on the second weighting.
Each operation may be performed on a sub-frame basis.
The class may include a plurality of candidate classes based on low-frequency energy.
According to a third aspect of the present disclosure, an apparatus for enhancing speech quality includes a processor, wherein the processor determines a class of a decoded speech signal from a feature of the speech signal, generates a modified low-frequency spectrum by mixing a low-frequency spectrum and random noise based on the class, predicts an envelope of a high-frequency band from the low-frequency spectrum based on the class, applies the predicted envelope to a high-frequency spectrum generated from the modified low-frequency spectrum, and generates a bandwidth-extended speech signal by using the decoded speech signal and the envelope-applied high-frequency spectrum.
According to a fourth aspect of the present disclosure, a speech decoding apparatus includes: a speech decoder configured to decode an encoded bitstream; and a post-processor configured to generate bandwidth-extended wideband speech data from decoded speech data, wherein the post-processor determines a class of a decoded speech signal from a feature of the speech signal, generates a modified low-frequency spectrum by mixing a low-frequency spectrum and random noise based on the class, predicts an envelope of a high-frequency band from the low-frequency spectrum based on the class, applies the predicted envelope to a high-frequency spectrum generated from the modified low-frequency spectrum, and generates a bandwidth-extended speech signal by using the decoded speech signal and the envelope-applied high-frequency spectrum.
According to a fifth aspect of the present disclosure, a multimedia device includes: a communication unit configured to receive an encoded speech packet; a speech decoder configured to decode the received speech packet; and a post-processor configured to generate bandwidth-extended wideband speech data from the decoded speech data, wherein the post-processor determines a class of a decoded speech signal from a feature of the speech signal, generates a modified low-frequency spectrum by mixing a low-frequency spectrum and random noise based on the class, predicts an envelope of a high-frequency band from the low-frequency spectrum based on the class, applies the predicted envelope to a high-frequency spectrum generated from the modified low-frequency spectrum, and generates a bandwidth-extended speech signal by using the decoded speech signal and the envelope-applied high-frequency spectrum.
A decoding end may obtain a bandwidth-extended wideband signal from a narrowband speech signal without changing the configuration of a standardized codec, and thus a restored signal with enhanced speech quality may be generated.
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure belongs may easily realize the embodiments. However, the embodiments may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. In addition, parts irrelevant to the description are omitted to clearly describe the embodiments, and like reference numerals denote like elements throughout the present disclosure.
Throughout the present disclosure, when it is described that a certain part is “connected” to another part, it should be understood that the certain part may be connected to the other part “electrically or physically”, including via an intervening part. In addition, when a certain part “includes” a certain component, this indicates that the part may further include other components rather than excluding them, unless otherwise disclosed.
Hereinafter, the embodiments are described in detail with reference to the accompanying drawings.
The apparatus 100 shown in
Referring to
The post-processor 130 may perform post-processing for speech quality enhancement on the decoded speech data provided from the decoding unit 110. According to an embodiment, the post-processor 130 may include a wideband bandwidth extension module. The post-processor 130 may increase the naturalness and sense of realism of speech by extending the bandwidth of the speech data, which has been decoded by the decoding unit 110 using the narrowband codec, into a wideband. The bandwidth extension processing applied to the post-processor 130 may be broadly divided into a guided scheme, in which additional information for the bandwidth extension processing is provided from a transmission end, and a non-guided scheme, i.e., a blind scheme, in which no such additional information is provided from the transmission end. The guided scheme may require a change in the configuration of a codec for a call at the transmission end. The blind scheme, however, may enhance speech quality by changing a post-processing portion at a reception end, without changing the configuration of the codec for a call at the transmission end.
The device 200 shown in
Referring to
The decoding unit 250 may decode the received speech communication call packet or the encoded bitstream. The decoding unit 250 may provide decoded speech data to the post-processor 270. The decoding unit 250 may use a standardized codec but is not limited thereto. According to an embodiment, the decoding unit 250 may include a narrowband codec, and an example of the narrowband codec is an AMR codec.
The post-processor 270 may perform post-processing for speech quality enhancement on the decoded speech data provided from the decoding unit 250. According to an embodiment, the post-processor 270 may include a wideband bandwidth extension module. The post-processor 270 may increase the naturalness and sense of realism of speech by extending the bandwidth of the speech data, which has been decoded by the decoding unit 250 using the narrowband codec, into a wideband. The bandwidth extension processing performed by the post-processor 270 may be broadly divided into the guided scheme, in which additional information for the bandwidth extension processing is provided from a transmission end, and the non-guided scheme, i.e., the blind scheme, in which no such additional information is provided from the transmission end. The guided scheme may require a change in the configuration of a codec for a call at the transmission end. The blind scheme, however, may enhance speech quality by changing post-processing at a reception end, without changing the configuration of the codec for a call at the transmission end. The post-processor 270 may transform the bandwidth-extended speech data into an analog signal.
The output unit 290 may output the analog signal provided from the post-processor 270. The output unit 290 may be replaced with a receiver, a speaker, earphones, or headphones. The output unit 290 may be connected to the post-processor 270 in a wired or wireless manner.
The apparatus 300 shown in
Referring to
The signal classifier 320 may determine a type or class by classifying the speech signal based on a feature of the speech signal. As the feature of the speech signal, either or both of a time-domain feature and a frequency-domain feature may be used. The time-domain feature and the frequency-domain feature may include a plurality of well-known parameters.
The low-frequency spectrum modifier 330 may modify the frequency-domain signal, i.e., a low-frequency spectrum or a low-frequency excitation spectrum, from the transformer 310 based on the class of the speech signal.
The high-frequency spectrum generator 340 may generate a high-frequency spectrum by obtaining a high-frequency excitation spectrum from the modified low-frequency spectrum or low-frequency excitation spectrum, predicting an envelope from the low-frequency spectrum based on the class of the speech signal, and applying the predicted envelope to the high-frequency excitation spectrum.
The equalizer 350 may equalize the generated high-frequency spectrum.
The time-domain post-processor 360 may transform the equalized high-frequency spectrum into a high-frequency time-domain signal, generate a wideband speech signal, i.e., an enhanced speech signal, by combining the high-frequency time-domain signal and a low-frequency time-domain signal, and perform post-processing such as filtering.
The apparatus 400 shown in
Referring to
The transformer 433 may generate a frequency-domain signal, i.e., a low-frequency spectrum, by transforming the up-sampled signal. The transform may be modified discrete cosine transform (MDCT), fast Fourier transform (FFT), modified discrete cosine transform and modified discrete sine transform (MDCT+MDST), quadrature mirror filter (QMF), or the like but is not limited thereto. Herein, the low-frequency spectrum may indicate a low-band or core spectrum.
The signal classifier 435 may extract a feature of a signal by receiving the up-sampled signal and the frequency-domain signal and determine a class, i.e., a type, of the speech signal based on the extracted feature. Since the up-sampled signal is a time-domain signal, the signal classifier 435 may extract a feature of each of the time-domain signal and the frequency-domain signal. Class information generated by the signal classifier 435 may be provided to the low-frequency spectrum modifier 437 and the envelope predictor 441.
The low-frequency spectrum modifier 437 may receive the frequency-domain signal provided from the transformer 433 and modify the received frequency-domain signal into a low-frequency spectrum, which is a signal suitable for bandwidth extension processing, based on the class information provided from the signal classifier 435. The low-frequency spectrum modifier 437 may provide the modified low-frequency spectrum to the high-frequency excitation generator 439. Herein, a low-frequency excitation spectrum may be used instead of the low-frequency spectrum.
The high-frequency excitation generator 439 may generate a high-frequency excitation spectrum by using the modified low-frequency spectrum. Specifically, the modified low-frequency spectrum may be obtained from an original low-frequency spectrum, and the high-frequency excitation spectrum may be a spectrum simulated based on the modified low-frequency spectrum. Herein, the high-frequency excitation spectrum may indicate a high-band excitation spectrum.
The envelope predictor 441 may receive the frequency-domain signal provided from the transformer 433 and the class information provided from the signal classifier 435 and predict an envelope.
The envelope application unit 443 may generate a high-frequency spectrum by applying the envelope provided from the envelope predictor 441 to the high-frequency excitation spectrum provided from the high-frequency excitation generator 439.
The equalizer 445 may receive the high-frequency spectrum provided from the envelope application unit 443 and equalize a high-frequency band. Alternatively, the low-frequency spectrum from the transformer 433 may also be input to the equalizer 445 through various routes. In this case, the equalizer 445 may selectively equalize a low-frequency band and the high-frequency band or equalize a full band. The equalizing may use various well-known methods. For example, adaptive equalizing for each band may be performed.
The inverse transformer 447 may generate a time-domain signal by inverse-transforming the high-frequency spectrum provided from the equalizer 445. Alternatively, the equalized low-frequency spectrum from the transformer 433 may also be provided to the inverse transformer 447. In this case, the inverse transformer 447 may generate a low-frequency time-domain signal and a high-frequency time-domain signal by individually inverse-transforming the low-frequency spectrum and the high-frequency spectrum. According to an embodiment, as the low-frequency time-domain signal, the signal of the up-sampler 431 may be used as it is, and the inverse transformer 447 may generate only the high-frequency time-domain signal. In this case, since the low-frequency time-domain signal is the same as an original speech signal, the low-frequency time-domain signal may be processed without the occurrence of a delay.
The time-domain post-processor 449 may suppress noise by post-processing the low-frequency time-domain signal and the high-frequency time-domain signal provided from the inverse transformer 447 and generate a wideband time-domain signal by synthesizing the post-processed low-frequency time-domain signal and high-frequency time-domain signal. The signal generated by the time-domain post-processor 449 may have a sampling rate of 2*N kHz or M*N kHz (where M is 2 or greater). The time-domain post-processor 449 may be optionally included. According to an embodiment, both the low-frequency time-domain signal and the high-frequency time-domain signal may be equalized signals. According to another embodiment, the low-frequency time-domain signal may be an original narrowband signal, and the high-frequency time-domain signal may be an equalized signal.
According to an embodiment, even when no information about a high-frequency band is provided from an AMR bitstream, a high-frequency spectrum may be generated through prediction from a narrowband spectrum.
Referring to
Referring to
The signal classification module 700 shown in
Referring to
The time-domain feature extractor 730 may extract a time-domain feature from the time-domain signal provided from the up-sampler (431 of
The class determiner 750 may generate class information by determining a class of a speech signal, e.g., a class of a current sub-frame, from the frequency-domain feature and the time-domain feature. The class information may include a single class or a plurality of candidate classes. In addition, the class determiner 750 may obtain a voicing level from the class determined with respect to the current sub-frame. The determined class may be a class having the highest probability value. According to an embodiment, a voicing level is mapped for each class, and a voicing level corresponding to the determined class may be obtained. Alternatively, a final voicing level of the current sub-frame may be obtained by using the voicing level of the current sub-frame and a voicing level of at least one previous sub-frame.
An operation of each component is described in more detail as follows.
Examples of the feature extracted by the frequency-domain feature extractor 710 include a centroid C and an energy quotient E, but the feature is not limited thereto.
The centroid C may be defined by Equation 1.
where x denotes a spectral coefficient.
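Equation 1 is not reproduced in this text, so the following Python sketch is illustrative only. It assumes one common form of a spectral centroid, namely the magnitude-weighted mean bin index of the spectral coefficients x. The function name `spectral_centroid` and the convention of returning 0 for a silent sub-frame are hypothetical choices, not taken from the disclosure.

```python
from typing import Sequence


def spectral_centroid(x: Sequence[float]) -> float:
    """Magnitude-weighted mean bin index of the spectral coefficients x.

    Assumed form; the disclosure's Equation 1 may differ.
    """
    total = sum(abs(v) for v in x)
    if total == 0.0:
        # Silent sub-frame: the centroid is undefined, return 0 by convention.
        return 0.0
    return sum(i * abs(v) for i, v in enumerate(x)) / total
```

Under this assumed definition, a spectrum whose energy sits in the upper bins yields a larger centroid than one concentrated in the lower bins.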
The energy quotient E may be defined as the ratio of short-term energy $E_{Short}$ to long-term energy $E_{Long}$, as in Equation 2.
$$E = \frac{E_{Short}}{E_{Long}} \qquad (2)$$
Herein, both the short-term energy and the long-term energy may be determined based on a history up to a previous sub-frame. The short term and the long term differ in how strongly the current sub-frame contributes to the energy. For example, the long-term energy may be obtained by multiplying the average energy up to the previous sub-frame by a higher rate than is used for the short-term energy. That is, the long term is designed such that the energy of the current sub-frame is reflected less, and the short term is designed such that the energy of the current sub-frame is reflected more.
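The relationship between the two energies can be sketched with recursive averaging, in which the long-term tracker gives the history a higher weight than the short-term tracker does. This is a minimal sketch under stated assumptions: the class name `EnergyTracker` and the smoothing factors `alpha_short` and `alpha_long` (and their values) are hypothetical, and the disclosure does not specify the exact update rule.

```python
from dataclasses import dataclass


@dataclass
class EnergyTracker:
    # Hypothetical smoothing factors: weight given to the history.
    alpha_short: float = 0.5    # short term: current sub-frame reflected more
    alpha_long: float = 0.95    # long term: current sub-frame reflected less
    e_short: float = 0.0
    e_long: float = 0.0

    def update(self, sub_frame_energy: float) -> float:
        """Update both averages and return the energy quotient E."""
        self.e_short = (self.alpha_short * self.e_short
                        + (1.0 - self.alpha_short) * sub_frame_energy)
        self.e_long = (self.alpha_long * self.e_long
                       + (1.0 - self.alpha_long) * sub_frame_energy)
        if self.e_long <= 0.0:
            return 0.0  # no history yet: quotient undefined, return 0
        return self.e_short / self.e_long
```

With these assumed factors, the quotient settles near 1 for a stationary signal and rises above 1 at an onset, because the short-term energy reacts faster than the long-term energy.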
An example of the feature extracted by the time-domain feature extractor 730 is a gradient index G, but the feature is not limited thereto.
The gradient index G may be defined by Equation 3
where t denotes a time-domain signal and sign denotes +1 when the signal is 0 or greater and −1 when the signal is less than 0.
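Equation 3 is not reproduced in this text. The sketch below therefore assumes a gradient index definition commonly used in the bandwidth-extension literature: the sum of gradient magnitudes at each change of gradient direction, normalized by the square root of the signal energy. The `sign` helper follows the convention stated above (+1 for values of 0 or greater, −1 otherwise); the function names are hypothetical, and the disclosure's own equation may differ.

```python
import math
from typing import Sequence


def sign(v: float) -> int:
    """+1 when v is 0 or greater, -1 when v is less than 0."""
    return 1 if v >= 0 else -1


def gradient_index(t: Sequence[float]) -> float:
    """Assumed gradient index of a time-domain signal t."""
    energy = sum(v * v for v in t)
    if len(t) < 3 or energy == 0.0:
        return 0.0
    total = 0.0
    for n in range(2, len(t)):
        d_cur = t[n] - t[n - 1]
        d_prev = t[n - 1] - t[n - 2]
        # psi is 1 when the gradient changes direction, 0 otherwise.
        psi = 0.5 * abs(sign(d_cur) - sign(d_prev))
        total += psi * abs(d_cur)
    return total / math.sqrt(energy)
```

A rapidly alternating (noise-like) signal gives a high value under this definition, while a smooth ramp gives 0.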
The class determiner 750 may determine a class of the speech signal from at least one frequency-domain feature and at least one time-domain feature. According to an embodiment, a well-known Gaussian mixture model (GMM) based on low-frequency energy may be used to determine the class. The class determiner 750 may decide one class for each sub-frame or derive a plurality of candidate classes based on soft decision. According to an embodiment, the decision depends on the low-frequency energy: when the low-frequency energy is a specific value or less, one class is decided, and when the low-frequency energy exceeds the specific value, a plurality of candidate classes may be derived. Herein, the low-frequency energy may indicate narrowband energy or energy of a specific frequency band or less. The plurality of candidate classes may include, for example, a class having the highest probability value and classes adjacent to the class having the highest probability value. When the plurality of candidate classes are selected, each class has a probability value, and thus a prediction value is calculated in consideration of the probability values. A voicing level mapped to the single class or the class having the highest probability value may be used. Energy prediction may be performed based on the candidate classes and the probability values of the candidate classes. Prediction may be performed for each candidate class, and a final prediction value may be determined by weighting the prediction value obtained for each candidate class by the probability value of that class.
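The switch between a single hard decision and a set of soft candidates can be sketched as follows. This is an illustration under assumptions: class probabilities are taken as already computed (e.g., by a GMM), "adjacent" is interpreted as neighbouring class indices, and the candidate probabilities are renormalized to sum to 1. The function name `select_candidates` and the parameter `energy_threshold` are hypothetical.

```python
from typing import Dict, List


def select_candidates(probs: List[float],
                      low_freq_energy: float,
                      energy_threshold: float) -> Dict[int, float]:
    """Return {class index: probability} for the current sub-frame."""
    best = max(range(len(probs)), key=probs.__getitem__)
    if low_freq_energy <= energy_threshold:
        # Low energy: decide a single class (hard decision).
        return {best: 1.0}
    # Higher energy: keep the best class and its index neighbours
    # (assumed meaning of "adjacent"), then renormalize (assumption).
    idx = [i for i in (best - 1, best, best + 1) if 0 <= i < len(probs)]
    total = sum(probs[i] for i in idx)
    return {i: probs[i] / total for i in idx}
```

The returned probabilities can then be used as weights when per-class prediction values are combined.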
The envelope prediction module 800 shown in
Referring to
The shape predictor 830 may predict a shape of the high-frequency spectrum from the frequency-domain signal, i.e., the low-frequency spectrum, based on the class information and voicing level information. The shape predictor 830 may predict a shape for each of voiced speech and unvoiced speech. An embodiment of the shape predictor 830 will be described in more detail with reference to
An energy predictor 900 shown in
Referring to
$$\tilde{E} = \sum_{j} \tilde{E}_j \cdot \mathrm{prob}_j \qquad (4)$$
Specifically, the final predicted energy $\tilde{E}$ may be obtained by predicting $\tilde{E}_j$ for each of a plurality of candidate classes, multiplying $\tilde{E}_j$ by the determined probability value $\mathrm{prob}_j$, and then summing the products over the plurality of candidate classes. To this end, $\tilde{E}_j$ may be predicted by obtaining a basis including a codebook set for each class, a low-frequency envelope extracted from a current sub-frame, and a standard deviation of the low-frequency envelope, and multiplying the obtained basis by a matrix stored for each class.
The low-frequency envelope Env(i) may be defined by Equation 5. That is, energy may be predicted by using log energy for each sub-band of a low frequency and a standard deviation.
$\tilde{E}$ may be obtained by Equation 4 using the obtained $\tilde{E}_j$.
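Equation 4 is a probability-weighted sum over the candidate classes. A minimal Python sketch follows; the function name `predict_energy` and the dictionary-based interface are hypothetical, and the per-class predictions are taken as given rather than computed from a basis and a stored matrix.

```python
from typing import Dict


def predict_energy(per_class_energy: Dict[int, float],
                   probs: Dict[int, float]) -> float:
    """Final predicted energy: sum over candidates j of E_j * prob_j."""
    return sum(per_class_energy[j] * p for j, p in probs.items())
```

When only a single class is decided, its probability is 1 and the result reduces to that class's predicted energy.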
The limiter application unit 930 may apply a limiter to the predicted energy $\tilde{E}$ provided from the first predictor 910 to suppress noise that may occur when the value of $\tilde{E}$ is too great. In this case, as energy acting as the limiter, a linear envelope defined by Equation 6 may be used instead of a log-domain envelope.
A basis may be configured by obtaining a plurality of centroids C defined by Equation 7 from the linear envelope obtained from Equation 6.
where CLB denotes a centroid value calculated by the frequency-domain feature extractor 710 of
The energy smoothing unit 950 performs energy-smoothing by reflecting a plurality of energy values predicted in a previous sub-frame to the predicted energy provided from the limiter application unit 930. As an example of the smoothing, a predicted energy difference between the previous sub-frame and the current sub-frame may be restricted within a predetermined range. The energy smoothing unit 950 may be optionally included.
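The smoothing example above, in which the change in predicted energy between consecutive sub-frames is kept within a predetermined range, can be sketched as a simple clamp. The function name `smooth_energy` and the parameter `max_step` are hypothetical, and only a single previous value is used here for brevity, whereas the disclosure mentions a plurality of previous energy values.

```python
def smooth_energy(predicted: float, previous: float, max_step: float) -> float:
    """Restrict the predicted energy to within max_step of the previous value."""
    lower = previous - max_step
    upper = previous + max_step
    return min(max(predicted, lower), upper)
```

A prediction that jumps far above or below the previous sub-frame's value is pulled back to the edge of the allowed range, while a prediction already inside the range passes through unchanged.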
A shape predictor 830 shown in
Referring to
The unvoiced shape predictor 1030 may predict an unvoiced shape of the high-frequency band by using the low-frequency linear envelope, i.e., the low-frequency shape, and adjust the unvoiced shape according to a shape comparison result between a low-frequency part and a high-frequency part in the high-frequency band.
The second predictor 1050 may predict a shape of a high-frequency spectrum by mixing the voiced shape and the unvoiced shape at a ratio based on a voicing level.
Referring back to
The envelope post-processor 870 may post-process the envelope provided from the envelope calculator 850. As an example of the post-processing, an envelope of a start portion of a high frequency may be adjusted by considering an envelope of an end portion of a low frequency at a boundary between the low frequency and the high frequency. The envelope post-processor 870 may be optionally included.
Referring to
In an unvoiced shape generation step 1150, an unvoiced shape is basically generated through transposing, and if a comparison within the high-frequency band shows that the shape of a high-frequency part is greater than the shape of a low-frequency part, the shape of the high-frequency part may be reduced. As a result, the possibility that noise occurs due to a relative increase in the shape of the high-frequency part in the high-frequency band may be reduced.
In a mixing step 1170, a predicted shape of a high-frequency spectrum may be generated by mixing the generated voiced shape and the generated unvoiced shape based on a voicing level. Herein, a mixing ratio may be determined by using the voicing level. The predicted shape may be provided to the envelope calculator 850 of
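The mixing step can be sketched as a per-band interpolation between the two shapes. This is a sketch under assumptions: the voicing level is taken to lie in [0, 1] and to act directly as the mixing ratio, which the disclosure does not state explicitly. The function name `mix_shapes` is hypothetical.

```python
from typing import List, Sequence


def mix_shapes(voiced: Sequence[float],
               unvoiced: Sequence[float],
               voicing_level: float) -> List[float]:
    """Blend voiced and unvoiced shapes band by band."""
    if len(voiced) != len(unvoiced):
        raise ValueError("shapes must cover the same number of bands")
    # Assumed: voicing level clipped to [0, 1]; 1 means fully voiced.
    v = min(max(voicing_level, 0.0), 1.0)
    return [v * a + (1.0 - v) * b for a, b in zip(voiced, unvoiced)]
```

With this assumed ratio, a voicing level of 1 returns the voiced shape, 0 returns the unvoiced shape, and intermediate levels give a proportional blend.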
The module 1200 shown in
Referring to
The weighting predictor 1230 may predict the second weighting of the high-frequency spectrum based on the first weighting of the low-frequency spectrum, which is provided from the weighting calculator 1210.
Specifically, when the high-frequency excitation generator 439 of
$$w_i = g_{i,midx} \cdot w_j \qquad (10)$$
where $g_{i,midx}$ denotes a constant to be multiplied for the band i, determined by a class index midx, and $w_j$ denotes the calculated first weighting of a source band j.
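Equation 10 can be sketched as a table lookup followed by a multiplication. In this illustration, `band_map[i]` gives the source band j for each target band i and `g_table[midx]` holds the per-band constants for class index midx; both structures, and the function name `predict_second_weighting`, are hypothetical, since the disclosure does not specify how the constants or the band mapping are stored.

```python
from typing import Dict, List, Sequence


def predict_second_weighting(first_weighting: Sequence[float],
                             band_map: Sequence[int],
                             g_table: Dict[int, Sequence[float]],
                             class_index: int) -> List[float]:
    """Compute w_i = g_{i,midx} * w_j for every target band i."""
    gains = g_table[class_index]
    return [gains[i] * first_weighting[j] for i, j in enumerate(band_map)]
```

For example, with two source bands and three target bands, the last two target bands may both draw their first weighting from the same source band while using different constants.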
The whitening unit 1250 may whiten the low-frequency spectrum by defining, for each frequency bin of the frequency-domain signal, i.e., the low-frequency spectrum, a whitening envelope in consideration of an ambient spectrum, and multiplying the low-frequency spectrum by the reciprocal of the defined whitening envelope. In this case, the range of the considered ambient spectrum may be determined based on the second weighting of the high-frequency spectrum, which is provided from the weighting predictor 1230. Specifically, the range of the considered ambient spectrum may be determined based on a window obtained by multiplying the size of a basic window by the second weighting, and the second weighting may be obtained from a corresponding target band based on a mapping relationship between a source band and a target band. A rectangular window may be used as the basic window, but the basic window is not limited thereto. The whitening may be performed by obtaining energy within the determined window and scaling the low-frequency spectrum corresponding to a frequency bin based on a square root of the energy.
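The whitening described above can be sketched with a rectangular window whose length is the basic window size multiplied by the second weighting. This is a simplified illustration: a single weighting is applied to all bins (the disclosure obtains it per band), the envelope is taken as the square root of the mean energy in the window (the normalization by window length is an assumption), and the names `whiten` and `base_window` are hypothetical.

```python
import math
from typing import List, Sequence


def whiten(spectrum: Sequence[float],
           base_window: int,
           weighting: float) -> List[float]:
    """Divide each bin by a local envelope taken from neighbouring bins."""
    # Window length = basic window size * second weighting (at least 1 bin).
    half = max(1, round(base_window * weighting)) // 2
    out = []
    for k in range(len(spectrum)):
        lo = max(0, k - half)
        hi = min(len(spectrum), k + half + 1)
        # Mean energy inside the rectangular window (assumed normalization).
        energy = sum(v * v for v in spectrum[lo:hi]) / (hi - lo)
        envelope = math.sqrt(energy)
        out.append(spectrum[k] / envelope if envelope > 0.0 else 0.0)
    return out
```

A larger weighting widens the window, so the envelope follows the spectrum less closely and more of the original fine structure survives the whitening.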
The random noise generator 1270 may generate random noise by various well-known methods.
The weighting application unit 1290 may receive the whitened low-frequency spectrum and the random noise and mix the whitened low-frequency spectrum and the random noise by applying the second weighting of the high-frequency spectrum, thereby generating a modified low-frequency spectrum. As a result, the weighting application unit 1290 may provide the modified low-frequency spectrum to the envelope application unit 443.
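The disclosure does not give the mixing rule, so the following sketch simply assumes a linear blend in which the weighting, clipped to [0, 1], is the share given to the whitened spectrum and the remainder is given to Gaussian noise. The direction of the weighting, the noise distribution, the function name `mix_with_noise`, and the `seed` parameter are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def mix_with_noise(whitened: Sequence[float],
                   weighting: float,
                   seed: int = 0) -> List[float]:
    """Blend a whitened spectrum with random noise (assumed linear mix)."""
    w = min(max(weighting, 0.0), 1.0)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    noise = [rng.gauss(0.0, 1.0) for _ in whitened]
    # Assumed: w weights the whitened spectrum, (1 - w) weights the noise.
    return [w * s + (1.0 - w) * n for s, n in zip(whitened, noise)]
```

Under this assumption, a weighting of 1 leaves the whitened spectrum unchanged, and smaller weightings make the modified spectrum progressively more noise-like.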
The module 1300 shown in
Referring to
According to an example of transposing and folding, which is shown in
The module 1500 shown in
Referring to
The noise reducer 1530 may reduce noise occurring in the silence period by gradually reducing a size of a high-frequency spectrum of the current sub-frame when the current sub-frame is detected as the silence period. To this end, the noise reducer 1530 may apply a noise reduction gain on a sub-frame basis. When a signal of a full band including a low frequency and a high frequency is gradually reduced, the noise reduction gain may be set to converge to a value close to 0. In addition, when a sub-frame in the silence period is changed to a sub-frame in a non-silence period, a magnitude of a signal is gradually increased, and in this case, the noise reduction gain may be set to converge to 1. The noise reducer 1530 may set a rate of the noise reduction gain for gradual reduction to be less than that of the noise reduction gain for gradual increase, such that reduction is slowly achieved, whereas the increase is quickly achieved. Herein, the rate may indicate a magnitude of an increase portion or a reduction portion for each sub-frame when a gain is gradually increased or reduced for each sub-frame. The silence detector 1510 and the noise reducer 1530 may be selectively applied.
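The asymmetric behaviour of the noise reduction gain can be sketched as a per-sub-frame update with a small step toward 0 during silence and a larger step toward 1 otherwise. The function name `update_noise_gain` and the step sizes `down_rate` and `up_rate` are hypothetical values chosen only to show that reduction is slower than increase.

```python
def update_noise_gain(gain: float,
                      is_silence: bool,
                      down_rate: float = 0.05,
                      up_rate: float = 0.25) -> float:
    """Advance the noise reduction gain by one sub-frame."""
    if is_silence:
        # Slow decay toward 0 while the silence period continues.
        return max(0.0, gain - down_rate)
    # Fast recovery toward 1 once a non-silence sub-frame arrives.
    return min(1.0, gain + up_rate)
```

The gain returned for each sub-frame would then scale the spectrum of that sub-frame, so speech onsets are restored quickly while the fade into silence stays gradual.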
The spectrum equalizer 1550 may change the noise-reduced signal provided from the noise reducer 1530 into speech relatively preferred by a user by applying a different equalizer gain for each frequency band or sub-band to the noise-reduced signal. Alternatively, the same equalizer gain may be applied to specific frequency bands or sub-bands. The spectrum equalizer 1550 may apply the same equalizer gain to all signals, i.e., a full frequency band. Alternatively, an equalizer gain for voiced speech and an equalizer gain for unvoiced speech may be set differently, and the two equalizer gains may be mixed by a weighted sum based on a voicing level of a current sub-frame and applied. As a result, the spectrum equalizer 1550 may provide a spectrum of which speech quality has been enhanced and from which noise has been cancelled to the inverse transformer (447 of
The module 1600 shown in
Referring to
The second energy calculator 1630 may calculate high-frequency energy from a high-frequency time-domain signal on a sub-sub-frame basis.
The gain estimator 1650 may estimate a gain to be applied to a current sub-sub-frame to match a ratio between the current sub-sub-frame and a previous sub-sub-frame in the high-frequency energy with a ratio between the current sub-sub-frame and the previous sub-sub-frame in the low-frequency energy. The estimated gain g(i) may be defined by Equation 11.
where $E_H(i)$ and $E_L(i)$ denote the high-frequency energy and the low-frequency energy of an i-th sub-sub-frame, respectively.
To prevent the gain g(i) from becoming too large, a predetermined threshold $g_{th}$ may be used. That is, as in Equation 12 below, when the gain g(i) is greater than the predetermined threshold $g_{th}$, the gain g(i) may be set to the predetermined threshold $g_{th}$.
$$g(i) = \min\left(g(i),\, g_{th}\right) \qquad (12)$$
The gain application unit 1670 may apply the gain estimated by the gain estimator 1650 to the high-frequency time-domain signal.
The combining unit 1690 may generate a bandwidth-extended time-domain signal, i.e., a wideband time-domain signal, by combining the low-frequency time-domain signal and the gain-applied high-frequency time-domain signal.
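The processing chain of the energy calculators, the gain estimator 1650, the gain application unit 1670, and the combining unit 1690 can be sketched as follows. Equations 11 and 12 are not reproduced in this text, so the gain form used here, g(i) = (EL(i)/EL(i−1)) / (EH(i)/EH(i−1)) limited to gth, is an assumption inferred from the description. Applying the square root of the gain to the samples (treating g(i) as an energy-domain ratio), the function names, and the default threshold are likewise assumptions.

```python
import numpy as np


def sub_energies(x, n_sub):
    """Energy of each of n_sub equal-length sub-sub-frames of x."""
    blocks = np.asarray(x, dtype=float).reshape(n_sub, -1)
    return np.sum(blocks ** 2, axis=1)


def estimate_gains(e_low, e_high, g_th, eps=1e-12):
    """Assumed gain: ratio of low-band to high-band energy change, capped at g_th."""
    gains = np.ones(len(e_low))  # first sub-sub-frame has no predecessor here
    for i in range(1, len(e_low)):
        ratio_low = e_low[i] / (e_low[i - 1] + eps)
        ratio_high = e_high[i] / (e_high[i - 1] + eps)
        gains[i] = min(ratio_low / (ratio_high + eps), g_th)
    return gains


def extend_bandwidth(low, high, n_sub, g_th=4.0):
    """Apply the estimated gains to the high band and add the low band."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    gains = estimate_gains(sub_energies(low, n_sub), sub_energies(high, n_sub), g_th)
    # Assumption: gains are energy-domain, so samples are scaled by sqrt(gain).
    scale = np.repeat(np.sqrt(gains), len(high) // n_sub)
    return low + high * scale


if __name__ == "__main__":
    low = [1.0] * 4 + [2.0] * 4   # low-band energy rises between sub-sub-frames
    high = [1.0] * 8              # high-band energy stays flat
    print(extend_bandwidth(low, high, n_sub=2, g_th=10.0))
```

In this sketch the threshold only limits upward corrections, matching the stated purpose of keeping the gain from becoming too large.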
The apparatus 1700 shown in
Referring to
The combining unit 1735 may combine a shifted time-domain signal, i.e., the high-frequency excitation signal, provided from the high-frequency excitation generator 1733 with the up-sampled signal, i.e., the low-frequency signal, and provide the combined signal to the transformer 1737.
The transformer 1737 may generate a frequency-domain signal by transforming the signal in which a low frequency and a high frequency are combined, which is provided from the combiner 1735. The transform may be MDCT, FFT, MDCT+MDST, QMF, or the like but is not limited thereto.
The signal classifier 1739 may extract a time-domain feature by using the low-frequency signal provided from the up-sampler 1731 or the signal in which the low frequency and the high frequency are combined, which is provided from the combiner 1735. The signal classifier 1739 may extract a frequency-domain feature by using a full-band spectrum provided from the transformer 1737. In this case, a low-frequency spectrum may be selectively used from the full-band spectrum. The remaining operations of the signal classifier 1739 may be the same as those of the signal classifier 435 of
The envelope predictor 1741 may predict an envelope of the high frequency by using the low-frequency spectrum as in
According to the embodiment of
A shape predictor 1800 shown in
Referring to
The shape rotation processor 1830 may shape-rotate the initial shape. For the shape rotation, a slope may be defined by Equation 13.
where Env denotes an envelope value for each band, NI denotes a plurality of initial start bands, and NB denotes a full band.
The shape rotation processor 1830 may extract an envelope value from the initial shape and calculate a slope by using the envelope value, to perform the shape rotation.
The shape rotation may be performed by Equation 14, wherein the rotation may be performed by a rotation factor ρ=1−slpIf.
The shape dynamics adjuster 1850 may adjust dynamics of the rotated shape. The dynamics adjustment may be performed by using Equation 15.
Herein, a dynamics adjustment factor d may be defined as d=0.5 slp.
As described above, since the rotation is performed while maintaining a shape of the low frequency, a natural tone may be generated. Particularly, for unvoiced speech, a shape difference between the low frequency and the high frequency may be great, and the dynamics may be adjusted to compensate for this difference.
Referring to
Referring to
In operation 2030, a high-band excitation signal or a high-band excitation spectrum may be generated by using the decoded low-band signal. Herein, the high-band excitation signal may be generated from a narrowband time-domain signal. In addition, the high-band excitation spectrum may be generated from a modified low-band spectrum.
In operation 2050, an envelope of the high-band excitation spectrum may be predicted from the low-band spectrum based on a class of the decoded speech signal. Herein, each class may indicate a mute speech, background noise, a weak speech signal, a voiced speech, or an unvoiced speech but is not limited thereto.
In operation 2070, a high-band spectrum may be generated by applying the predicted envelope to the high-band excitation spectrum.
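Operation 2070 can be illustrated with the sketch below. The text does not specify how the envelope is applied, so this example assumes that the predicted envelope holds one target RMS value per band and that each band of the excitation spectrum is normalized to unit RMS before scaling. The function name and the `band_edges` layout are hypothetical.

```python
import numpy as np


def apply_envelope(excitation, envelope, band_edges, eps=1e-12):
    """Shape a high-band excitation spectrum with a per-band envelope.

    Assumption: envelope[b] is the target RMS of band
    [band_edges[b], band_edges[b + 1]); the fine structure of the
    excitation within each band is kept.
    """
    excitation = np.asarray(excitation, dtype=float)
    out = np.zeros_like(excitation)
    for b, (start, stop) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        band = excitation[start:stop]
        rms = np.sqrt(np.mean(band ** 2)) + eps  # eps avoids division by zero
        out[start:stop] = band / rms * envelope[b]
    return out


if __name__ == "__main__":
    shaped = apply_envelope([1.0, -1.0, 2.0, 2.0], [3.0, 0.5], [0, 2, 4])
    print(shaped)
```

Normalizing before scaling keeps the resulting band energy determined by the predicted envelope rather than by the level of the excitation.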
In operation 2090, at least one of the low-band signal and the high-band signal may be equalized. According to an embodiment, only the high-band signal may be equalized, or a full band may be equalized.
A wideband signal may be obtained by synthesizing the low-band signal and the high-band signal. Herein, the low-band signal may be the decoded speech signal or a signal which has been equalized and then transformed into the time domain. The high-band signal may be a signal to which the predicted envelope has been applied and which has then been transformed into the time domain, or a signal which has been equalized and then transformed into the time domain.
In the embodiments, since a frequency-domain signal may be separated for each frequency band, a low-frequency band or a high-frequency band may be separated from a full-band spectrum and used to predict an envelope or apply an envelope according to circumstances.
One or more embodiments may be implemented in a form of a recording medium including computer-executable instructions such as a program module executed by a computer system. A non-transitory computer-readable medium may be an arbitrary available medium which may be accessed by a computer system and includes all types of volatile and nonvolatile media and removable and non-removable media. In addition, the non-transitory computer-readable medium may include all types of computer storage media and communication media. The computer storage media include all types of volatile and nonvolatile and removable and non-removable media implemented by an arbitrary method or technique for storing information such as computer-readable instructions, a data structure, a program module, or other data. The communication media typically include computer-readable instructions, a data structure, a program module, other data of a modulated signal such as a carrier, other transmission mechanisms, and arbitrary information delivery media.
In addition, in the present disclosure, a term such as “ . . . unit” or “ . . . module” may indicate a hardware component such as a circuit and/or a software component executed by a hardware component such as a circuit.
The embodiments described above are only illustrative, and it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without changing the technical spirit and mandatory features of the present disclosure. Therefore, the embodiments should be understood in the illustrative sense only and not for the purpose of limitation in all aspects. For example, each component described as a single type may be carried out by being distributed, and likewise, components described as a distributed type may also be carried out by being coupled.
The scope of the present disclosure is defined not by the detailed description but by the appended claims, and all changed or modified forms derived from the meaning and scope of the claims and their equivalent concept will be construed as being included in the scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
10-2014-0106601 | Aug 2014 | KR | national |
This application is a National stage entry of International Application No. PCT/KR2015/008567 filed on Aug. 17, 2015, which claims priority from U.S. Provisional Application No. 62/114,752 filed on Feb. 11, 2015 and Korean Patent Application No. 10-2014-0106601 filed on Aug. 15, 2014. The disclosures of each of the applications are herein incorporated by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/KR2015/008567 | 8/17/2015 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2016/024853 | 2/18/2016 | WO | A |
Number | Date | Country | |
---|---|---|---|
20170236526 A1 | Aug 2017 | US |
Number | Date | Country | |
---|---|---|---|
62114752 | Feb 2015 | US |