1. Field
An apparatus and a method consistent with exemplary embodiments broadly relate to an audio correction apparatus and an audio correction method thereof, and more particularly, to an audio correction apparatus which detects onset information and pitch information of audio data and corrects the audio data according to onset information and pitch information of reference audio data, and an audio correction method thereof.
2. Description of Related Art
Techniques for correcting a song which is sung poorly by an ordinary person, based on a score, are known. In particular, a related-art method is known which corrects the pitch of a song sung by a person according to the pitch of a score.
However, a song which is sung by a person or a sound which is generated when a string instrument is played includes a soft onset in which notes are connected with one another. That is, in the case of a song which is sung by a person or a sound which is generated when a string instrument is played, when only the pitch is corrected without searching for the onset, which is a start point of each note, there may be a problem that a note is lost in the middle of the song or performance or the pitch is corrected from a wrong note.
An aspect of exemplary embodiments is to provide an audio correction apparatus, which detects an onset and pitch of audio data and corrects the audio data according to the onset and pitch of reference audio data, and an audio correction method.
According to an aspect of an exemplary embodiment, an audio correction method includes: receiving audio data; detecting onset information by analyzing harmonic components of the received audio data; detecting pitch information of the received audio data based on the detected onset information; aligning the received audio data with the reference audio data based on the detected onset information and the detected pitch information; and correcting the aligned audio data to match the reference audio data.
The detecting the onset information may include: cepstral analyzing the received audio data; analyzing the harmonic components of the cepstral-analyzed audio data; and detecting the onset information based on the analyzing of the harmonic components.
The detecting the onset information may include: cepstral analyzing the received audio data; selecting a harmonic component of a current frame using a pitch component of a previous frame; calculating cepstral coefficients with respect to a plurality of harmonic components using the selected harmonic component of the current frame and the harmonic component of the previous frame; generating a detection function by calculating a sum of the calculated cepstral coefficients of the plurality of harmonic components; extracting an onset candidate group by detecting a peak of the generated detection function; and detecting the onset information by removing a plurality of adjacent onsets from the extracted onset candidate group.
The calculating may include: determining whether the previous frame has the harmonic component; in response to the determining yielding that the harmonic component of the previous frame exists, calculating a high cepstral coefficient; and, in response to the determining yielding that no harmonic component of the previous frame exists, calculating a low cepstral coefficient.
The detecting the pitch information may include detecting the pitch information between the detected onset components using a correntropy pitch detection method.
The aligning may include comparing the received audio data with the reference audio data and aligning the received audio data with the reference audio data using a dynamic time warping method.
The aligning may include calculating an onset correction ratio and a pitch correction ratio of the received audio data to correspond to the reference audio data.
The correcting may include correcting the aligned audio data based on the calculated onset correction ratio and the pitch correction ratio.
The correcting may include correcting the aligned audio data by preserving a formant of the audio data using a SOLA method.
According to another aspect of an exemplary embodiment, an audio correction apparatus includes: an inputter configured to receive audio data; an onset detector configured to detect onset information by analyzing harmonic components of the audio data; a pitch detector configured to detect pitch information of the audio data based on the detected onset information; an aligner configured to align the audio data with reference audio data based on the onset information and the pitch information; and a corrector configured to correct the audio data, which is aligned with the reference audio data by the aligner, to match the reference audio data.
The onset detector may detect the onset information by cepstral analyzing the audio data and by analyzing the harmonic components of the cepstral-analyzed audio data.
The onset detector may include: a cepstral analyzer to perform a cepstral analysis of the audio data; a selector to select a harmonic component of a current frame using a pitch component of a previous frame; a coefficient calculator to calculate cepstral coefficients of a plurality of harmonic components using the selected harmonic component of the current frame and the harmonic component of the previous frame; a function generator to generate a detection function by calculating a sum of the cepstral coefficients of the plurality of harmonic components calculated by the coefficient calculator; an onset candidate group extractor to extract an onset candidate group by detecting a peak of the detection function generated by the function generator; and an onset information detector to detect the onset information by removing a plurality of adjacent onsets from the onset candidate group extracted by the onset candidate group extractor.
The audio correction apparatus may further include a harmonic component determiner to determine whether the previous frame has the harmonic component. In response to the harmonic component determiner determining that the harmonic component of the previous frame exists, the coefficient calculator may calculate a high cepstral coefficient, and, in response to the harmonic component determiner determining that no harmonic component of the previous frame exists, the coefficient calculator may calculate a low cepstral coefficient.
The pitch detector may detect the pitch information between the detected onset components using a correntropy pitch detection method.
The aligner may compare the audio data with the reference audio data and align the audio data with the reference audio data using a dynamic time warping method.
The aligner may calculate an onset correction ratio and a pitch correction ratio of the audio data with respect to the reference audio data.
The corrector may correct the audio data according to the calculated onset correction ratio and the calculated pitch correction ratio.
The corrector may correct the audio data by preserving a formant of the audio data using a SOLA method.
According to one or more exemplary embodiments, an onset detection method of an audio correction apparatus may include: performing cepstral analysis with respect to the audio data; selecting a harmonic component of a current frame using a pitch component of a previous frame; calculating cepstral coefficients with respect to a plurality of harmonic components using the harmonic component of the current frame and the harmonic component of the previous frame; generating a detection function by calculating a sum of the cepstral coefficients of the plurality of harmonic components; extracting an onset candidate group by detecting a peak of the detection function; and detecting the onset information by removing a plurality of adjacent onsets from the onset candidate group.
According to the above-described various exemplary embodiments, an onset can be detected from audio data in which the onsets are not clearly distinguished, such as a song which is sung by a person or a sound of a string instrument, and thus the audio data can be corrected more precisely.
These and/or other aspects will become apparent and more readily appreciated from the following description of the exemplary embodiments, taken in conjunction with the accompanying drawings in which:
Hereinafter, exemplary embodiments will be explained in detail with reference to the accompanying drawings.
First, the audio correction apparatus receives an input of audio data (in operation S110). According to an exemplary embodiment, the audio data may be data which includes a song which is sung by a person or a sound which is made by a musical instrument.
The audio correction apparatus may detect onset information by analyzing harmonic components (in operation S120). The onset generally refers to a point where a musical note starts. However, the onset in a human voice may not be clear, as in the case of glissandos, portamenti, and slurs. Therefore, according to an exemplary embodiment, an onset included in a song which is sung by a person may refer to a point where a vowel starts.
In particular, the audio correction apparatus may detect the onset information using a Harmonic Cepstrum Regularity (HCR) method. The HCR method detects onset information by performing cepstral analysis with respect to audio data and analyzing harmonic components of the cepstral-analyzed audio data.
The method for the audio correction apparatus to detect the onset information by analyzing the harmonic components according to an exemplary embodiment will be explained in detail with reference to
First, the audio correction apparatus performs cepstral analysis with respect to the input audio data (in operation S121). Specifically, the audio correction apparatus may perform pre-processing, such as pre-emphasis, with respect to the input audio data. The audio correction apparatus then performs a fast Fourier transform (FFT) with respect to the input audio data. The audio correction apparatus may then calculate the logarithm of the transformed audio data, and may perform the cepstral analysis by performing a discrete cosine transform (DCT) with respect to the result.
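The sequence of operation S121 can be illustrated with a minimal sketch. The function names, the pre-emphasis coefficient of 0.97, the use of a naive DFT in place of an FFT, the unnormalized DCT-II, and the logarithm floor are all assumptions made for illustration and are not taken from the exemplary embodiment.

```python
import cmath
import math


def pre_emphasis(frame, alpha=0.97):
    """First-order high-pass filter; alpha is an assumed, commonly used value."""
    return [frame[0]] + [frame[n] - alpha * frame[n - 1] for n in range(1, len(frame))]


def magnitude_spectrum(frame):
    """Naive DFT magnitude (an FFT yields the same values, only faster)."""
    n = len(frame)
    half = n // 2 + 1
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(half)
    ]


def dct_ii(values):
    """Unnormalized DCT-II of a real sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstrum(frame, floor=1e-10):
    """Pre-emphasis, DFT magnitude, logarithm, then DCT."""
    spectrum = magnitude_spectrum(pre_emphasis(frame))
    log_spectrum = [math.log(max(m, floor)) for m in spectrum]
    return dct_ii(log_spectrum)


# A 64-sample frame of a 440 Hz tone sampled at 8 kHz (illustrative values).
frame = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(64)]
coefficients = cepstrum(frame)
```

Each index of the resulting list corresponds to a quefrency, which is the domain in which the harmonic component of the next operation is selected.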
In addition, the audio correction apparatus selects a harmonic component of a current frame (in operation S122). Specifically, the audio correction apparatus may detect pitch information of a previous frame and select a harmonic quefrency which is a harmonic component of a current frame using the pitch information of the previous frame.
In addition, the audio correction apparatus calculates a cepstral coefficient with respect to a plurality of harmonic components using the harmonic component of the current frame and the harmonic component of the previous frame (in operation S123). According to an exemplary embodiment, when there is a harmonic component of a previous frame, the audio correction apparatus calculates a high cepstral coefficient, and, when there is no harmonic component of a previous frame, the audio correction apparatus may calculate a low cepstral coefficient.
In addition, the audio correction apparatus generates a detection function by calculating a sum of the cepstral coefficients for the plurality of harmonic components (in operation S124). Specifically, the audio correction apparatus receives an input of audio data including a voice signal, as shown in
In addition, the audio correction apparatus extracts an onset candidate group by detecting the peak of the generated detection function (in operation S125). Specifically, when another harmonic component appears while harmonic components already exist, that is, at a point where an onset occurs, the cepstral coefficient changes abruptly. Therefore, the audio correction apparatus may extract a peak point where the detection function, which is the sum of the cepstral coefficients of the plurality of harmonic components, changes abruptly. According to an exemplary embodiment, the extracted peak point may be set as the onset candidate group.
In addition, the audio correction apparatus detects onset information from among the onset candidate groups (in operation S126). Specifically, from among the onset candidate groups extracted in operation S125, a plurality of onset candidate groups may be extracted from adjacent sections. The plurality of onset candidate groups extracted from the adjacent sections may be onsets which occur when the human voice trembles or other noise comes in. Therefore, the audio correction apparatus may remove all but one of the plurality of onset candidate groups of the adjacent sections, and detect only the remaining onset candidate group as onset information.
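Operations S125 and S126 can be sketched as peak picking followed by merging of adjacent candidates. The threshold, the minimum gap, and the rule of keeping the strongest candidate in each cluster are assumptions made for illustration; the exemplary embodiment states only that one candidate per adjacent section is kept.

```python
def pick_onset_candidates(detection, threshold):
    """Return indices of local maxima of the detection function at or above threshold."""
    return [
        i for i in range(1, len(detection) - 1)
        if detection[i] >= threshold
        and detection[i] > detection[i - 1]
        and detection[i] >= detection[i + 1]
    ]


def merge_adjacent(candidates, detection, min_gap):
    """Keep one candidate per cluster of candidates closer than min_gap frames."""
    onsets = []
    for index in candidates:
        if onsets and index - onsets[-1] < min_gap:
            # Same cluster: keep whichever candidate has the larger peak (assumed rule).
            if detection[index] > detection[onsets[-1]]:
                onsets[-1] = index
        else:
            onsets.append(index)
    return onsets


# Hypothetical detection function values, one per frame.
detection = [0.0, 0.1, 0.9, 0.2, 0.7, 0.1, 0.0, 0.0, 0.1, 0.8, 0.1]
candidates = pick_onset_candidates(detection, threshold=0.5)   # [2, 4, 9]
onsets = merge_adjacent(candidates, detection, min_gap=3)      # [2, 9]
```

In this example, frames 2 and 4 fall within the same adjacent section, so only the stronger candidate at frame 2 is kept as onset information.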
By detecting the onset through the cepstral analysis, as described above, according to an exemplary embodiment, an exact onset can be detected from audio data in which onsets are not clearly distinguished like in a song which is sung by a person or a sound which is made by a string instrument.
Table 1 presented below shows a result of detecting an onset using the HCR method, according to an exemplary embodiment:
As described above, it can be seen that F-measures of various sources are calculated as 0.60-0.79. That is, considering that F-measure detected by various related-art algorithms is 0.19-0.56, an onset can be detected more exactly using the HCR method according to an exemplary embodiment.
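For reference, the F-measure is the harmonic mean of precision and recall. The sketch below shows one way such a score might be computed for onsets; the matching tolerance of 50 ms, the greedy one-to-one matching, and the example onset times are assumptions for illustration and do not describe how the values above were obtained.

```python
def f_measure(detected, reference, tolerance=0.05):
    """Match each reference onset to at most one detected onset within tolerance seconds."""
    unmatched = sorted(detected)
    hits = 0
    for ref in sorted(reference):
        for i, det in enumerate(unmatched):
            if abs(det - ref) <= tolerance:
                hits += 1
                del unmatched[i]
                break
    if hits == 0:
        return 0.0
    precision = hits / len(detected)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical onset times in seconds: 2 of 4 detections match 2 of 3 references.
score = f_measure([0.50, 1.02, 1.60, 2.40], [0.51, 1.00, 2.00])
```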
Referring back to
In an exemplary embodiment, the audio correction apparatus divides a signal between the onsets (in operation S131). Specifically, the audio correction apparatus may divide a signal between the plurality of onsets based on the onset detected in operation S120.
In addition, the audio correction apparatus may perform gammatone filtering with respect to the input signal (in operation S132). Specifically, the audio correction apparatus applies 64 gammatone filters to the input signal. In an exemplary embodiment, the frequencies of the plurality of gammatone filters are divided according to a bandwidth. In addition, the center frequencies of the filters are spaced at equal intervals, and the bandwidth is set between 80 Hz and 400 Hz.
In addition, the audio correction apparatus generates a correntropy function with respect to the input signal (in operation S133). In general, correntropy captures higher-order statistics than the related-art auto-correlation. Therefore, according to an exemplary embodiment, when a human voice is corrected, the frequency resolution is higher than with the related-art auto-correlation. The audio correction apparatus may obtain a correntropy function, as shown in Equation 1 presented below:
V(t, s) = E[k(x(t), x(s))]    (Equation 1)
Here, x(t) and x(s) denote the input signal at times t and s, respectively.
In this case, k(*,*) may be a kernel function which has a positive value and a symmetric characteristic. According to an exemplary embodiment, a Gaussian kernel may be used as the kernel function. The correntropy function into which the Gaussian kernel is substituted, and the Gaussian kernel itself, may be expressed by Equations 2 and 3 presented below:
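Equations 2 and 3 are not reproduced in this text. As a minimal sketch only, the code below assumes the textbook Gaussian kernel, k(u) = exp(-u^2 / (2 sigma^2)) / (sqrt(2 pi) sigma) with u = x(t) - x(s), and estimates the expectation of Equation 1 by a sample mean over each lag. The kernel width sigma, the function names, and the test signal are assumptions for illustration.

```python
import math


def gaussian_kernel(u, sigma):
    """Gaussian kernel evaluated at the difference u = x(t) - x(s)."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def correntropy(signal, max_lag, sigma=0.5):
    """Sample-mean estimate of V for lags 0..max_lag."""
    n = len(signal)
    return [
        sum(gaussian_kernel(signal[i] - signal[i - lag], sigma) for i in range(lag, n)) / (n - lag)
        for lag in range(max_lag + 1)
    ]


# A tone with a period of 20 samples (illustrative).
signal = [math.sin(2 * math.pi * n / 20) for n in range(200)]
v = correntropy(signal, max_lag=30)
```

For this periodic input, the estimate at a lag of 20 samples returns to its value at lag 0, while the estimate at half a period (lag 10) is clearly lower; the lag of such a peak is what the next operation converts into a pitch.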
In addition, the audio correction apparatus detects the peak of the correntropy function (in operation S134). Specifically, when the correntropy is calculated, the audio correction apparatus may output a higher frequency resolution with respect to the input audio data than in the auto-correlation, and detect a sharper peak at the frequency of the corresponding signal. According to an exemplary embodiment, the audio correction apparatus may take, as the pitch of the input voice signal, the frequency of a peak which is greater than or equal to a predetermined threshold value from among the calculated peaks. More specifically,
In addition, the audio correction apparatus may detect a pitch sequence based on the detected pitch (in operation S135). Specifically, the audio correction apparatus may detect pitch information with respect to the plurality of onsets and may detect a pitch sequence for every onset.
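Operations S134 and S135 can be sketched as follows. The code takes a lag-domain function (for example, a correntropy curve) for each inter-onset segment, selects the highest local peak at or above a threshold, and converts its lag to a frequency. The selection of the highest qualifying peak, the minimum lag, and the hand-made curves are assumptions for illustration.

```python
def pitch_from_lag_function(values, sample_rate, min_lag, threshold):
    """Return sample_rate / lag for the highest local peak at or above threshold, else None."""
    best_lag = None
    for lag in range(max(min_lag, 1), len(values) - 1):
        is_peak = values[lag] > values[lag - 1] and values[lag] >= values[lag + 1]
        if is_peak and values[lag] >= threshold:
            if best_lag is None or values[lag] > values[best_lag]:
                best_lag = lag
    return None if best_lag is None else sample_rate / best_lag


def pitch_sequence(lag_functions, sample_rate, min_lag, threshold):
    """One pitch estimate per inter-onset segment."""
    return [pitch_from_lag_function(v, sample_rate, min_lag, threshold) for v in lag_functions]


# Two hand-made lag functions standing in for the curves of two notes.
note_a = [1.0, 0.6, 0.2, 0.1, 0.3, 0.9, 0.3, 0.1, 0.2, 0.5, 0.2]
note_b = [1.0, 0.5, 0.1, 0.2, 0.8, 0.2, 0.1, 0.2, 0.6, 0.2, 0.1]
pitches = pitch_sequence([note_a, note_b], sample_rate=1000, min_lag=2, threshold=0.7)
```

With a sample rate of 1000 Hz, the peaks at lags 5 and 4 give pitch estimates of 200 Hz and 250 Hz for the two segments.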
In the above-described exemplary embodiment, the pitch is detected using the correntropy pitch detection method. However, this is merely an example and not by way of a limitation, and the pitch of the audio data may be detected using other methods (for example, the auto-correlation method).
Referring back to
In particular, the audio correction apparatus may align the audio data with the reference audio data using a dynamic time warping (DTW) method. Specifically, the dynamic time warping method is an algorithm for finding an optimum warping path by comparing similarity between the two sequences.
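A minimal sketch of dynamic time warping over two pitch sequences is given below. The absolute pitch difference used as the local cost, the three allowed step directions, and the example sequences are assumptions for illustration; the exemplary embodiment does not specify them.

```python
def dtw_path(x, y):
    """Return (total cost, warping path) aligning sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the optimum warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


sung = [60.2, 60.1, 62.3, 64.0, 64.1]   # hypothetical detected pitches (MIDI note numbers)
reference = [60.0, 62.0, 64.0]          # hypothetical reference pitches
total, path = dtw_path(sung, reference)
```

Each pair in the returned path maps an index of the input sequence to an index of the reference sequence, so several input frames may be mapped to one reference note.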
Specifically, the audio correction apparatus may detect sequence X with respect to the audio data input using operations S120 and S130, as shown in
In particular, according to an exemplary embodiment, the audio correction apparatus may detect an optimum path for pitch information, as shown with a dotted line in
According to an exemplary embodiment, the audio correction apparatus may calculate an onset correction ratio and a pitch correction ratio of the audio data with respect to the reference audio data while calculating the optimum path. The onset correction ratio may be a ratio for correcting the length of time of the input audio data (time stretching ratio), and the pitch correction ratio may be a ratio for correcting the frequency of the input audio data (pitch shifting ratio).
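As an illustration of the two ratios, the sketch below assumes that notes have already been paired by the alignment and that each ratio is simply the reference value divided by the input value. The tuple layout and this definition of the ratios are assumptions; the exemplary embodiment does not give a formula.

```python
def correction_ratios(sung_notes, reference_notes):
    """Each note is (onset_seconds, offset_seconds, frequency_hz); notes are already paired."""
    ratios = []
    for (s_on, s_off, s_hz), (r_on, r_off, r_hz) in zip(sung_notes, reference_notes):
        time_ratio = (r_off - r_on) / (s_off - s_on)   # time stretching ratio
        pitch_ratio = r_hz / s_hz                      # pitch shifting ratio
        ratios.append((time_ratio, pitch_ratio))
    return ratios


# Hypothetical paired notes.
sung_notes = [(0.0, 0.5, 220.0), (0.5, 1.5, 250.0)]
reference_notes = [(0.0, 1.0, 220.0), (1.0, 1.5, 247.5)]
ratios = correction_ratios(sung_notes, reference_notes)
```

Here the first note must be stretched to twice its length with no pitch change, and the second must be shortened to half its length and lowered slightly in pitch.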
Referring back to
In particular, the audio correction apparatus may correct the onset information of the audio data using a phase vocoder. Specifically, the phase vocoder may correct the onset information of the audio data through analysis, modification, and synthesis. In an exemplary embodiment, the onset information correction in the phase vocoder may stretch or reduce the time of the input audio data by differently setting an analysis hopsize and a synthesis hopsize.
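The relationship between the two hop sizes can be sketched as follows: frames are read from the input every analysis hop and written to the output every synthesis hop, so the output duration scales approximately by the ratio of the synthesis hop to the analysis hop. The function names and numeric values are assumptions, and the phase adjustment performed by an actual phase vocoder is omitted.

```python
def hop_sizes(analysis_hop, stretch_ratio):
    """Synthesis hop that stretches time by stretch_ratio for a given analysis hop."""
    synthesis_hop = max(1, round(analysis_hop * stretch_ratio))
    return analysis_hop, synthesis_hop


def frame_positions(num_samples, frame_size, analysis_hop, synthesis_hop):
    """Pairs of (read position in the input, write position in the output)."""
    count = 1 + max(0, (num_samples - frame_size) // analysis_hop)
    return [(k * analysis_hop, k * synthesis_hop) for k in range(count)]


analysis_hop, synthesis_hop = hop_sizes(analysis_hop=256, stretch_ratio=1.5)
positions = frame_positions(4096, 1024, analysis_hop, synthesis_hop)
```

With a stretch ratio of 1.5, the last frame is read at sample 3072 of the input but written at sample 4608 of the output, which lengthens the audio data.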
In addition, the audio correction apparatus may correct the pitch information of the audio data using the phase vocoder. According to an exemplary embodiment, the audio correction apparatus may correct the pitch information of the audio data using a change in the pitch which occurs when a time scale is changed through re-sampling. Specifically, the audio correction apparatus performs time stretching 152 with respect to the input audio data 151, as shown in
In addition, when the audio correction apparatus corrects the pitch through re-sampling, the input audio data may be multiplied in advance by an alignment coefficient P, which is pre-determined to maintain the formant even after re-sampling, in order to prevent the formant from being changed. The alignment coefficient P may be calculated by Equation 4 presented below:
In this case, A(k) is a formant envelope.
In addition, in the case of a general phase vocoder, distortion such as ringing may be caused. This problem is caused by phase discontinuity on the time axis, which arises when phase discontinuity on the frequency axis is corrected. To solve this problem, according to an exemplary embodiment, the audio correction apparatus may correct the audio data by preserving the formant of the audio data using a synchronized overlap add (SOLA) algorithm. Specifically, the audio correction apparatus may perform phase vocoding with respect to some initial frames, and then may remove the discontinuity which occurs on the time axis by synchronizing the input audio data with the data which has undergone the phase vocoding.
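The synchronization step of a SOLA-style join can be sketched as follows: the start of a new segment is shifted to the offset at which it best correlates with the tail of the existing output, and the two are then cross-faded. The unnormalized cross-correlation, the linear cross-fade, and all parameter values are simplifying assumptions, and the sketch does not reproduce the combination with phase vocoding described above.

```python
import math


def best_offset(tail, segment, max_shift):
    """Shift of segment (0..max_shift) whose start best correlates with tail."""
    best, best_score = 0, -math.inf
    for shift in range(max_shift + 1):
        window = segment[shift:shift + len(tail)]
        if len(window) < len(tail):
            break
        score = sum(a * b for a, b in zip(tail, window))
        if score > best_score:
            best, best_score = shift, score
    return best


def sola_join(output, segment, overlap, max_shift):
    """Append segment to output with a linear cross-fade at the best-matching offset."""
    tail = output[-overlap:]
    shift = best_offset(tail, segment, max_shift)
    aligned = segment[shift:]
    faded = [
        tail[i] * (1 - i / overlap) + aligned[i] * (i / overlap)
        for i in range(overlap)
    ]
    return output[:-overlap] + faded + aligned[overlap:]


period = 16
first = [math.sin(2 * math.pi * n / period) for n in range(64)]
# The next segment starts 5 samples out of phase with the end of the first one.
second = [math.sin(2 * math.pi * (n + 64 - 5) / period) for n in range(64)]
joined = sola_join(first, second, overlap=16, max_shift=15)
```

In this example the search finds the 5-sample offset, so the joined signal continues the sine wave without a discontinuity on the time axis.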
According to the above-described audio correction method of an exemplary embodiment, the onset can be detected from the audio data in which the onsets are not clearly distinguished, such as a song which is sung by a person or a sound of a string instrument, and thus, the audio data can be corrected more exactly or precisely.
Hereinafter, an audio correction apparatus 800 according to an exemplary embodiment will be explained in detail with reference to
The inputter 810 receives an input of audio data. According to an exemplary embodiment, the audio data may be a song which is sung by a person or a sound of a string instrument. The inputter 810 may be a microphone with a sensor configured to detect audio signals.
The onset detector 820 may detect an onset by analyzing harmonic components of the input audio data. Specifically, the onset detector 820 may detect onset information by performing cepstral analysis with respect to the audio data and then analyzing the harmonic components of the cepstral-analyzed audio data. In particular, the onset detector 820 performs cepstral analysis with respect to the audio data as shown in
The pitch detector 830 detects pitch information of the audio data based on the detected onset information. According to an exemplary embodiment, the pitch detector 830 may detect pitch information between the onset components using a correntropy pitch detection method. However, this is merely an example and not by way of a limitation, and the pitch information may be detected using other methods.
The aligner 840 compares the input audio data and reference audio data and aligns the input audio data with reference audio data based on the detected onset information and pitch information. In this case, the aligner 840 may compare the input audio data and the reference audio data and align the input audio data with the reference audio data using a dynamic time warping method. According to an exemplary embodiment, the aligner 840 may calculate an onset correction ratio and a pitch correction ratio of the input audio data with respect to the reference audio data.
The corrector 850 may correct the input audio data aligned with the reference audio data to match the reference audio data. In particular, the corrector 850 may correct the input audio data according to the calculated onset correction ratio and pitch correction ratio. In addition, the corrector 850 may correct the input audio data using a SOLA algorithm to prevent a change of a formant which may be caused when the onset and pitch are corrected. In an exemplary embodiment, the onset detector 820, the pitch detector 830, the aligner 840, and the corrector 850 may be implemented by a hardware processor or a combination of processors. The corrected input audio data may be output via speakers (not shown).
The above-described audio correction apparatus 800 can detect the onset from the audio data in which the onsets are not clearly distinguished, such as a song which is sung by a person or a sound of a string instrument, and thus can correct the audio data more exactly and/or precisely.
In particular, when the audio correction apparatus 800 is implemented by using a user terminal such as a smartphone, exemplary embodiments may be applicable to various scenarios. For example, the user may select a song that the user wants to sing. The audio correction apparatus 800 obtains reference MIDI data of the song selected by the user. When a record button is selected by the user, the audio correction apparatus 800 displays a score and guides the user to sing the song more exactly or precisely, i.e., more closely to how it should be sung. When the recording of the user's song is completed, the audio correction apparatus 800 corrects the user's song, according to an exemplary embodiment described above with reference to
The audio correction method of the audio correction apparatus 800 according to the above-described various exemplary embodiments may be implemented as a program and provided to the audio correction apparatus 800. In particular, the program including the audio correction method of the audio correction apparatus 800 may be stored in a non-transitory computer readable medium and provided for use by the device.
The non-transitory computer readable medium refers to a medium that stores data semi-permanently rather than storing data for a very short time, such as a register, a cache, and a memory, and is readable by an apparatus. Specifically, the above-described various applications or programs may be stored in a non-transitory computer readable medium such as a compact disc (CD), a digital versatile disk (DVD), a hard disk, a Blu-ray disk, a universal serial bus (USB), a memory card, and a read only memory (ROM), and may be provided for use by a device.
The foregoing exemplary embodiments are merely exemplary and are not to be construed as limiting the present inventive concept. The exemplary embodiments can be readily applied to other types of apparatuses. Also, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.
Number | Date | Country | Kind |
---|---|---|---|
10-2013-0157926 | Dec 2013 | KR | national |
This application claims the benefit of priority from Korean Patent Application No. 10-2013-0157926, filed on Dec. 18, 2013 and U.S. Provisional Application No. 61/740,160 filed on Dec. 20, 2012, the disclosures of which are incorporated herein by reference in their entireties. This application is a National Stage Entry of the PCT Application No. PCT/KR2013/011883 filed on Dec. 19, 2013, the entire disclosure of which is also incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/KR2013/011883 | 12/19/2013 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2014/098498 | 6/26/2014 | WO | A |
Number | Date | Country | |
---|---|---|---|
20150348566 A1 | Dec 2015 | US |
Number | Date | Country | |
---|---|---|---|
61740160 | Dec 2012 | US |