The present invention relates to a speech and audio signal encoding technology, and in particular, to a pitch detection method and apparatus.
To save bandwidth for transmitting and storing speech and audio signals, speech and audio encoding technology has been widely used. The technology includes lossy encoding and lossless encoding. For lossy encoding, the reconstructed signal may differ from the original signal, but signal redundancy is minimized according to the features of the sound source and human auditory perception, so that little coding information is transmitted while high speech and audio quality is achieved. For lossless encoding, the reconstructed signal is the same as the original signal, so that the final decoding quality is not degraded. Generally, the compression efficiency of lossy encoding is high, but the quality of the reconstructed speech and audio signal cannot be guaranteed. Lossless encoding can guarantee the speech quality because it can reconstruct signals without distortion, but the compression rate is only about 50%.
The pitch is an important parameter in both lossy encoding and lossless encoding. The final encoding performance depends on the accuracy of the pitch detection. In the prior art, many pitch detection methods are available, one of which includes: mapping a signal to a domain, performing search pre-processing, performing a coarse search on an open loop basis, then performing a refined search on a closed loop basis, and finally performing post-processing such as pitch smoothing. All these operations are performed in one domain, for example, time domain, frequency domain, cepstrum domain, signal domain, or residual domain.
During the implementation of the present invention, the inventor found that the prior art has the following problems: In an actual algorithm, many operations need to be performed in different domains, and the pitch detection algorithm shows different levels of performance and complexity in different domains. For example, in the time domain, the pitch detection complexity is low; in the frequency domain, the pitch detection accuracy is higher; in the signal domain, the pitch is pronounced and is easy to detect; in the residual domain, the pitch is weak and thus is difficult to detect.
Embodiments of the present invention provide a pitch detection method and apparatus to overcome the weakness of detecting a pitch in a single domain in the prior art.
To achieve the above objective, embodiments of the present invention provide the following technical solution:
A pitch detection method includes:
performing a pitch detection on an input signal in a signal domain, and obtaining a candidate pitch;
performing a linear prediction (LP) on the input signal, and obtaining an LP residual signal;
setting a candidate pitch range that includes the candidate pitch; and
searching for the LP residual signal in the candidate pitch range, and obtaining a selected pitch.
A pitch detection apparatus includes:
a signal-domain pitch detecting unit, configured to perform pitch detection on the input signal in the signal domain, and obtain a candidate pitch;
a linear predicting unit, configured to perform LP on the input signal and obtain an LP residual signal;
a setting unit, configured to set a candidate pitch range that includes the candidate pitch; and
a residual-domain refined detecting unit, configured to search for the LP residual signal in the candidate pitch range, and obtain a selected pitch.
The method and apparatus provided in some embodiments of the present invention detect pitches with different accuracy in the signal and residual domains in sequence according to different features of the signal in the two domains. This overcomes the weakness in the prior art. Thus, the complexity of the algorithm is reduced and the accuracy of the pitch detection is guaranteed.
The accompanying drawings are intended to make the present invention clearer and are part of this application, without constituting any limitation on the present invention: In the accompanying drawings:
For better understanding of the objective, technical solution and merits of the invention, embodiments of the present invention are hereinafter described in detail with reference to the accompanying drawings. Embodiments of the present invention and explanations thereof are intended to make the present invention clearer, and the present invention is not limited to such embodiments.
This embodiment provides a pitch detection method, which is hereinafter described in detail with reference to the accompanying drawings.
Block 101: Perform pitch detection on the input signal in the signal domain, and obtain a candidate pitch.
In this embodiment, some pre-processing operations may be performed on the input signal prior to the pitch detection in the signal domain, for example, low pass filtering, median clipping and down sampling; then pitch search is performed on the pre-processed signal. Thus, before block 101, the method may further include pre-processing the input signal and obtaining a pre-processed signal. The process of pre-processing may include: performing low pass filtering and down sampling on the input signal, and obtaining a down sampled signal. In this case, the down sampled signal is provided as the pre-processed signal according to one embodiment, and then the pitch detection is performed on the down sampled signal in the signal domain.
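The pre-processing path described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not give the filter coefficients, so a two-tap averaging filter stands in for the low pass filter, and the names `low_pass`, `down_sample` and `pre_process` are hypothetical.

```python
def low_pass(s):
    """Two-tap moving average: y(n) = (s(n) + s(n-1)) / 2, with s(-1) = 0.

    Assumption: the source does not specify the low pass filter; this
    simple FIR filter is only a stand-in for illustration.
    """
    y = []
    prev = 0.0
    for x in s:
        y.append((x + prev) / 2.0)
        prev = x
    return y


def down_sample(y, factor=2):
    """Keep every `factor`-th sample: y2(n) = y(factor * n)."""
    return y[::factor]


def pre_process(s):
    """Low pass filter the input, then down sample it by 2."""
    return down_sample(low_pass(s))
```

For a frame of 160 samples, `pre_process` returns 80 samples, on which the signal-domain pitch detection would then operate.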
In this embodiment, a lot of signal domain pitch search methods may be available to search the pre-processed signal for the pitch. To guarantee the accuracy and continuity of the pitch, the searched pitch needs to undergo post-processing algorithms such as pitch smoothing and double frequency detection. The pitch detected in the signal domain is used as the candidate pitch for refined detection in the residual domain.
Block 102: Perform a linear prediction on the input signal, and obtain a linear prediction residual signal.
According to one embodiment, the LP residual signal may be obtained by performing linear prediction on the input signal after windowing the input signal.
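One possible way to obtain an LP residual signal is sketched below. The Hamming window, the autocorrelation method with the Levinson-Durbin recursion, the prediction order, and all function names are assumptions made for illustration; the source only states that the input signal is windowed and that linear prediction is performed.

```python
import math


def hamming(n):
    """Hamming window of length n (an assumed window choice)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of x."""
    return [sum(x[i] * x[i - k] for i in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return a[0..order] with a[0] = 1, so that the prediction error
    is e(n) = sum over j of a[j] * s(n - j)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input, for example an all-zero frame
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_residual(s, order=10):
    """Estimate LP coefficients on the windowed frame, then inverse-filter
    the unwindowed frame. Samples before the frame are treated as zero."""
    w = hamming(len(s))
    windowed = [x * wi for x, wi in zip(s, w)]
    a = levinson_durbin(autocorr(windowed, order), order)
    return [sum(a[j] * s[n - j] for j in range(order + 1) if n - j >= 0)
            for n in range(len(s))]
```

For a strongly predictable input such as a sinusoid, the residual energy returned by `lp_residual` is far below the energy of the input frame.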
Block 103: Set a candidate pitch range that includes the candidate pitch.
Many encoders transfer the signal to the LP residual domain for processing, and these encoders need to obtain an accurate pitch according to the LP residual signal. Thus, a refined search needs to be performed on the residual signal near the candidate pitch to meet the requirements of the encoders.
The minimum value of the candidate pitch range is equal to the difference between the candidate pitch and a first threshold, and the maximum value of the candidate pitch range is equal to the sum of the candidate pitch and a second threshold. The first threshold and the second threshold may be determined according to the performance and complexity of the algorithm. The first threshold may be the same as or different from the second threshold.
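The range definition above amounts to a one-line computation. The sketch below uses a hypothetical function name, and the example values (a candidate pitch of 40 with both thresholds equal to 2) are chosen only for illustration.

```python
def candidate_pitch_range(candidate, first_threshold, second_threshold):
    """Return (minimum, maximum) of the candidate pitch range."""
    return candidate - first_threshold, candidate + second_threshold


# Example: candidate pitch 40, both thresholds equal to 2.
low, high = candidate_pitch_range(40, 2, 2)
lags = list(range(low, high + 1))  # lags examined by the refined search
```

Larger thresholds widen the refined search, which costs more computation; smaller thresholds rely more heavily on the accuracy of the candidate pitch.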
Block 104: Perform a refined search on the LP residual signal in the candidate pitch range, and obtain a selected pitch.
In this embodiment, the refined search on the LP residual signal is based on an auto correlation function. The pitch within the candidate pitch range that maximizes the auto correlation function is used as the selected pitch. The refined search may also be performed by comparing the energy of the long-term prediction (LTP) residual signal: the minimum energy of the LTP residual signal within the candidate pitch range is found, and the pitch corresponding to that minimum is used as the selected pitch (T′).
According to this embodiment, the pitch obtained through the refined search needs to undergo post-processing operations such as pitch smoothing and double frequency detection according to actual conditions, and an optimal pitch that is found through the refined detection in the residual domain is used as the selected pitch.
The method provided in this embodiment detects pitch with different accuracy in the signal and residual domains in sequence according to different features of the signal in the two domains. This overcomes the weakness of pitch detection in a single domain. Thus, the complexity of the algorithm is reduced and the accuracy of the pitch detection is guaranteed.
This embodiment provides another pitch detection method, which is hereinafter described in detail with reference to the accompanying drawings.
Block 201: Perform low pass filtering on the input signal s(n), and obtain a low pass filtered signal y(n):
where n=0, 1, . . . , L
Block 202: Downsample the low pass filtered signal y(n), and obtain a downsampled signal y2(n):
y2(n)=y(2n), where
Block 203: Perform pitch search on the downsampled signal y2(n).
Because the pitch generally ranges from 2 ms to 20 ms, the pitch range is limited to [20, 83] (8 kHz sampling) in this embodiment, and the pitch parameter may be encoded with 6 bits in consideration of encoding efficiency and performance. In addition, the pitch cannot be too long for the frame length of 160 samples; otherwise, few samples in a frame signal participate in the LTP calculation, which may reduce the LTP performance.
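The arithmetic behind the 6-bit figure can be checked with a short sketch. The range [20, 83] and the 8 kHz rate come from the text above; the offset-based index mapping and the function names are assumptions, since the source does not say how the pitch is mapped to a code word.

```python
FS_HZ = 8000                # sampling rate stated in the text
LAG_MIN, LAG_MAX = 20, 83   # pitch range in samples stated in the text


def lag_to_ms(lag, fs_hz=FS_HZ):
    """Convert a pitch lag in samples to milliseconds."""
    return 1000.0 * lag / fs_hz


def bits_for_range(lo, hi):
    """Bits needed to index every integer lag in [lo, hi]."""
    count = hi - lo + 1
    return (count - 1).bit_length()


def encode_lag(lag):
    """Assumed mapping: store the offset from the minimum lag."""
    return lag - LAG_MIN


def decode_lag(index):
    """Inverse of encode_lag."""
    return index + LAG_MIN
```

The range holds 83 − 20 + 1 = 64 lags, which is exactly what 6 bits can index, and it spans 2.5 ms to 10.375 ms at 8 kHz.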
In one embodiment, assume that L is equal to 160 samples. In the down sampled signal domain, the pitch range is changed to [10, 41], that is, PMIN=10 and PMAX=41, as shown in
In one embodiment, block 203 may further include:
Block 2031: According to the pitch range, find a pulse with the maximum amplitude in the second half-frame signal of the down sampled signal in the down sampled signal domain, where the pulse position is recorded as p0.
Block 2032: Add a target window with the size of [smin, smax] around p0, where:
and the window length (len) is equal to the difference between smax and smin, where s_max( ) denotes returning a maximum value in the included elements; and s_min( ) denotes returning a minimum value in the included elements.
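Blocks 2031 and 2032 can be sketched as follows. Two points are assumptions: "maximum amplitude" is read as the largest absolute sample value, and, because the window formula is not reproduced in this text, the clamped window in `target_window` (including its `half_width` parameter) is a hypothetical stand-in rather than the formula of the embodiment.

```python
def find_max_pulse(y2):
    """Index p0 of the largest-magnitude sample in the second half-frame."""
    half = len(y2) // 2
    return max(range(half, len(y2)), key=lambda i: abs(y2[i]))


def target_window(p0, frame_len, half_width, pmax):
    """Hypothetical target window [smin, smax] around p0.

    smin is kept at or above pmax so that y2(i - k) stays inside the
    frame for every lag k <= pmax; smax is kept inside the frame.
    """
    smin = max(p0 - half_width, pmax)
    smax = min(p0 + half_width, frame_len - 1)
    return smin, smax
```

With an 80-sample down sampled frame and PMAX=41, a pulse at position 60 and a half width of 15 gives the window [45, 75].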
Block 2033: Obtain an initial pitch according to the pre-processed signal in the target window and sliding windows of the target window.
In this embodiment, the method for obtaining the initial pitch includes but is not limited to the following three methods:
First Method
Calculate the energy E(k) of the LTP residual signal xk(i), and use the pitch corresponding to the minimum energy as the initial pitch:
$$x_k(i)=y_2(i)-g\cdot y_2(i-k),\quad i=s_{min},\ldots,s_{max},$$
where g indicates an LTP gain factor and k∈[10,41].
Then,
$$E(k)=\sum_{i=s_{min}}^{s_{max}}x_k^2(i),$$
where k∈[10,41].
Select the minimum value in E(k) and the pitch corresponding to the minimum value as follows:
$$P=\{E(P)<E(k),\ k\in[10,41],\ k\neq P\}.$$
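The first method can be sketched as below. The source does not say how the LTP gain factor g is chosen, so the least-squares gain used here is an assumption, as are the function names. The window must satisfy smin ≥ PMAX so that y2(i−k) never reaches before the start of the frame.

```python
def ltp_gain(y2, k, smin, smax):
    """Least-squares gain for lag k (an assumed choice of g)."""
    num = sum(y2[i] * y2[i - k] for i in range(smin, smax + 1))
    den = sum(y2[i - k] ** 2 for i in range(smin, smax + 1))
    return num / den if den > 0.0 else 0.0


def ltp_residual_energy(y2, k, smin, smax):
    """E(k): energy of x_k(i) = y2(i) - g * y2(i - k) over the window."""
    g = ltp_gain(y2, k, smin, smax)
    return sum((y2[i] - g * y2[i - k]) ** 2 for i in range(smin, smax + 1))


def initial_pitch_by_energy(y2, smin, smax, pmin=10, pmax=41):
    """Return the lag in [pmin, pmax] with the smallest E(k)."""
    if smin < pmax:
        raise ValueError("smin must be >= pmax so that i - k stays in the frame")
    return min(range(pmin, pmax + 1),
               key=lambda k: ltp_residual_energy(y2, k, smin, smax))
```

For a frame that repeats every 25 samples, the residual energy vanishes at k=25, so that lag is returned.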
Second Method
Match the signals around the pulse with the maximum amplitude in the down sampled signal, obtain the correlation coefficients by calculating the following correlation function, and use the pitch corresponding to the maximum correlation coefficient as the initial pitch.
The correlation function may be
where k∈[10,41]. The k value corresponding to the maximum correlation coefficient (corr[.]) is used as the initial pitch (P).
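A sketch of the second method follows. Because the correlation function itself is not reproduced in this text, a normalized cross-correlation over the target window is assumed here; the embodiment may use a different form. The function names are hypothetical.

```python
import math


def corr(y2, k, smin, smax):
    """Assumed form: normalized correlation between the window
    y2[smin..smax] and the same window delayed by k samples."""
    idx = range(smin, smax + 1)
    num = sum(y2[i] * y2[i - k] for i in idx)
    e_cur = sum(y2[i] ** 2 for i in idx)
    e_del = sum(y2[i - k] ** 2 for i in idx)
    den = math.sqrt(e_cur * e_del)
    return num / den if den > 0.0 else 0.0


def initial_pitch_by_correlation(y2, smin, smax, pmin=10, pmax=41):
    """Return the lag in [pmin, pmax] with the largest corr[k]."""
    if smin < pmax:
        raise ValueError("smin must be >= pmax so that i - k stays in the frame")
    return max(range(pmin, pmax + 1), key=lambda k: corr(y2, k, smin, smax))
```

Normalizing by the window energies keeps the comparison between lags from being dominated by louder parts of the frame, which is one reason this form is a common choice.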
Third Method
Calculate the sum of absolute values of the LTP residual signal xk(i), and use the pitch corresponding to the minimum sum of absolute values as the initial pitch:
$$x_k(i)=y_2(i)-g\cdot y_2(i-k),\quad i=s_{min},\ldots,s_{max},$$
where g indicates an LTP gain factor and k∈[10,41].
$$E(k)=\sum_{i=s_{min}}^{s_{max}}\left|x_k(i)\right|,$$
where k∈[10,41].
Select the minimum value in E(k) and the pitch corresponding to the minimum value as follows:
$$P=\{E(P)<E(k),\ k\in[10,41],\ k\neq P\}.$$
Block 2034: To avoid mistaking the double value of the initial pitch as the initial pitch, compare the initial pitch with a pitch twice the initial pitch as follows according to one embodiment:
where L indicates the frame length and p takes the values P and 2P.
Of the two pitches (P and 2P), the one that maximizes nor_cor[.] is used as the candidate pitch, which may be set to T in this embodiment.
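The comparison between P and 2P can be sketched as follows. The definition of nor_cor[.] is not reproduced in this text, so a normalized autocorrelation over the frame is assumed; the function `candidate_pitch` and the rule that a tie keeps the initial pitch are also assumptions.

```python
import math


def nor_cor(x, p):
    """Assumed form: normalized autocorrelation of the frame x at lag p."""
    idx = range(p, len(x))
    num = sum(x[i] * x[i - p] for i in idx)
    e_cur = sum(x[i] ** 2 for i in idx)
    e_del = sum(x[i - p] ** 2 for i in idx)
    den = math.sqrt(e_cur * e_del)
    return num / den if den > 0.0 else 0.0


def candidate_pitch(x, initial_pitch):
    """Pick P or 2P, whichever gives the larger nor_cor.

    2P is considered only if it fits inside the frame; on a tie the
    initial pitch P is kept because max() returns the first maximum.
    """
    options = [p for p in (initial_pitch, 2 * initial_pitch) if p < len(x)]
    return max(options, key=lambda p: nor_cor(x, p))
```

For a frame whose pulses alternate between two amplitudes every 15 samples, the true period is 30, and the check moves the pitch from 15 to 30.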
Block 204: Window the input signal, perform LP on the input signal, and obtain an LP residual signal e(n).
Block 205: Perform the refined pitch search on the LP residual signal e(n) in the range of [T−Td1,T+Td2], and obtain the selected pitch.
In one embodiment, the pitch may be searched out by using an auto correlation function. Considering the encoding efficiency and performance, the auto correlation function may be represented as one of the following three formulas:
The k value within the range of [T−Td1,T+Td2] that enables nor_cor[.] to be the largest is used as the optimal pitch (T′), that is, the selected pitch. The first threshold (Td1) and the second threshold (Td2) may be determined according to the performance and complexity of the algorithm. For example, both Td1 and Td2 may be set to 2.
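The refined search over [T−Td1, T+Td2] can be sketched as below. The three auto correlation formulas are not reproduced in this text, so one plausible normalized form is assumed for nor_cor[.]; the clamping of the range to valid lags and the function names are likewise assumptions.

```python
import math


def nor_cor(e, k):
    """Assumed form: normalized autocorrelation of the residual e at lag k."""
    idx = range(k, len(e))
    num = sum(e[n] * e[n - k] for n in idx)
    e_cur = sum(e[n] ** 2 for n in idx)
    e_del = sum(e[n - k] ** 2 for n in idx)
    den = math.sqrt(e_cur * e_del)
    return num / den if den > 0.0 else 0.0


def refine_pitch(e, t, td1=2, td2=2):
    """Return the lag in [t - td1, t + td2] that maximizes nor_cor."""
    lo = max(1, t - td1)              # keep lags positive
    hi = min(len(e) - 1, t + td2)     # keep lags inside the frame
    return max(range(lo, hi + 1), key=lambda k: nor_cor(e, k))
```

With Td1 = Td2 = 2, only five lags are evaluated on the residual, which is where the complexity saving over a full-range residual-domain search comes from.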
In another embodiment, the pitch may be searched out by comparing the energy of the LTP residual signal as follows:
$$u_k(n)=e(n)-g'\cdot e(n-k),\quad n=k,\ldots,L-1,$$
where uk(n) indicates the LTP residual signal, g′ indicates the LTP gain factor and k∈[T−Td1,T+Td2]. The energy of the LTP residual signal is
$$E(k)=\sum_{n=k}^{L-1}u_k^2(n),$$
where k∈[T−Td1,T+Td2]. Alternatively, E(k) may also be represented by the sum of absolute values of uk(n).
The minimum value in E(k) is selected and a pitch corresponding to the minimum value is used as the selected pitch (T′).
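The LTP-based alternative can be sketched as follows, with a flag that switches between the energy and the sum of absolute values. The least-squares choice of the gain g′ is an assumption, since the source does not say how g′ is obtained, and the function names are hypothetical.

```python
def ltp_cost(e, k, use_abs=False):
    """E(k) for lag k over n = k .. L-1, with u_k(n) = e(n) - g' * e(n - k).

    Assumption: g' is the least-squares gain for this lag.
    """
    idx = range(k, len(e))
    num = sum(e[n] * e[n - k] for n in idx)
    den = sum(e[n - k] ** 2 for n in idx)
    g = num / den if den > 0.0 else 0.0
    u = [e[n] - g * e[n - k] for n in idx]
    if use_abs:
        return sum(abs(v) for v in u)   # sum of absolute values
    return sum(v * v for v in u)        # energy


def refine_pitch_by_ltp(e, t, td1=2, td2=2, use_abs=False):
    """Return the lag in [t - td1, t + td2] that minimizes E(k)."""
    lo = max(1, t - td1)
    hi = min(len(e) - 1, t + td2)
    return min(range(lo, hi + 1), key=lambda k: ltp_cost(e, k, use_abs))
```

The absolute-value variant avoids the squaring step, which may be attractive on fixed-point hardware; both variants select the same lag when the residual is exactly periodic.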
In this embodiment, according to different features of the signal in different domains and requirements of the actual algorithm, a pitch is searched coarsely in the signal domain and then a refined pitch search is performed in the residual domain according to the pitch obtained in the coarse search. The method provided in this embodiment detects pitches with different accuracy in the signal and residual domains in sequence according to different features of the signal in the two domains. This overcomes the weakness in the prior art. Thus, the complexity of the algorithm is reduced and the accuracy of the pitch detection is guaranteed.
This embodiment provides a pitch detection apparatus, which is hereinafter described in detail with reference to the accompanying drawing.
a signal-domain pitch detecting unit 41, configured to detect the pitch of the input signal in the signal domain, and obtain a candidate pitch;
a linear predicting unit 42, configured to perform LP on the input signal, and obtain an LP residual signal;
a setting unit 43, configured to set a candidate pitch range that includes the candidate pitch; and
a residual-domain refined detecting unit 44, configured to perform a refined search on the LP residual signal within the candidate pitch range, and obtain a selected pitch.
The components of the apparatus provided in this embodiment are configured to implement each step of the method in the first embodiment of the present invention. Because each step of the method has been described in detail in the first embodiment, these components will not be further described.
The apparatus provided in this embodiment detects pitches with different accuracy in the signal and residual domains in sequence according to different features of the signal in the two domains. This overcomes the weakness in the prior art. Thus, the complexity of the algorithm is reduced and the accuracy of the pitch detection is guaranteed.
This embodiment provides a pitch detection apparatus, which is hereinafter described in detail with reference to the accompanying drawing.
a pre-processing unit 55, configured to pre-process the input signal, obtain a pre-processed signal, and provide the pre-processed signal to the signal-domain pitch detecting unit 51 in the signal domain.
The pre-processing unit 55 may include:
a low pass filtering module 551, configured to perform low pass filtering on the input signal; and
a down sampling module 552, configured to down sample the input signal that has undergone the low pass filtering by the low pass filtering module 551, and obtain a down sampled signal.
In one embodiment, the signal-domain pitch detecting unit 51 may include:
a first windowing module 511, configured to add a target window around a pulse position with the maximum amplitude in the second half-frame signal of the pre-processed signal;
an initial pitch obtaining module 512, configured to obtain an initial pitch according to the pre-processed signal in the target window and sliding windows of the target window; and
a candidate pitch obtaining module 513, configured to perform double frequency detection on the initial pitch, and obtain a candidate pitch.
The initial pitch obtaining module 512 may be configured to calculate the energy of the LTP residual signal according to the pre-processed signal in the target window and sliding windows of the target window, and use a pitch corresponding to the minimum energy as the initial pitch; or match the signal around a pulse with the maximum amplitude in the pre-processed signal, calculate a correlation coefficient, and use a pitch corresponding to the maximum correlation coefficient as the initial pitch; or calculate the sum of absolute values of the LTP residual signal according to the pre-processed signal in the target window and sliding windows of the target window, and use a pitch corresponding to the minimum sum of absolute values as the initial pitch.
In one embodiment, the linear predicting unit 52 may include:
a second windowing module 521, configured to window the input signal; and
a linear predicting module 522, configured to perform LP on the input signal windowed by the second windowing module 521, and obtain an LP residual signal.
In one embodiment, the residual-domain refined detecting unit 54 may include:
a refined searching module 541, configured to search for the LP residual signal refinedly by using an auto correlation function or comparing the energy of the LTP residual signal; and
a selected pitch obtaining module 542, configured to use a pitch that enables the auto correlation function to be the largest or the energy of the LTP residual signal to be the smallest within the candidate pitch range as the selected pitch.
The components of the apparatus provided in this embodiment are configured to implement each step of the method in the second embodiment of the present invention. Because each step of the method has been described in detail in the second embodiment, these components will not be further described.
The apparatus provided in this embodiment detects pitches with different accuracy in the signal and residual domains in sequence according to different features of the signal in the two domains. This overcomes the weakness in the prior art. Thus, the complexity of the algorithm is reduced and the accuracy of the pitch detection is guaranteed.
Detailed above are the objective, technical solution and merits of the present invention. Although the present invention has been described through several exemplary embodiments and accompanying drawings, the invention is not limited to such embodiments. It is apparent that those skilled in the art can make various modifications and variations to the invention without departing from the spirit and scope of the invention. The invention shall cover the modifications and variations provided that they fall in the scope of protection defined by the following claims or their equivalents.
This application is a continuation of and claims priority to International Application No. PCT/CN2009/070423, filed on Feb. 13, 2009, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20100211384 A1 | Aug 2010 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/CN2009/070423 | Feb 2009 | US |
Child | 12798715 | US |