This application claims the benefit of Korean Patent Application No. 10-2004-0008740, filed on Feb. 10, 2004, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Invention
The present invention relates to an apparatus, method, and medium for detecting a voiced sound and an unvoiced sound, and more particularly, to an apparatus, method, and medium for detecting a voiced sound zone and an unvoiced sound zone using a spectral flatness measure (SFM) and a slope of a mel-scaled filter bank spectrum obtained from a voice signal in a predetermined zone.
2. Description of the Related Art
Various encoding methods have been suggested that compress a voice signal using its statistical attributes and human auditory characteristics in the time domain or the frequency domain. To encode a voice signal, information indicating whether the input voice signal is a voiced sound or an unvoiced sound is typically used. Methods of detecting a voiced sound and an unvoiced sound from an input voice signal can be divided into methods performed in the time domain and methods performed in the frequency domain. The time-domain methods use at least one of a frame average energy of the voice signal and a zero-crossing rate, possibly in combination, and the frequency-domain methods use information on low frequency and high frequency components of the voice signal or pitch harmonic information. In a clean environment, these conventional methods achieve satisfactory detection performance. In a white noise environment, however, their detection performance deteriorates considerably.
Embodiments of the present invention provide an apparatus, method, and medium for detecting a voiced sound zone and an unvoiced sound zone from a voice signal in a block, preferably by dividing the voice signal into blocks of a predetermined size and using a spectral flatness measure (SFM) and a slope of a mel-scaled filter bank spectrum obtained from the voice signal existing in the block.
To achieve the above and/or other aspects and advantages, embodiments of the present invention include a method of detecting a voiced sound and an unvoiced sound, the method including dividing an input signal into block units, calculating a slope and a spectral flatness measure (SFM) of a mel-scaled filter bank spectrum, calculating a first parameter to determine the voiced sound and a second parameter to determine the unvoiced sound by using the slope and the spectral flatness measure (SFM) of the mel-scaled filter bank spectrum of the input signal existing in a block, and determining a voiced sound zone and an unvoiced sound zone in the block by comparing the first and the second parameters to predetermined threshold values.
The calculating of the slope and SFM may include calculating the slope by modeling the mel-scaled filter bank spectrum as a first order function, and calculating the SFM using a geometric average and an arithmetic average of a spectrum obtained by removing the slope from the mel-scaled filter bank spectrum.
The determining of the voiced sound zone and the unvoiced sound zone may include comparing a first signal waveform obtained by applying the first parameter obtained from the slope to the input signal of the block and a first threshold value, comparing a second signal waveform obtained by applying the second parameter obtained from the slope and SFM to the input signal of the block and a second threshold value, determining a zone, which has a value larger than the first threshold value in the first signal waveform as a result of the comparing of the first signal waveform and the first threshold value, as a voiced sound zone, and determining a zone, which has a value larger than the second threshold value in the second signal waveform as a result of the comparing of the second signal waveform and the second threshold value, as an unvoiced sound zone.
The first parameter may be obtained using a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum.
The first parameter may be obtained using a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum and a second slope calculated at a predetermined low frequency area of the entire frequency area.
The first parameter may be obtained using a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum, a second slope calculated at a predetermined low frequency area of the entire frequency area, and a third slope calculated at a predetermined high frequency area of the entire frequency area.
The second parameter may be obtained by a difference between the SFM and the slope calculated at the entire frequency area of the mel-scaled filter bank spectrum.
To achieve the above and/or other aspects and advantages, embodiments of the present invention include an apparatus for detecting a voiced sound and an unvoiced sound, the apparatus including a blocking unit for dividing an input signal into block units, a parameter calculator for calculating a first parameter to determine the voiced sound and a second parameter to determine the unvoiced sound by using a slope and spectral flatness measure (SFM) of a mel-scaled filter bank spectrum of the input signal existing in a block, and a determiner for determining a voiced sound zone and an unvoiced sound zone in the block by comparing the first and second parameters to predetermined threshold values.
The parameter calculator may include a first spectrum acquisitor obtaining a mel-scaled filter bank spectrum from an input signal existing in a block provided from the blocking unit, a first parameter calculator calculating a slope of the mel-scaled filter bank spectrum provided from the first spectrum acquisitor and a first parameter to determine the voiced sound using the slope, a second spectrum acquisitor obtaining a second spectrum in which the slope at an entire frequency area is removed from the mel-scaled filter bank spectrum, and a second parameter calculator calculating a spectral flatness measure (SFM) of the second spectrum provided from the second spectrum acquisitor and a second parameter to determine the unvoiced sound using the slope and SFM.
The first parameter calculator may set a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum as the first parameter.
The first parameter calculator may add a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum to a second slope calculated at a predetermined low frequency area of the entire frequency area, and then set the added result as the first parameter.
The first parameter calculator may add a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum, a second slope calculated at a predetermined low frequency area of the entire frequency area, and a third slope calculated at a predetermined high frequency area of the entire frequency area, and then set the added result as the first parameter.
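The three-slope variant can be illustrated with a minimal sketch. The helper names `ls_slope` and `first_parameter_three_slopes`, the `split` index separating the low and high frequency areas, and the sample spectrum values are all hypothetical; the text does not specify where the areas are divided, and the slopes are summed here without any sign adjustment.

```python
def ls_slope(values):
    """Least-squares slope of values[k] against the index k (needs >= 2 points)."""
    n = len(values)
    mean_k = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((k - mean_k) * (v - mean_v) for k, v in enumerate(values))
    den = sum((k - mean_k) ** 2 for k in range(n))
    return num / den


def first_parameter_three_slopes(spectrum, split):
    """Sum of the slopes over the entire, low, and high frequency areas.

    `split` is a hypothetical filter bank index dividing low from high.
    """
    full = ls_slope(spectrum)
    low = ls_slope(spectrum[:split])
    high = ls_slope(spectrum[split:])
    return full + low + high


# Made-up filter bank outputs that fall off faster at low frequencies.
spectrum = [12.0, 10.0, 8.0, 6.0, 5.0, 4.0, 3.0, 2.0]
p1_sum = first_parameter_three_slopes(spectrum, split=4)
```

Dropping the `high` term gives the two-slope variant, and using `full` alone gives the single-slope variant.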
The second parameter calculator may set a difference between the SFM and the slope calculated at the entire frequency area of the mel-scaled filter bank spectrum as the second parameter.
The determiner may compare a first signal waveform, obtained by applying the first parameter obtained from the slope to the input signal of the block, to a first threshold value and determine a zone, which has a value larger than the first threshold value in the first signal waveform as a result of the comparison, as a voiced sound zone.
The determiner may compare a second signal waveform, obtained by applying the second parameter obtained from the slope and SFM to the input signal of the block, to a second threshold value and determine a zone, which has a value larger than the second threshold value in the second signal waveform as a result of the comparison, as an unvoiced sound zone.
To achieve the above and/or other aspects and advantages, embodiments of the present invention include a medium which includes computer-readable instructions for detecting a voiced sound and an unvoiced sound, the instructions including dividing an input signal into block units, calculating a slope and a spectral flatness measure (SFM) of a mel-scaled filter bank spectrum, calculating a first parameter to determine the voiced sound and a second parameter to determine the unvoiced sound by using the slope and the spectral flatness measure (SFM) of the mel-scaled filter bank spectrum of the input signal existing in a block, and determining a voiced sound zone and an unvoiced sound zone in the block by comparing the first and the second parameters to predetermined threshold values.
Calculating the slope and SFM may include calculating the slope by modeling the mel-scaled filter bank spectrum as a first order function, and calculating the SFM using a geometric average and an arithmetic average of a spectrum obtained by removing the slope from the mel-scaled filter bank spectrum.
Determining the voiced sound zone and the unvoiced sound zone may include comparing a first signal waveform obtained by applying the first parameter obtained from the slope to the input signal of the block and a first threshold value, comparing a second signal waveform obtained by applying the second parameter obtained from the slope and SFM to the input signal of the block and a second threshold value, determining a zone, which has a value larger than the first threshold value in the first signal waveform as a result of the comparing of the first signal waveform and the first threshold value, as a voiced sound zone, and determining a zone, which has a value larger than the second threshold value in the second signal waveform as a result of the comparing of the second signal waveform and the second threshold value, as an unvoiced sound zone.
The first parameter may be obtained using a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum.
The first parameter may be obtained using a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum and a second slope calculated at a predetermined low frequency area of the entire frequency area.
The first parameter may be obtained using a first slope calculated at an entire frequency area of the mel-scaled filter bank spectrum, a second slope calculated at a predetermined low frequency area of the entire frequency area, and a third slope calculated at a predetermined high frequency area of the entire frequency area.
The second parameter may be obtained by a difference between the SFM and the slope calculated at the entire frequency area of the mel-scaled filter bank spectrum.
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:
Reference will now be made in detail to exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below to explain the present invention by referring to the figures.
Referring to
The blocking unit 220 reconfigures the voice data output from the filtering unit 210 into frames by dividing the voice data at a constant time interval, each frame having a predetermined number of samples, and configures blocks, each block including a frame and a predetermined number of additional samples extending from the frame, for example, a 15 msec extended period. For example, if the size of a frame is 10 msec, the size of a block is 25 msec.
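The blocking step can be sketched as follows. This is only an illustration: the function name `split_into_blocks`, the 8 kHz sample rate, and the choice to place the 15 msec extension after the frame are assumptions, since the text gives only the 10 msec frame and 25 msec block sizes.

```python
def split_into_blocks(samples, sample_rate=8000, frame_ms=10, extension_ms=15):
    """Return overlapping blocks, each one frame plus the samples that follow it."""
    frame_len = sample_rate * frame_ms // 1000
    block_len = sample_rate * (frame_ms + extension_ms) // 1000
    blocks = []
    # Advance by one frame per step; stop when a full block no longer fits.
    for start in range(0, len(samples) - block_len + 1, frame_len):
        blocks.append(samples[start:start + block_len])
    return blocks


signal = list(range(800))  # 100 msec of dummy samples at the assumed 8 kHz
blocks = split_into_blocks(signal)
```

With these assumed values a frame holds 80 samples and a block holds 200, so consecutive blocks overlap by 120 samples.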
The first spectrum acquisitor 230 receives the voice data in units of blocks configured by the blocking unit 220 and obtains a mel-scaled filter bank spectrum of the voice data. This will be described in detail with reference to
The first parameter calculator 240 calculates a slope of the first spectrum X(k) output from the first spectrum acquisitor 230. This will be described in detail with reference to
Y(k)=aX(k)+b Equation 1
Slope a and constant b are obtained by line fitting of the first order function. Line fitting is described in William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling, “Numerical Recipes in FORTRAN 77,” Feb. 1993, and a detailed description is omitted here. Since the obtained slope commonly has a negative value for a voiced sound, the obtained slope is adjusted to have a positive value by multiplying it by −1, and the adjusted slope is set as a first parameter p1 for voiced sound discrimination.
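The line fitting and sign adjustment can be sketched with an ordinary least-squares fit. The sketch treats the filter bank index k as the independent variable, which is an assumption about how Equation 1 is applied; the names `fit_line` and `first_parameter` and the sample values are hypothetical.

```python
def fit_line(spectrum):
    """Least-squares fit of spectrum[k] ~ a*k + b; returns (a, b)."""
    n = len(spectrum)
    mean_k = (n - 1) / 2
    mean_x = sum(spectrum) / n
    sxx = sum((k - mean_k) ** 2 for k in range(n))
    sxy = sum((k - mean_k) * (x - mean_x) for k, x in enumerate(spectrum))
    a = sxy / sxx
    b = mean_x - a * mean_k
    return a, b


def first_parameter(spectrum):
    """p1 is the fitted slope multiplied by -1, as described in the text."""
    a, _ = fit_line(spectrum)
    return -a


# A made-up falling spectrum of the kind the text associates with voiced sound.
falling = [10.0, 8.0, 6.0, 4.0, 2.0]
a, b = fit_line(falling)
p1 = first_parameter(falling)
```

For this input the fitted slope is −2 and the intercept is 10, so p1 is 2.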
As an embodiment for setting the first parameter p1, a first slope obtained at an entire filter bank zone can be used. As another embodiment for setting the first parameter p1, besides the first slope, second and third slopes obtained by dividing the entire filter bank zone into a low frequency band area and a high frequency band area and performing the line fitting on each area can be used. This will be described later with reference to
The second spectrum acquisitor 250 obtains a second spectrum Z(k) shown in
In this equation, Xm(k) indicates an average of the first spectrum X(k).
The second parameter calculator 260 calculates a spectral flatness measure (SFM) of the second spectrum output from the second spectrum acquisitor 250. The SFM can be defined as shown in Equation 3.
In this equation, GM indicates a geometric mean of the second spectrum Z(k), and AM indicates an arithmetic mean of the second spectrum Z(k), and they can be defined as shown in Equation 4.
In this equation, P indicates the number of used filter banks.
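As an illustration of the geometric and arithmetic means, the following sketch computes a flatness value as the plain ratio GM/AM. That ratio is the common definition and is assumed here; the exact form of Equation 3 (for example, whether a logarithm is applied) is not reproduced in this text. The function name `spectral_flatness` is hypothetical, and the inputs must be strictly positive.

```python
import math


def spectral_flatness(z):
    """Ratio of geometric mean to arithmetic mean over P filter bank outputs."""
    p = len(z)
    gm = math.exp(sum(math.log(v) for v in z) / p)  # geometric mean via logs
    am = sum(z) / p                                  # arithmetic mean
    return gm / am


flat = [4.0, 4.0, 4.0, 4.0]     # perfectly flat spectrum
peaky = [1.0, 1.0, 16.0, 1.0]   # one dominant filter bank output
sfm_flat = spectral_flatness(flat)
sfm_peaky = spectral_flatness(peaky)
```

Under this assumed definition a flat spectrum gives a value of 1, and a spectrum dominated by a few filter bank outputs gives a value closer to 0.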
A second parameter p2 for unvoiced sound discrimination is calculated using the calculated SFM and slope as shown in Equation 5.
p2=SFM−λa Equation 5
In this equation, λ is a constant that sets what proportion of the slope is reflected in the second parameter. The value of λ is approximately equal to 1. In the present exemplary embodiment, λ may preferably be equal to 0.75.
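Equation 5 can be worked through with a few lines of code. The function name `second_parameter` and the input values are hypothetical, and the slope is used exactly as written in Equation 5; the text does not state whether the sign-adjusted slope is intended instead.

```python
def second_parameter(sfm, slope, lam=0.75):
    """p2 = SFM - lambda * a, with lambda = 0.75 as in the exemplary embodiment."""
    return sfm - lam * slope


p2_positive_slope = second_parameter(sfm=0.9, slope=0.2)   # 0.9 - 0.15
p2_negative_slope = second_parameter(sfm=0.3, slope=-0.4)  # 0.3 + 0.30
```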
The determiner 270 compares the first parameter p1 for voiced sound discrimination obtained by the first parameter calculator 240 to a first threshold value θ1, and the second parameter p2 for unvoiced sound discrimination obtained by the second parameter calculator 260 to a second threshold value θ2. The determiner 270 determines whether a voice signal of a relevant block indicates a voiced sound zone or an unvoiced sound zone according to the comparison results. The first threshold value θ1 and second threshold value θ2 are experimentally or empirically obtained in advance in the silent zone. A zone in which the first parameter p1 is larger than the first threshold value θ1 is determined to be the voiced sound zone, and a zone in which the first parameter p1 is smaller than the first threshold value θ1 is determined to be the unvoiced sound or the silent zone. That is, in the voiced sound zone, the slope a has a negative value, and in the unvoiced sound or the silent zone, the slope a has a positive value or a value close to 0. On the other hand, a zone in which the second parameter p2 is larger than the second threshold value θ2 is determined to be the unvoiced sound zone, and a zone in which the second parameter p2 is smaller than the second threshold value θ2 is determined to be the voiced sound or the silent zone. That is, in the voiced sound zone, the SFM is small and the slope a has a negative value; in the unvoiced sound zone, the SFM and slope a are large; and in the silent zone, the SFM is small and the slope a is close to 0.
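The threshold comparisons can be sketched as below. The function names `classify_block` and `zones`, the per-block parameter values, and the threshold values of 0.5 are hypothetical; in the text the thresholds are obtained experimentally or empirically in advance.

```python
def classify_block(p1, p2, theta1, theta2):
    """Apply the two comparisons independently to one block."""
    return {"voiced": p1 > theta1, "unvoiced": p2 > theta2}


def zones(flags):
    """Group consecutive True flags into (start, end) block ranges, end exclusive."""
    result, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            result.append((start, i))
            start = None
    if start is not None:
        result.append((start, len(flags)))
    return result


# Made-up parameter values for five consecutive blocks.
p1_track = [0.1, 0.9, 1.2, 0.2, 0.1]
p2_track = [0.2, 0.1, 0.1, 0.8, 0.9]
theta1, theta2 = 0.5, 0.5
decisions = [classify_block(p1, p2, theta1, theta2)
             for p1, p2 in zip(p1_track, p2_track)]
voiced_zones = zones([d["voiced"] for d in decisions])
unvoiced_zones = zones([d["unvoiced"] for d in decisions])
```

In this example blocks 1 and 2 form a voiced zone and blocks 3 and 4 form an unvoiced zone; block 0 exceeds neither threshold and is left to the remaining categories.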
Referring to
In operation 630, the first spectrum X(k) is modeled as a first order function by applying line fitting, and a slope of the first order function is calculated as a first parameter p1 for voiced sound discrimination. In operation 640, a second spectrum Z(k) is obtained by removing the slope from the first spectrum X(k) obtained in operation 620.
In operation 650, an SFM is obtained from a geometric average and an arithmetic average of the second spectrum Z(k) obtained in operation 640, and a second parameter p2 for unvoiced sound discrimination is calculated from the slope of the first spectrum X(k) and the SFM of the second spectrum Z(k).
In operation 660, a zone having a value larger than a first threshold value in a waveform obtained by applying the first parameter p1 to the input signal of the block is determined as a voiced sound zone. In operation 670, a zone having a value larger than a second threshold value in a waveform obtained by applying the second parameter p2 to the input signal of the block is determined as an unvoiced sound zone.
Summarizing the comparison results, a voiced zone and an unvoiced zone can be detected more accurately, both from a pure voice signal without white noise and from a voice signal including white noise, using a detection algorithm according to exemplary embodiments of the present invention.
In the exemplary embodiments described above, the first parameter is set by multiplying the calculated slope by −1 so that the waveform obtained from the first parameter can be compared with the waveform obtained from the second parameter. However, the calculated slope itself may also be set as the first parameter.
Exemplary embodiments may be embodied in a general-purpose computing device by running computer readable code from a medium, e.g. a computer-readable medium, including but not limited to storage media such as magnetic storage media (ROMs, RAMs, floppy disks, magnetic tapes, etc.), optically readable media (CD-ROMs, DVDs, etc.), and carrier waves (transmission over the Internet). Exemplary embodiments may be embodied as a medium having a computer readable program code unit embodied therein for causing a number of computer systems connected via a network to effect distributed processing. The network may be a wired network, a wireless network, or any combination thereof. Functional programs, codes, and code segments for embodying the present invention may be easily deduced by programmers in the art to which the present invention belongs.
As described above, according to exemplary embodiments of the present invention, a voiced sound zone and an unvoiced sound zone are determined from an input signal in a block by dividing the input signal into blocks of a predetermined size and using a spectral flatness measure (SFM) and slope of a mel-scaled filter bank spectrum obtained from the input signal existing in the block. As a result, the accuracy of discrimination between the voiced sound and the unvoiced sound is excellent, and the discrimination performance is particularly outstanding in a white noise environment. Also, since a voiced sound zone and an unvoiced sound zone are determined using the mel-scaled filter banks already used for voice recognition, costly hardware or software does not have to be added, and accordingly, implementation costs are low.
The apparatus, method, and medium for detecting a voiced sound zone and an unvoiced sound zone according to exemplary embodiments of the present invention can be applied to various fields such as voice detection for voice recognition, prosody information extraction for interactive voice recognition, voice encoding, and removal of mixed-in noise.
While the above exemplary embodiments provide variable length coding of the input video data, it will be understood by those skilled in the art that fixed length coding of the input video data may be embodied without departing from the spirit and scope of the invention.
Thus, although a few exemplary embodiments of the present invention have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
10-2004-0008740 | Feb 2004 | KR | national |