This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2017-164725, filed on Aug. 29, 2017, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a method, an information processing apparatus for processing speech, and a non-transitory computer-readable storage medium.
In recent years, many companies have desired to obtain information regarding customers' (or employees') emotions from conversation between the employees and the customers in order to estimate the customers' satisfaction levels and improve the quality of marketing. Human emotions are often exhibited through voice, and the pitch (pitch frequency) of voice is one of the important factors in identifying human emotions.
An example of the related art for estimating a pitch frequency will be described.
The frequency conversion unit 10 is a processing unit that calculates a frequency spectrum of an input speech sound by performing a Fourier transform on the input speech sound. The frequency conversion unit 10 outputs the frequency spectrum of the input speech sound to the correlation calculation unit 11. In the following description, a frequency spectrum of an input speech sound will be referred to as an “input spectrum”.
The correlation calculation unit 11 is a processing unit that calculates correlation values between cosine waves of various frequencies and an input spectrum. The correlation calculation unit 11 outputs information in which the frequencies of the cosine waves and the correlation values are associated with each other to the identification unit 12.
The identification unit 12 is a processing unit that outputs the frequency of a cosine wave associated with a largest correlation value as a pitch frequency.
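For reference, the following Python listing is a minimal sketch of the related-art pipeline formed by the frequency conversion unit 10, the correlation calculation unit 11, and the identification unit 12. The function name, the 1-Hz search step, and the use of a plain inner product as the correlation value are illustrative assumptions, not details taken from the related art.

import numpy as np

def related_art_pitch(x, fs, fmin=50.0, fmax=1000.0):
    # Frequency conversion unit 10: amplitude spectrum of the input frame.
    spectrum = np.abs(np.fft.rfft(x))
    N = len(x)
    k = np.arange(len(spectrum))
    best_f, best_c = 0.0, float('-inf')
    for f in np.arange(fmin, fmax, 1.0):
        # Correlation calculation unit 11: correlate the input spectrum with
        # a cosine wave whose peaks lie at f, 2f, 3f, ... on the frequency axis.
        p = f * N / fs  # bin index corresponding to the frequency f
        c = float(np.dot(spectrum, np.cos(2 * np.pi * k / p)))
        # Identification unit 12: keep the frequency with the largest correlation.
        if c > best_c:
            best_f, best_c = f, c
    return best_f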
The cosine waves 6a and 6b are two of the cosine waves used by the correlation calculation unit 11. The cosine wave 6a has peaks at the frequency f [Hz] and at multiples of the frequency f [Hz] along the frequency axis. The cosine wave 6b has peaks at the frequency 2f [Hz] and at multiples of the frequency 2f [Hz].
The correlation calculation unit 11 calculates a correlation value of 0.95 between the input spectrum 5a and the cosine wave 6a. The correlation calculation unit 11 also calculates a correlation value of 0.40 between the input spectrum 5a and the cosine wave 6b.
The identification unit 12 compares the correlation values and identifies the largest correlation value. In this example, the correlation value of 0.95 for the cosine wave 6a is the largest, and the identification unit 12 therefore outputs the frequency f as the pitch frequency.
Examples of the related art include Japanese National Publication of International Patent Application No. 2002-516420 and Japanese National Publication of International Patent Application No. 2002-515609.
According to an aspect of the invention, a method for processing speech includes: executing an acquiring process that includes acquiring a speech signal; executing a detection process that includes detecting a first spectrum from the speech signal; executing a calculation process that includes calculating a second spectrum based on an envelope of the first spectrum; executing a correction process that includes correcting the first spectrum based on a comparison between a first amplitude of the first spectrum and a second amplitude of the second spectrum; and executing an estimation process that includes estimating a pitch frequency of the speech signal in accordance with a correlation between the corrected first spectrum and periodic signals corresponding to frequencies in a certain band.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
There is a problem in the above-described example of the related art in that it is difficult to improve the accuracy of estimating a pitch frequency.
The input spectrum or some of its harmonics might decrease, for example, due to telephone bandwidth limitations, a surrounding environment, or the like. In this case, it is difficult to accurately estimate a pitch frequency.
The correlation calculation unit 11 calculates a correlation value of 0.70 between the input spectrum 5b and the cosine wave 6a. The correlation calculation unit 11 also calculates a correlation value of 0.80 between the input spectrum 5b and the cosine wave 6b.
The identification unit 12 compares the correlation values and identifies the largest correlation value. In this example, the correlation value of 0.80 for the cosine wave 6b is the largest, and the identification unit 12 therefore outputs the frequency 2f as the pitch frequency.
Although the amplitude of the input spectrum 5b at the frequency f is smaller than an appropriate value, the frequency corresponding to a local maximum in the low band is f. The correct pitch frequency, therefore, is f, and the pitch frequency 2f output from the identification unit 12 is incorrect.
An aspect of the present disclosure aims to provide a speech processing program, a method for processing speech, and a speech processing apparatus capable of improving the accuracy of estimating a pitch frequency.
Embodiments of a speech processing program, a method for processing speech, and a speech processing apparatus disclosed in the present disclosure will be described in detail hereinafter based on the drawings. The present disclosure is not limited by the embodiments.
[First Embodiment]
The microphone 50a is a device that inputs information regarding a collected speech sound to the speech processing apparatus 100. In the following description, information regarding a speech sound input from the microphone 50a to the speech processing apparatus 100 will be referred to as a “speech signal”. A speech signal is an example of an input speech sound.
The display unit 50b is a display device that displays information output from the speech processing apparatus 100. The display unit 50b corresponds to a liquid crystal display, a touch panel, or the like.
The A/D conversion unit 110 is a processing unit that receives a speech signal from the microphone 50a and that performs A/D conversion. More specifically, the A/D conversion unit 110 converts a speech signal (analog signal) into a speech signal (digital signal). The A/D conversion unit 110 outputs the speech signal (digital signal) to the speech file obtaining unit 115 and the detection unit 120. In the following description, a speech signal (digital signal) output from the A/D conversion unit 110 will be simply referred to as a “speech signal”.
The speech file obtaining unit 115 is a processing unit that converts a speech signal into a speech file using a certain speech file format. A speech file includes, for example, information in which time points and the strength of a speech signal are associated with each other. The speech file obtaining unit 115 stores the speech file in a speech file table 160a of the storage unit 160.
The detection unit 120 is a processing unit that detects a frequency spectrum from a speech signal (can also be referred to as an “input speech sound”). The detection unit 120 outputs information regarding the frequency spectrum to the calculation unit 130 and the correction unit 140. In the following description, a frequency spectrum detected from a speech signal will be referred to as an “input spectrum” (can also be referred to as a “first frequency spectrum”).
The detection unit 120 performs a short-time discrete Fourier transform (STFT) on speech signals x(t-T) to x(t) corresponding to frames to detect input spectra X(l, k). The length of each frame is a predetermined length T.
The variables t, l, k, x(t), and X(l, k) will be described. t is a variable indicating time. l is a variable indicating a frame number. k is a variable indicating a band [bin] (k = 0, 1, . . . , T−1). x(t) denotes the speech signal at time t. X(l, k) denotes the input spectrum of the l-th frame in the band k.
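A minimal sketch of this detection, assuming the speech signal is held in a NumPy array and the amplitude spectrum is used in the subsequent processing (both assumptions of this illustration; a window function is omitted for brevity):

import numpy as np

def detect_input_spectrum(x, t, T):
    # STFT of the frame of T samples ending at time t;
    # returns X(l, k) for k = 0, 1, ..., T-1.
    frame = x[t - T:t]
    return np.abs(np.fft.fft(frame))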
The calculation unit 130 is a processing unit that calculates a reference spectrum (can also be referred to as a “second frequency spectrum”) based on an envelope of an input spectrum. The calculation unit 130 calculates a reference spectrum by, for example, smoothing an input spectrum X(l, k) in a frequency direction. The calculation unit 130 outputs information regarding the reference spectrum to the correction unit 140.
For example, the calculation unit 130 uses a Hamming window W(m) of a filter length Q in order to smooth an input spectrum X(l, k) in the frequency direction. The Hamming window W(m) is defined by expression (1). The variable m corresponds to the band [bin] at a time when the Hamming window is disposed on an input spectrum.
The calculation unit 130 obtains a reference spectrum based on expression (2). Although a case in which the Hamming window is used will be described as an example, a Gaussian window or a Blackman window may be used, instead.
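Expressions (1) and (2) are not reproduced in this text, so the following listing restates them under common assumptions: the standard Hamming-window form for expression (1) and a normalized moving average along the frequency direction for expression (2).

import numpy as np

def reference_spectrum(X, Q=31):
    # Expression (1) (assumed standard form): Hamming window of filter length Q.
    m = np.arange(Q)
    W = 0.54 - 0.46 * np.cos(2 * np.pi * m / (Q - 1))
    # Expression (2) (assumed): smooth the input spectrum X(l, k) in the
    # frequency direction with the normalized window.
    return np.convolve(X, W / W.sum(), mode='same')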
The correction unit 140 is a processing unit that corrects an input spectrum based on comparison between the amplitude of the input spectrum (can also be referred to as a “first amplitude”) and the amplitude of a reference spectrum (can also be referred to as a “second amplitude”). In the following description, a corrected input spectrum will be referred to as a “corrected spectrum” (can also be referred to as a “third frequency spectrum”). The correction unit 140 outputs information regarding the corrected spectrum to the estimation unit 150.
The correction unit 140 calculates a difference D(l, k) between the input spectrum and the reference spectrum based on expression (3). In expression (3), X′(l, k) denotes the reference spectrum.

D(l, k) = X(l, k) − X′(l, k) (3)
The correction unit 140 calculates a corrected spectrum Y(l, k) by substituting D(l, k), which indicates a differential spectrum, into expression (4). In expression (4), g(D(l, k)) is a predetermined function.
Y(l, k) = g(D(l, k)) (4)
As a result of this correction, the corrected spectrum Y(l, k) indicates the local maxima of the input spectrum with the same amplitude.
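A sketch of the correction follows, assuming one simple choice of g(.) that realizes the equal-amplitude property described above (the fixed amplitude A and the zero floor are assumptions of this illustration):

import numpy as np

def correct_spectrum(X, X_ref, A=1.0):
    # Expression (3): differential spectrum D(l, k) = X(l, k) - X'(l, k).
    D = X - X_ref
    # Expression (4): one possible g(.) -- give every band in which the
    # input spectrum exceeds the reference spectrum the same amplitude A,
    # so that all local maxima appear with equal height.
    return np.where(D >= 0.0, A, 0.0)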
The estimation unit 150 is a processing unit that estimates a pitch frequency of the speech signal based on correlation between the corrected spectrum and periodic signals corresponding to frequencies in a certain band.
Expression (5) indicates the periodic signals used by the estimation unit 150. Although cosine waves are used as the periodic signals here, periodic signals other than cosine waves may be used, instead. In expression (5), the variable p satisfies a ≤ p ≤ b. a and b are predetermined values corresponding to, for example, the numbers of bins at 50 Hz and at 1,000 Hz, respectively.
S(p, k) = cos(2πk/p) (5)
The estimation unit 150 calculates correlation values C(p) between the corrected spectrum Y(l, k) and periodic signals S(p, k) based on expression (6). The estimation unit 150 calculates the correlation values C(p) corresponding to p while changing p from a to b.
The estimation unit 150 calculates a maximum value M based on expression (7). The estimation unit 150 estimates, as a pitch frequency P, a value of p with which the maximum value M is obtained. When the maximum value M is equal to or larger than a threshold TH, the estimation unit 150 outputs the pitch frequency P. When the maximum value M is smaller than the threshold TH, the estimation unit 150 outputs the pitch frequency P as 0.
M = max(C(p), a ≤ p ≤ b) (7)
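The estimation may be sketched as follows; the inner product used for expression (6) is an assumption of this illustration, and p (and hence the returned P) is expressed in bins:

import numpy as np

def estimate_pitch(Y, a, b, TH):
    k = np.arange(len(Y))
    # Expressions (5) and (6): correlation values C(p) between the corrected
    # spectrum Y(l, k) and the periodic signals S(p, k) = cos(2*pi*k/p).
    C = np.array([np.dot(Y, np.cos(2 * np.pi * k / p)) for p in range(a, b + 1)])
    M = C.max()                  # expression (7)
    P = a + int(C.argmax())      # p at which the maximum value M is obtained
    return P if M >= TH else 0   # output 0 when M is smaller than TH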
The estimation unit 150 repeatedly performs the above process for each frame, associates frame numbers and pitch frequencies with each other, and registers the frame numbers and the pitch frequencies to the pitch frequency table 160b.
The storage unit 160 includes the speech file table 160a and the pitch frequency table 160b. The storage unit 160 corresponds to a semiconductor memory such as a random-access memory (RAM), a read-only memory (ROM), or a flash memory or a storage device such as a hard disk drive (HDD).
The speech file table 160a holds speech files output from the speech file obtaining unit 115.
The pitch frequency table 160b holds information regarding pitch frequencies output from the estimation unit 150. The pitch frequency table 160b associates, for example, frame numbers and pitch frequencies with each other.
The output unit 170 is a processing unit that outputs screen information regarding pitch frequencies to the display unit 50b to display the screen information on the display unit 50b.
Alternatively, the output unit 170 may evaluate a speech signal based on a plurality of pitch frequencies stored in the pitch frequency table 160b and display a result of the evaluation on the screen information 60. If a difference between two selected pitch frequencies is equal to or larger than a threshold, for example, the corresponding speech has a pleasant lilt, and the output unit 170 sets a result 60a of evaluation, namely “Good!”, to the screen information 60. The output unit 170 may also perform evaluation based on a table (not illustrated) in which characteristics of changes in the pitch frequency and results of evaluation are associated with each other.
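A sketch of one such evaluation, assuming the difference is taken between the largest and smallest pitch frequencies in the evaluated period and frames output as 0 are ignored (both assumptions of this illustration):

def evaluate_lilt(pitch_frequencies, threshold):
    # Set "Good!" when two selected pitch frequencies differ by at least
    # the threshold, which indicates a pleasant lilt.
    valid = [p for p in pitch_frequencies if p > 0]
    if valid and max(valid) - min(valid) >= threshold:
        return "Good!"
    return ""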
The A/D conversion unit 110, the speech file obtaining unit 115, the detection unit 120, the calculation unit 130, the correction unit 140, the estimation unit 150, and the output unit 170 described above are achieved by, for example, a processor such as a central processing unit (CPU) executing a program.
Next, an example of a processing procedure used by the speech processing apparatus 100 according to the first embodiment will be described.
The calculation unit 130 of the speech processing apparatus 100 calculates a reference spectrum (step S103). The correction unit 140 of the speech processing apparatus 100 corrects the input spectrum to calculate a corrected spectrum (step S104).
The estimation unit 150 of the speech processing apparatus 100 calculates correlation values between the corrected spectrum and periodic signals corresponding to frequencies in certain bands (step S105). The estimation unit 150 estimates a pitch frequency at which a maximum value of the correlation values is obtained based on the correlation values (step S106).
The output unit 170 of the speech processing apparatus 100 evaluates the speech signal based on pitch frequencies (step S107). The output unit 170 generates screen information and outputs the screen information to the display unit 50b (step S108).
The speech processing apparatus 100 determines whether the speech sound has ended (step S109). If the speech sound has not ended (NO in step S109), the speech processing apparatus 100 returns to step S101. If the speech sound has ended (YES in step S109), on the other hand, the speech processing apparatus 100 ends the process.
Next, advantageous effects produced by the speech processing apparatus 100 according to the first embodiment will be described. The speech processing apparatus 100 calculates a reference spectrum based on an envelope of an input spectrum of a speech signal and calculates a corrected spectrum by comparing the input spectrum and the reference spectrum. The speech processing apparatus 100 estimates a pitch frequency of the speech signal based on correlation values between the corrected spectrum and periodic signals corresponding to frequencies in certain bands. Since the corrected spectrum indicates local maxima of the input spectrum with the same amplitude, decreases in the input spectrum or some harmonics do not affect the correlation values, insofar as the local maxima are maintained. As a result, the accuracy of estimating a pitch frequency improves.
The speech processing apparatus 100 according to the first embodiment, on the other hand, corrects the input spectrum 7a to calculate the corrected spectrum 9a and estimates a pitch frequency by calculating correlation values between the corrected spectrum 9a and the periodic signals. The corrected spectrum 9a indicates local maxima with the same amplitude even if there are decreases in the input spectrum 7a or some harmonics. It is therefore possible to appropriately obtain the pitch frequency even if there are decreases in the input spectrum 7a or some harmonics, insofar as the local maxima are maintained.
Although the calculation unit 130 of the speech processing apparatus 100 according to the first embodiment calculates a reference spectrum by smoothing an input spectrum in the frequency direction, a reference spectrum may be calculated by another process, instead. An example of such a process, based on local maxima of the input spectrum, is described in the third embodiment.
[Second Embodiment]
A speech sound uttered by a speaker 1a is collected by the microphone of the mobile terminal 2a, and an obtained speech signal is transmitted to the recording apparatus 66 through the branch connector 3. In the following description, a speech signal of the speaker 1a will be referred to as a “first speech signal”.
A speech sound uttered by a speaker 1b is collected by the microphone of the terminal apparatus 2b, and an obtained speech signal is transmitted to the recording apparatus 66 through the branch connector 3. In the following description, a speech signal of the speaker 1b will be referred to as a “second speech signal”.
The recording apparatus 66 records the first and second speech signals. Upon receiving the first speech signal, for example, the recording apparatus 66 converts the first speech signal into a speech file using a certain speech file format and transmits the speech file of the first speech signal to the speech processing apparatus 200. In the following description, a speech file of a first speech signal will also be referred to as a “first speech file”.
Upon receiving the second speech signal, the recording apparatus 66 converts the second speech signal into a speech file using a certain speech file format and transmits the speech file of the second speech signal to the speech processing apparatus 200. In the following description, a speech file of a second speech signal will also be referred to as a “second speech file”.
The speech processing apparatus 200 estimates a pitch frequency of a first speech signal of a first speech file. The speech processing apparatus 200 also estimates a pitch frequency of a second speech signal of a second speech file. Because a process for estimating a pitch frequency of a first speech signal and a process for estimating a pitch frequency of a second speech signal are the same, the process for estimating a pitch frequency of a first speech signal will be described hereinafter. The first and second speech signals will be generically referred to as “speech signals” in the following description.
The reception unit 210 is a processing unit that receives a speech file from the recording apparatus 66. The reception unit 210 registers the received speech file to a speech file table 220a of the storage unit 220. The reception unit 210 corresponds to a communication device.
The storage unit 220 includes the speech file table 220a and a pitch frequency table 220b. The storage unit 220 corresponds to a semiconductor memory such as a RAM, a ROM, or a flash memory or a storage device such as an HDD.
The detection unit 230 is a processing unit that obtains a speech file (speech signal) from the speech file table 220a and that detects an input spectrum (frequency spectrum) from the obtained speech signal. The detection unit 230 outputs information regarding the detected input spectrum to the calculation unit 240 and the correction unit 250. A process for detecting an input spectrum from a speech signal performed by the detection unit 230 is the same as that performed by the detection unit 120 according to the first embodiment.
The calculation unit 240 is a processing unit that calculates a reference spectrum based on an envelope of an input spectrum. The calculation unit 240 outputs information regarding the reference spectrum to the correction unit 250. A process for calculating a reference spectrum based on an input spectrum is the same as that performed by the calculation unit 130 according to the first embodiment.
The correction unit 250 is a processing unit that corrects an input spectrum based on comparison between the amplitude of the input spectrum and the amplitude of a reference spectrum. A process for correcting an input spectrum and calculating a corrected spectrum performed by the correction unit 250 is the same as that performed by the correction unit 140 according to the first embodiment. The correction unit 250 outputs information regarding a corrected spectrum to the estimation unit 260.
The estimation unit 260 is a processing unit that estimates a pitch frequency of a speech signal based on correlation between a corrected spectrum and periodic signals corresponding to frequencies in certain bands. The estimation unit 260 calculates correlation values C(p) between the corrected spectrum and the periodic signals and identifies a frequency p at which a maximum value M of the correlation values C(p) is obtained in the same manner as the estimation unit 150 according to the first embodiment. In the following description, the frequency p at which the maximum value M of the correlation values C(p) is obtained will be denoted by P.
Furthermore, if the following first and second conditions are satisfied, the estimation unit 260 determines the frequency P as the pitch frequency (a sketch of this determination follows the conditions). If either the first condition or the second condition is not satisfied, on the other hand, the estimation unit 260 outputs the pitch frequency as 0. In the second condition, X(l, P) denotes the amplitude, at the frequency P, of the input spectrum whose frame number is l, namely the current analysis target.
First condition: The maximum value M be equal to or larger than a threshold TH1.
Second condition: X(l, P), X(l, 2P), and X(l, 3P) be equal to or larger than a threshold TH2.
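A sketch of this determination, assuming X_l holds the amplitudes X(l, k) of the current frame, P is expressed in bins, and 3P stays within the analyzed band (bounds checking is omitted):

def confirm_pitch(M, P, X_l, TH1, TH2):
    # First condition: the maximum value M is equal to or larger than TH1.
    # Second condition: X(l, P), X(l, 2P), and X(l, 3P) are equal to or
    # larger than TH2, i.e., the harmonics keep sufficient amplitude.
    if M >= TH1 and all(X_l[n * P] >= TH2 for n in (1, 2, 3)):
        return P
    return 0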
The estimation unit 260 associates the frame number and the pitch frequency with each other and registers the frame number and the pitch frequency to the pitch frequency table 220b.
The detection unit 230, the calculation unit 240, the correction unit 250, and the estimation unit 260 repeatedly perform the above process while updating a position at which a speech file is analyzed. If a current analysis start position is denoted by u, for example, a next analysis start position is u+T. T denotes a predetermined length of each frame.
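A sketch of this frame-by-frame analysis loop (the callback style is an assumption of this illustration):

def analyze_speech_file(x, T, process_frame):
    # If the current analysis start position is u, the next one is u + T.
    u = 0
    while u + T <= len(x):
        process_frame(x[u:u + T])
        u += T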
Next, an example of a processing procedure used by the speech processing apparatus 200 according to the second embodiment will be described.
The detection unit 230 detects an input spectrum (step S203). The calculation unit 240 of the speech processing apparatus 200 calculates a reference spectrum (step S204). The correction unit 250 of the speech processing apparatus 200 corrects an input spectrum to calculate a corrected spectrum (step S205).
The estimation unit 260 of the speech processing apparatus 200 calculates correlation values between the corrected spectrum and periodic signals corresponding to frequencies in certain bands (step S206). The estimation unit 260 estimates a pitch frequency at which a maximum value of the correlation values is obtained based on the correlation values (step S207). If the first and second conditions are satisfied in step S207, the estimation unit 260 estimates the frequency at which the maximum value of the correlation values is obtained as the pitch frequency.
The speech processing apparatus 200 determines whether a speech sound has ended (step S208). If the speech sound has not ended (NO in step S208), the speech processing apparatus 200 updates the analysis start position (step S209) and returns to step S203. If the speech sound has ended (YES in step S208), on the other hand, the speech processing apparatus 200 ends the process.
Next, advantageous effects produced by the speech processing apparatus 200 according to the second embodiment will be described. The speech processing apparatus 200 estimates a pitch frequency of a speech signal based on correlation values between a corrected spectrum and periodic signals corresponding to frequencies in certain bands. Since the corrected spectrum indicates local maxima of the input spectrum with the same amplitude, decreases in the input spectrum or some harmonics do not affect the correlation values, insofar as the local maxima are maintained. As a result, the accuracy of estimating a pitch frequency improves.
In addition, the speech processing apparatus 200 verifies a pitch frequency based on the amplitudes of the input spectrum at integral multiples of the pitch frequency. If X(l, P), X(l, 2P), and X(l, 3P) are equal to or larger than the threshold TH2, for example, the position of the pitch frequency P in the input spectrum corresponds to a position of a local maximum, and the pitch frequency P is appropriate. The pitch frequency P, therefore, is output as it is. If any of X(l, P), X(l, 2P), and X(l, 3P) is smaller than the threshold TH2, on the other hand, the position of the pitch frequency P deviates from a position of a local maximum, and the pitch frequency P is not appropriate. By performing the above process, only pitch frequencies that have been determined to be appropriate are output; 0 may be output for the other pitch frequencies.
[Third Embodiment]
A speech sound uttered by a speaker 1A is collected by the microphone 30a, and an obtained speech signal is output to the speech processing apparatus 300. A speech sound uttered by a speaker 1B is collected by the microphone 30b, and an obtained speech signal is output to the speech processing apparatus 300. A speech sound uttered by a speaker 1C is collected by the microphone 30c, and an obtained speech signal is output to the speech processing apparatus 300.
In the following description, a speech signal of the speaker 1A will be referred to as a “first speech signal”. A speech signal of the speaker 1B will be referred to as a “second speech signal”. A speech signal of the speaker 1C will be referred to as a “third speech signal”.
Speaker information regarding the speaker 1A, for example, is added to the first speech signal. Speaker information is information for uniquely identifying a speaker. Speaker information regarding the speaker 1B is added to the second speech signal. Speaker information regarding the speaker 1C is added to the third speech signal.
The speech processing apparatus 300 records the first to third speech signals. The speech processing apparatus 300 also performs a process for detecting pitch frequencies of the speech signals. The speech processing apparatus 300 associates the speaker information and a pitch frequency in each certain frame with each other and transmits the speaker information and the pitch frequency to the server 400.
The server 400 stores the pitch frequencies of the speaker information received from the speech processing apparatus 300.
The A/D conversion unit 310a is a processing unit that receives a first speech signal from the microphone 30a and that performs A/D conversion. More specifically, the A/D conversion unit 310a converts a first speech signal (analog signal) into a first speech signal (digital signal). The A/D conversion unit 310a outputs the first speech signal (digital signal) to the pitch detection unit 320. In the following description, the first speech signal (digital signal) output from the A/D conversion unit 310a will be simply referred to as a “first speech signal”.
The A/D conversion unit 310b is a processing unit that receives a second speech signal from the microphone 30b and that performs A/D conversion. More specifically, the A/D conversion unit 310b converts a second speech signal (analog signal) into a second speech signal (digital signal). The A/D conversion unit 310b outputs the second speech signal (digital signal) to the pitch detection unit 320. In the following description, the second speech signal (digital signal) output from the A/D conversion unit 310b will be simply referred to as a “second speech signal”.
The A/D conversion unit 310c is a processing unit that receives a third speech signal from the microphone 30c and that performs A/D conversion. More specifically, the A/D conversion unit 310c converts a third speech signal (analog signal) into a third speech signal (digital signal). The A/D conversion unit 310c outputs the third speech signal (digital signal) to the pitch detection unit 320. In the following description, the third speech signal (digital signal) output from the A/D conversion unit 310c will be simply referred to as a “third speech signal”.
The pitch detection unit 320 is a processing unit that calculates a pitch frequency in each certain frame by conducting a frequency analysis on a speech signal. For example, the pitch detection unit 320 conducts a frequency analysis on a first speech signal to detect a first pitch frequency of the first speech signal. The pitch detection unit 320 conducts a frequency analysis on a second speech signal to detect a second pitch frequency of the second speech signal. The pitch detection unit 320 conducts a frequency analysis on a third speech signal to detect a third pitch frequency of the third speech signal.
The pitch detection unit 320 associates the speaker information regarding the speaker 1A and the first pitch frequency in each certain frame with each other and outputs the speaker information and the first pitch frequency to the file obtaining unit 330. The pitch detection unit 320 associates the speaker information regarding the speaker 1B and the second pitch frequency in each certain frame with each other and outputs the speaker information and the second pitch frequency to the file obtaining unit 330. The pitch detection unit 320 associates the speaker information regarding the speaker 1C and the third pitch frequency in each certain frame with each other and outputs the speaker information and the third pitch frequency to the file obtaining unit 330.
The file obtaining unit 330 is a processing unit that generates speech file information by converting information received from the pitch detection unit 320 into a file. The speech file information includes information in which speaker information and a pitch frequency in each certain frame are associated with each other. More specifically, the speech file information includes information in which the speaker information regarding the speaker 1A and the first pitch frequency in each certain frame are associated with each other, information in which the speaker information regarding the speaker 1B and the second pitch frequency in each certain frame are associated with each other, and information in which the speaker information regarding the speaker 1C and the third pitch frequency in each certain frame are associated with each other. The file obtaining unit 330 outputs the speech file information to the transmission unit 340.
The transmission unit 340 obtains the speech file information from the file obtaining unit 330 and transmits the obtained speech file information to the server 400.
Next, the configuration of the pitch detection unit 320 will be described.
The detection section 321 is a processing section that obtains a speech signal and that detects an input spectrum (frequency spectrum) from the obtained speech signal. The detection section 321 outputs information regarding the detected input spectrum to the calculation section 322 and the correction section 323. A process for detecting an input spectrum from a speech signal performed by the detection section 321 is the same as that performed by the detection unit 120 according to the first embodiment.
The calculation section 322 is a processing section that calculates a reference spectrum based on an envelope of an input spectrum. The calculation section 322 outputs information regarding a reference spectrum to the correction section 323. A process for calculating a reference spectrum based on an input spectrum performed by the calculation section 322 may be the same as that performed by the calculation unit 130 according to the first embodiment, or a reference spectrum may be calculated by performing the following process.
The calculation section 322 calculates an ensemble mean AVE of the input spectrum X(l, k) based on expression (8).
The calculation section 322 selects only local maxima larger than the ensemble mean AVE and calculates a spectrum 17 by performing linear interpolation on the selected local maxima. It is assumed, for example, that the local maxima Lm1 to Lm4 are larger than the ensemble mean AVE. The calculation section 322 calculates a reference spectrum by translating an envelope of the spectrum 17 by −J1 [dB] in an amplitude direction.
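Expression (8) is not reproduced in this text; the following listing therefore assumes a simple average over the bands as the ensemble mean and sketches the whole alternative calculation (X is the input spectrum of one frame in dB as a NumPy array, and the default value of J1 is an assumption):

import numpy as np

def reference_spectrum_by_peaks(X, J1=10.0):
    # Expression (8) (assumed): ensemble mean AVE of the input spectrum.
    AVE = X.mean()
    # Select only the local maxima larger than the ensemble mean AVE.
    peaks = [k for k in range(1, len(X) - 1)
             if X[k] > X[k - 1] and X[k] > X[k + 1] and X[k] > AVE]
    if not peaks:  # no local maximum above the mean; fall back to a flat envelope
        return np.full(len(X), AVE) - J1
    # Linear interpolation between the selected local maxima gives the
    # envelope; translating it by -J1 [dB] yields the reference spectrum.
    return np.interp(np.arange(len(X)), peaks, X[peaks]) - J1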
The correction section 323 is a processing section that corrects an input spectrum based on comparison between the amplitude of the input spectrum and the amplitude of a reference spectrum. A process for correcting an input spectrum and calculating a corrected spectrum performed by the correction section 323 is the same as that performed by the correction unit 140 according to the first embodiment. The correction section 323 outputs information regarding the corrected spectrum to the estimation section 324.
The estimation section 324 is a processing section that estimates a pitch frequency of a speech signal based on correlation between a corrected spectrum and periodic signals corresponding to frequencies in certain bands. The estimation section 324 calculates correlation values C(p) between a corrected spectrum and periodic signals and identifies a frequency p at which a maximum value M of the correlation values C(p) is obtained in the same manner as the estimation unit 150 according to the first embodiment. In the following description, the frequency p at which the maximum value M of the correlation values C(p) is obtained will be denoted by P.
Furthermore, if the following third and fourth conditions are satisfied, the estimation section 324 determines the frequency P as the pitch frequency (a sketch of this determination follows the conditions). If either the third condition or the fourth condition is not satisfied, on the other hand, the estimation section 324 outputs the pitch frequency as 0.
Third condition: The maximum value M be equal to or larger than the threshold TH1.
Fourth condition: Any of differences (P−P1, P−P2, . . . , and P−Pq) between a plurality of pitch frequencies (P1, P2, . . . , and Pq) estimated in a certain period (for example, in past q frames) and the frequency P be smaller than a threshold TH3.
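A sketch of this determination, assuming the differences in the fourth condition are taken as absolute values and that history holds the pitch frequencies P1, . . . , Pq estimated in the past q frames:

def confirm_pitch_with_history(M, P, history, TH1, TH3):
    # Third condition: the maximum value M is equal to or larger than TH1.
    # Fourth condition: P lies within TH3 of at least one pitch frequency
    # estimated in the past q frames (temporal continuity).
    if M >= TH1 and any(abs(P - Pi) < TH3 for Pi in history):
        return P
    return 0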
The estimation section 324 associates speaker information regarding a speaker and a pitch frequency with each other and outputs the speaker information and the pitch frequency to the file obtaining unit 330. Each time the estimation section 324 estimates a pitch frequency, the estimation section 324 stores information regarding the estimated pitch frequency in the storage section 325. For example, the estimation section 324 may sequentially store information regarding an estimated pitch frequency in the storage section 325.
The storage section 325 stores information regarding pitch frequencies. The storage section 325 corresponds to a semiconductor memory such as a RAM, a ROM, or a flash memory or a storage device such as an HDD.
Next, an example of a processing procedure used by the pitch detection unit 320 according to the third embodiment will be described.
The estimation section 324 of the pitch detection unit 320 calculates correlation values between the corrected spectrum and periodic signals corresponding to frequencies in certain bands (step S305). The estimation section 324 estimates, based on the correlation values, a pitch frequency at which a maximum value of the correlation values is obtained (step S306).
The pitch detection unit 320 determines whether a speech sound has ended (step S307). If the speech sound has not ended (NO in step S307), the pitch detection unit 320 returns to step S301. If the speech sound has ended (YES in step S307), on the other hand, the pitch detection unit 320 ends the process.
Next, advantageous effects produced by the speech processing apparatus 300 according to the third embodiment will be described. The speech processing apparatus 300 estimates a pitch frequency of a speech signal based on correlation values between a corrected spectrum and periodic signals corresponding to frequencies in certain bands. Since the corrected spectrum indicates local maxima of the input spectrum with the same amplitude, decreases in the input spectrum or some harmonics do not affect the correlation values, insofar as the local maxima are maintained. As a result, the accuracy of estimating a pitch frequency improves.
In addition, if pitch frequencies output in the past q frames are denoted by P1, P2, . . . , and Pq, and if any of P−P1, P−P2, . . . , and P−Pq is smaller than the threshold TH3, the speech processing apparatus 300 outputs the pitch frequency P. If the pitch frequency P deviates due to noise or the like, for example, the above condition is not satisfied, and the incorrect pitch frequency P is not output.
Next, an example of the hardware configuration of a computer that achieves the same functions as those of the speech processing apparatuses 100, 200, and 300 according to the above embodiments will be described.
The computer 500 includes a CPU 501 that performs various kinds of arithmetic processing, a RAM 507 into which programs are loaded, and a hard disk device 508 that stores the programs.
The hard disk device 508 includes a detection program 508a, a calculation program 508b, a correction program 508c, and an estimation program 508d. The CPU 501 reads the detection program 508a, the calculation program 508b, the correction program 508c, and the estimation program 508d and loads the detection program 508a, the calculation program 508b, the correction program 508c, and the estimation program 508d into the RAM 507.
The detection program 508a functions as a detection process 507a. The calculation program 508b functions as a calculation process 507b. The correction program 508c functions as a correction process 507c. The estimation program 508d functions as an estimation process 507d.
The detection process 507a corresponds to the process performed by the detection units 120 and 230 and the detection section 321. The calculation process 507b corresponds to the process performed by the calculation units 130 and 240 and the calculation section 322. The correction process 507c corresponds to the process performed by the correction units 140 and 250 and the correction section 323. The estimation process 507d corresponds to the process performed by the estimation units 150 and 260 and the estimation section 324.
The programs 508a to 508d do not have to be stored in the hard disk device 508 in advance. For example, the programs 508a to 508d may be stored in a portable physical medium such as a flexible disk (FD), a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a magneto-optical (MO) disk, or an integrated circuit (IC) card insertable into the computer 500. The computer 500 may then read and execute the programs 508a to 508d.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind
---|---|---|---
2017-164725 | Aug. 29, 2017 | JP | national
Number | Date | Country
---|---|---
2002-515609 | May 2002 | JP
2002-516420 | Jun 2002 | JP
Number | Date | Country
---|---|---
20190066714 A1 | Feb 2019 | US