This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2008-313607 filed on Dec. 9, 2008, the entire contents of which are incorporated herein by reference.
This invention relates to a voice processing technique that, in a voice communication system, changes an acoustic feature quantity of a received voice to make the received voice easier to hear.
For example, Japanese Patent Laid-Open Publication No. 9-152890 discloses a method for a voice communication system in which, when a user desires low-speed conversation, the speaking speed of a received voice is reduced in accordance with the difference in speaking speed between the received voice and a transmitted voice, whereby the received voice is made easy to hear.
A speed difference calculation part 704 detects a difference in speed between the speaking speeds calculated by the speaking speed calculation parts 701 and 703. A speaking speed conversion part 705 then converts the speaking speed of the receiving signal based on a control signal corresponding to the speed difference calculated by the speed difference calculation part 704 and outputs a signal, which is obtained by the conversion and serves as a received voice, from a speaker 706 including an amplifier.
When a predetermined receiving volume is used, a received voice is sometimes buried in ambient noise and may thus be hard to hear. To make the received voice easy to hear, the speaker must then speak loudly, or the hearer must manually adjust the receiving volume, for example by turning it up. To address this, Japanese Patent Laid-Open Publication No. 6-252987 discloses a method of automatically making a received voice easy to hear. This method uses the tendency of a hearer to speak louder when a received voice is hard to hear (the Lombard effect): when the transmitted voice level is not less than a predetermined reference value, the receiving volume is increased, whereby the received voice is automatically made easy to hear.
A received voice amplifying part 809 controls an amplification degree of a received signal, which is received from the communication network 801 through the communication interface part 802, based on the control signal of the received voice level output from the received voice level management part 808.
The receiving part 806 then outputs a received voice from a speaker (not shown) based on the received signal with the controlled received voice level received from the received voice amplifying part 809.
A voice processing apparatus, which processes a first voice signal, includes: an acoustic analysis part which analyzes a feature quantity of an input second voice signal; a reference range calculation part which calculates a reference range based on the feature quantity; a comparing part which compares the feature quantity and the reference range and outputs a comparison result; and a voice processing part which processes and outputs the input first voice signal based on the comparison result.
Hereinafter, a best mode for carrying out the invention will be described in detail with reference to the drawings.
A reference range calculation part 102 performs statistical processing, such as calculation of an average value and a dispersion, on the feature quantity calculated by the acoustic analysis part 101, and calculates a reference range. A comparing part 103 compares the feature quantity calculated by the acoustic analysis part 101 with the reference range calculated by the reference range calculation part 102, and outputs the comparison result.
Based on the comparison result output by the comparing part 103, a voice processing part 104 applies a specific processing treatment to the signal of the input received voice, so that the received voice is processed to be easy to hear, and the voice processing part 104 then outputs the processed received voice. The specific processing treatment includes, for example, sound volume changes, speaking speed conversion, and/or a pitch conversion.
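As a minimal sketch of how a comparison result could drive a sound volume change, the following Python fragment maps the result to a gain and applies it to the received samples. The function names, the three-valued comparison result, the 6 dB step, and the choice to raise the volume when the feature quantity lies above the reference range are assumptions made for illustration only; they are not taken from this description.

```python
from typing import List, Tuple


def compare(feature: float, reference_range: Tuple[float, float]) -> int:
    """Return -1 below the range, 0 inside it, +1 above it (hypothetical encoding)."""
    low, high = reference_range
    if feature < low:
        return -1
    if feature > high:
        return 1
    return 0


def change_volume(received: List[float], comparison: int,
                  step_db: float = 6.0) -> List[float]:
    """Scale the received samples by a gain derived from the comparison result.

    Assumption: +1 raises the level by step_db, -1 lowers it, 0 leaves it as is.
    """
    gain = 10.0 ** (comparison * step_db / 20.0)
    return [sample * gain for sample in received]
```

The same comparison result could equally select a speaking speed conversion or a pitch conversion instead of a gain.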
The voice processing part 104 includes an amplification factor determination part 1041 and an amplitude changing part 1042. The operation of the voice processing apparatus illustrated in
First, in the acoustic analysis part 101, when a signal of a transmitted voice is input (step S301 of
Next, the vowel detecting part 1012 detects a vowel part from the input transmitted voice, which is output from the time division part 1011 and has been time-divided into frame units, with the use of the vowel standard patterns stored in the vowel standard pattern dictionary part 1013. More specifically, the vowel detecting part 1012 calculates LPC (Linear Predictive Coding) cepstral coefficients of each frame obtained by division in the time division part 1011. The vowel detecting part 1012 then calculates, for each frame, a Euclidean distance between the LPC cepstral coefficients and each vowel standard pattern of the vowel standard pattern dictionary part 1013. Each of the vowel standard patterns is previously calculated from the LPC cepstral coefficient of each vowel and is stored in the vowel standard pattern dictionary part 1013. When the minimum value of the Euclidean distance is smaller than a specific threshold value, the vowel detecting part 1012 determines there is a vowel in the frame.
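The distance test described above can be sketched as follows. This is an illustrative Python fragment, not the apparatus itself: it assumes the LPC cepstral coefficients of a frame have already been computed, and the function name, the dictionary layout, and the example values are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def detect_vowel(cepstrum: Sequence[float],
                 standard_patterns: Dict[str, Sequence[float]],
                 threshold: float) -> Optional[str]:
    """Return the label of the nearest vowel standard pattern, or None.

    A vowel is reported only when the minimum Euclidean distance between the
    frame's cepstral coefficients and the patterns is below the threshold.
    """
    best_label: Optional[str] = None
    best_distance = math.inf
    for label, pattern in standard_patterns.items():
        distance = math.dist(cepstrum, pattern)  # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance < threshold else None


# Toy two-dimensional patterns; real patterns would have more coefficients.
patterns = {"a": [1.0, 0.0], "i": [0.0, 1.0]}
print(detect_vowel([0.9, 0.1], patterns, threshold=0.5))
```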
In parallel with the processing performed by the vowel detecting part 1012, the devoiced vowel detecting part 1014 detects a devoiced vowel portion from the input transmitted voice which is output from the time division part 1011 and time-divided into frame units. The devoiced vowel detecting part 1014 detects fricative consonants (such as /s/, /sh/, and /ts/) by zero crossing count analysis. When plosive consonants (such as /p/, /t/, and /k/) follow fricative consonants, the devoiced vowel detecting part 1014 determines there is a devoiced vowel in the input transmitted voice.
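A rough sketch of the two ideas in this paragraph is given below: counting zero crossings in a frame, and counting fricative-then-plosive sequences as devoiced vowels. The function names, the crossing threshold, and the string labels "fricative" and "plosive" are assumptions for illustration; how plosive consonants are detected is not described here and is left outside the sketch.

```python
from typing import List, Sequence


def zero_crossing_count(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def is_fricative_like(frame: Sequence[float], min_crossings: int) -> bool:
    """Hypothetical rule: many zero crossings suggest a noise-like fricative."""
    return zero_crossing_count(frame) >= min_crossings


def count_devoiced_vowels(segment_labels: List[str]) -> int:
    """Count positions where a plosive directly follows a fricative."""
    return sum(1 for a, b in zip(segment_labels, segment_labels[1:])
               if a == "fricative" and b == "plosive")
```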
The speaking speed calculation part 1015 then counts the number of vowels and the devoiced vowels for a specific time based on the outputs of the vowel detecting part 1012 and the devoiced vowel detecting part 1014, whereby the speaking speed calculation part 1015 calculates the speaking speed (step S302 of
The reference range calculation part 102 outputs a reference range with respect to the speaking speed calculated by the acoustic analysis part 101 (step S303 of
Based on the comparison result output from the comparing part 103, the voice processing part 104 inputs the received voice (step S305 of
When the speaking speed is within the reference range, an update part 1022 updates the reference range (95% confidence interval from an average value) in accordance with the following formulae (1) to (4) with use of the speaking speed of the current frame (step S603 of
Reference range=[m−k×SE, m+k×SE] (1)
where the meanings of the symbols in the formulae (1) to (4) are as follows:
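As a rough illustration of formula (1) only, the Python sketch below keeps a running estimate and returns the range [m − k×SE, m + k×SE]. It assumes that m is the running average of the speaking speed, SE is the standard error of that average, and k is a constant (1.96 for a 95% interval); these readings of the symbols are assumptions. Formulae (2) to (4) are not reproduced in this text, so the update step uses Welford's running-variance method as a stand-in, and the class and method names are hypothetical.

```python
import math
from typing import Optional, Tuple


class ReferenceRange:
    """Running reference range [m - k*SE, m + k*SE] (assumed symbol meanings)."""

    def __init__(self, k: float = 1.96) -> None:
        self.k = k          # assumed: 1.96 for a 95% confidence interval
        self.n = 0          # number of values used so far
        self.mean = 0.0     # m: running average
        self._m2 = 0.0      # sum of squared deviations from the mean

    def bounds(self) -> Optional[Tuple[float, float]]:
        """Return (low, high), or None until two values have been seen."""
        if self.n < 2:
            return None
        variance = self._m2 / (self.n - 1)
        se = math.sqrt(variance / self.n)  # standard error of the mean
        return (self.mean - self.k * se, self.mean + self.k * se)

    def contains(self, value: float) -> bool:
        """True when the value is inside the range (or no range exists yet)."""
        b = self.bounds()
        return b is None or b[0] <= value <= b[1]

    def update(self, value: float) -> None:
        """Fold one value into the running statistics (Welford's method)."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)
```

In keeping with the description, a caller would invoke `update` only for values for which `contains` returns true, so that out-of-range speaking speeds do not shift the reference range.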
In the operation example of
In the second embodiment, the acoustic analysis part 101 calculates the speaking speed of the transmitted voice. In a third embodiment to be hereinafter described, the acoustic analysis part 101 calculates the pitch frequency. Hereinafter, the configuration of the third embodiment is similar to
For example, when a person exhales a large amount of air from the lungs in order to raise his/her voice in a noisy environment, the vibration frequency of the vocal cords increases, so that the voice naturally becomes high-pitched. Thus, in the third embodiment, when the pitch frequency increases, the receiving volume is increased, whereby the received voice is made easy to hear.
A processing for calculating the pitch frequency of a transmitted voice in the acoustic analysis part 101 is illustrated as follows.
Pitch=freq/a_max (6),
wherein the meanings of the symbols in the formulae (5) and (6) are as follows:
As described above, the acoustic analysis part 101 calculates the correlation coefficient of the signal of the transmitted voice and divides the sampling frequency by the shifting position a corresponding to the maximum value of the correlation coefficient, whereby the pitch frequency is calculated.
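The calculation of formula (6) can be sketched in Python as below. Formula (5) is not reproduced in this text, so the normalized autocorrelation used here is an assumption, as are the function names and the search bounds `a_low` and `a_high` for the shifting position; `freq` and `a_max` follow the symbols of formula (6).

```python
import math
from typing import Sequence


def correlation(samples: Sequence[float], a: int) -> float:
    """Normalized correlation between the signal and itself shifted by a."""
    n = len(samples) - a
    num = sum(samples[i] * samples[i + a] for i in range(n))
    e0 = sum(samples[i] ** 2 for i in range(n))
    e1 = sum(samples[i + a] ** 2 for i in range(n))
    den = math.sqrt(e0 * e1)
    return num / den if den > 0.0 else 0.0


def pitch_frequency(samples: Sequence[float], freq: float,
                    a_low: int, a_high: int) -> float:
    """Pitch = freq / a_max, where a_max maximizes the correlation."""
    a_max = max(range(a_low, a_high + 1),
                key=lambda a: correlation(samples, a))
    return freq / a_max


# A 200 Hz sine sampled at 8000 Hz has a period of 40 samples.
tone = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(400)]
print(pitch_frequency(tone, 8000.0, a_low=20, a_high=60))
```

Restricting the search to a plausible range of shifting positions keeps the maximum from landing on a multiple of the true pitch period.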
The reference range calculation part 102 illustrated in
Subsequently, the comparing part 103 compares the pitch frequency calculated by the acoustic analysis part 101 and the reference range of the pitch frequency calculated by the reference range calculation part 102 and outputs the comparison result.
Based on the comparison result obtained by the comparing part 103, the voice processing part 104 then applies a specific processing treatment to the signal of the input received voice, so that the received voice is processed to be easy to hear, and the voice processing part 104 then outputs the processed received voice. The specific processing treatment includes, for example, sound volume changes, speaking speed conversion, and/or pitch conversion processing.
In a fourth embodiment to be hereinafter described, the acoustic analysis part 101 calculates a slope of the power spectrum. Hereinafter, the configuration of the fourth embodiment is similar to
According to the fourth embodiment, when a speaker wants to reduce a sound volume of the received voice, the speaker, for example, speaks in a muffled voice, whereby a high-frequency component is reduced, and the slope of the power spectrum is increased. Consequently, control may be performed so that the receiving volume is reduced.
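One possible way to obtain a slope of the power spectrum is sketched below: a naive discrete Fourier transform gives the power in dB per frequency bin, and a least-squares line is fitted over frequency. This is an assumption for illustration, not the procedure of this description, and the function names are hypothetical. With the sign convention used here (dB per Hz), a muffled voice with less high-frequency energy yields a more negative, that is steeper, slope.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum_db(frame: Sequence[float]) -> List[float]:
    """Power in dB for bins 1 .. N/2-1, using a naive DFT (small frames only)."""
    n = len(frame)
    levels = []
    for k in range(1, n // 2):
        x = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        power = abs(x) ** 2 / n
        levels.append(10.0 * math.log10(power + 1e-12))  # floor avoids log(0)
    return levels


def spectral_slope(frame: Sequence[float], freq: float) -> float:
    """Least-squares slope of level (dB) against frequency (Hz)."""
    n = len(frame)
    levels = power_spectrum_db(frame)
    freqs = [k * freq / n for k in range(1, n // 2)]
    mean_f = sum(freqs) / len(freqs)
    mean_l = sum(levels) / len(levels)
    cov = sum((f - mean_f) * (l - mean_l) for f, l in zip(freqs, levels))
    var = sum((f - mean_f) ** 2 for f in freqs)
    return cov / var
```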
The processing of calculating the slope of the power spectrum of a transmitted voice in the acoustic analysis part 101 is illustrated as follows:
The reference range calculation part 102 illustrated in
Subsequently, the comparing part 103 compares the slope of the power spectrum calculated by the acoustic analysis part 101 and the reference range of the slope of the power spectrum calculated by the reference range calculation part 102 and outputs the comparison result.
Based on the comparison result obtained by the comparing part 103, the voice processing part 104 then applies a specific processing treatment to the signal of the input received voice, so that the received voice is processed to be easy to hear, and the voice processing part 104 then outputs the processed received voice. The specific processing treatment includes, for example, sound volume changes, speaking speed conversion, and/or pitch conversion processing.
In a fifth embodiment to be hereinafter described, the acoustic analysis part 101 calculates an interval of a transmitted voice. Hereinafter, the configuration of the fifth embodiment is similar to
According to the fifth embodiment, when a speaker wants to increase the sound volume of a received voice, the speaker, for example, speaks with intervals (pauses); the interval is detected, and control may be performed so that the receiving volume is increased.
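A minimal sketch of interval detection is given below, assuming that an interval is a run of consecutive frames whose average power falls below a threshold. The threshold rule, the function names, and the frame duration in milliseconds are assumptions for illustration and are not taken from this description.

```python
from typing import List, Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Average power of one frame."""
    return sum(s * s for s in frame) / len(frame)


def interval_lengths(frames: Sequence[Sequence[float]],
                     power_threshold: float,
                     frame_duration_ms: int) -> List[int]:
    """Return the length in ms of each run of low-power frames."""
    lengths: List[int] = []
    run = 0
    for frame in frames:
        if frame_power(frame) < power_threshold:
            run += 1
        else:
            if run:
                lengths.append(run * frame_duration_ms)
            run = 0
    if run:  # an interval that lasts until the end of the input
        lengths.append(run * frame_duration_ms)
    return lengths
```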
The processing of calculating the interval of the transmitted voice in the acoustic analysis part 101 is illustrated as follows.
The reference range calculation part 102 illustrated in
Subsequently, the comparing part 103 compares the length of the interval calculated by the acoustic analysis part 101 and the reference range of the length of the interval calculated by the reference range calculation part 102 and outputs the comparison result. Based on the comparison result calculated by the comparing part 103, the voice processing part 104 then applies specific processing treatment to the signal of the input received voice, so that the received voice is processed to be easy to hear, and the voice processing part 104 then outputs the processed received voice. The specific processing treatment includes, for example, sound volume changes, speaking speed conversion, and/or pitch conversion processing.
In the second embodiment described above, the voice processing part 104 changes the sound volume of the received voice. In a sixth embodiment to be hereinafter described, the voice processing part 104 changes the speaking speed. Hereinafter, the configuration of the sixth embodiment is similar to
The speaking speed of a signal of a received voice changed by the voice processing part 104 may be realized by the configuration disclosed in, for example, Japanese Patent Laid-Open Publication No. 7-181998. Specifically, processing such that a time axis of a received voice waveform is compressed to increase the speaking speed is realized by the following configuration.
Namely, a pitch extraction part extracts a pitch period T from an input voice waveform, which is a received voice. A time-axis compression part creates and outputs a compression voice waveform from the input voice waveform based on the following first to sixth processes.
Meanwhile, the processing of expanding the time axis of the received voice waveform and reducing the speaking speed is realized by the following configuration.
Namely, the pitch extraction part extracts the pitch period T from the input voice waveform, which is a received voice. A time-axis expansion part creates and outputs an expansion voice waveform from the input voice waveform based on the following first to fifth processes.
In the second embodiment described above, the voice processing part 104 changes the sound volume of the received voice, and in the sixth embodiment described above, the voice processing part 104 changes the speaking speed of the received voice. In a seventh embodiment to be hereinafter described, the voice processing part 104 changes the pitch frequency. Hereinafter, the configuration of the seventh embodiment is similar to
The pitch frequency of a signal of a received voice changed by the voice processing part 104 may be realized by the configuration disclosed in, for example, Japanese Patent Laid-Open Publication No. 10-78791.
Specifically, a first pitch conversion part cuts out a phoneme waveform from a voice waveform, which is a received voice, and repeatedly outputs the phoneme waveform with a period corresponding to a first control signal.
A second pitch conversion part is connected to the input or output side of the first pitch conversion part, and the voice waveform is expanded and output in the time axis direction at a rate corresponding to a second control signal.
A control part then determines a desired pitch conversion ratio S0 and a desired formant frequency conversion ratio F0 based on the output of the comparing part 103, and gives the conversion ratio F0 to the second pitch conversion part as the second control signal. The control part further gives the first pitch conversion part, as the first control signal, a signal that instructs output with a period corresponding to S0/F0.
In the second embodiment described above, the voice processing part 104 changes the sound volume of the received voice. In the sixth embodiment described above, the voice processing part 104 changes the speaking speed of the received voice. In the seventh embodiment described above, the voice processing part 104 changes the pitch frequency of the received voice. In an eighth embodiment to be hereinafter described, the voice processing part 104 changes the length of the interval of the signal of a received voice. Hereinafter, the configuration of the eighth embodiment is similar to
The length of the interval of the signal of the received voice may be changed by the voice processing part 104 as follows, for example. Namely, the length of the interval of the received voice is changed by further addition of the interval after termination of the interval of the received voice. According to this configuration, a time delay occurs in the output of the next received voice; however, a long interval which is caused by the intake of a breath and is not less than a certain period of time is reduced, whereby the time delay is recovered.
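The trade-off described above, lengthening ordinary intervals while recovering the accumulated delay from long breath intervals, can be sketched as follows. The segment representation, the function name, the fixed extension `extra_ms`, and the rule that a long interval is never shortened below `long_pause_ms` are all assumptions for illustration.

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # (kind, duration in ms); kind is "speech" or "pause"


def adjust_intervals(segments: List[Segment], extra_ms: int,
                     long_pause_ms: int) -> Tuple[List[Segment], int]:
    """Lengthen short pauses and shorten long ones to recover the delay.

    Returns the adjusted segments and the delay still outstanding at the end.
    """
    delay_ms = 0
    adjusted: List[Segment] = []
    for kind, duration_ms in segments:
        if kind == "pause":
            if duration_ms >= long_pause_ms:
                # Long (breath) pause: cut it, but not below the threshold.
                cut = min(delay_ms, duration_ms - long_pause_ms)
                duration_ms -= cut
                delay_ms -= cut
            else:
                # Ordinary pause: append extra silence, which delays the output.
                duration_ms += extra_ms
                delay_ms += extra_ms
        adjusted.append((kind, duration_ms))
    return adjusted, delay_ms
```

When the returned delay is zero, the total duration of the adjusted segments equals that of the input, so the output has caught up with the incoming voice.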
In the second embodiment described above, the voice processing part 104 changes the sound volume of the received voice. In the sixth embodiment described above, the voice processing part 104 changes the speaking speed of the received voice. In the seventh embodiment described above, the voice processing part 104 changes the pitch frequency of the received voice. In the eighth embodiment, the voice processing part 104 changes the length of the interval of the signal of the received voice. In a ninth embodiment to be hereinafter described, the voice processing part 104 changes the slope of the power spectrum of the signal of a received voice. Hereinafter, the configuration of the ninth embodiment is similar to
The slope of the power spectrum of the signal of a received voice may be changed by the voice processing part 104 as follows, for example.
In the first to ninth embodiments, the received voice is processed to be made easy to hear in accordance with the feature quantity of the input transmitted voice; however, a previously recorded and stored voice is processed in accordance with the feature quantity of the transmitted voice of a user, whereby the stored voice may also be made easy to hear when reproduced.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.

Number | Date | Country | Kind |
---|---|---|---|
2008-313607 | Dec 2008 | JP | national |

Number | Name | Date | Kind |
---|---|---|---|
5781885 | Inoue et al. | Jul 1998 | A |
7672846 | Washio et al. | Mar 2010 | B2 |

Number | Date | Country |
---|---|---|
59-216242 | Dec 1984 | JP |
6-252987 | Sep 1994 | JP |
7-181998 | Jul 1995 | JP |
9-152890 | Jun 1997 | JP |
10-78791 | Mar 1998 | JP |
2004-219506 | Aug 2004 | JP |

Number | Date | Country |
---|---|---|
20100082338 A1 | Apr 2010 | US |