This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2012-270916 filed on Dec. 12, 2012, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to, for example, a voice processing device configured to control an input signal, a voice processing method, and a voice processing program.
A method is known to control a voice signal given as an input signal such that the voice signal is easy to listen to. For example, for aged people, a voice recognition ability may be degraded due to a reduction in hearing ability or the like with aging. Therefore, it tends to become difficult for aged people to hear voices when a talker speaks at a high speech rate in a two-way voice communication using a portable communication terminal or the like. The simplest way to handle the above situation is for a talker to speak “slowly” and “clearly”, as disclosed, for example, in Tomono Miki et al., “Development of Radio and Television Receiver with Speech Rate Conversion Technology”, CASE#10-03, Institute of Innovation Research, Hitotsubashi University, April, 2010. In other words, it is effective for a talker to speak slowly word by word with a clear pause between words and between phrases. However, in two-way voice communications, it may be difficult to ask a talker, who usually speaks fast, to intentionally speak “slowly” and “clearly”. In view of the above situation, for example, Japanese Patent No. 4460580 discloses a technique in which voice segments of a received voice signal are detected and extended to improve audibility thereof, and furthermore, non-voice segments are shortened to reduce a delay caused by the extension of voice segments. More specifically, when an input signal is given, a voice segment, that is, an active speech segment, and a non-voice segment, that is, a non-speech segment, in the given input signal are detected, and voice samples included in the voice segment are repeated periodically, thereby lowering the speech rate without changing the speech pitch of a received voice and thus making the voice easier to listen to.
Furthermore, by shortening a non-voice segment between voice segments, it is possible to minimize a delay caused by the extension of the voice segments so as to suppress sluggishness resulting from the extension of the voice segments thereby allowing the two-way voice communication to be natural.
In accordance with an aspect of the embodiments, a voice processing device includes: a processor; and a memory which stores a plurality of instructions, which when executed by the processor, cause the processor to execute, receiving a first signal including a plurality of voice segments; controlling such that a non-voice segment with a length equal to or greater than a predetermined first threshold value exists between at least one of the plurality of voice segments; and outputting a second signal including the plurality of voice segments and the controlled non-voice segment.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
The voice processing device disclosed in the present description is capable of improving the easiness for a listener to hear a voice.
These and/or other aspects and advantages will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawing of which:
Embodiments of a voice processing device, a voice processing method, and a voice processing program are described in detail below with reference to drawings. Note that the embodiments described below are only for illustration and not for limitation.
In the above-described method of controlling the speech rate, only a reduction in speech rate is taken into account, and no consideration is given to improving the clarity of voices by making a clear pause in speech, and thus the above-described method is not sufficient in terms of improvement in audibility. Furthermore, in the above-described technique of controlling the speech rate, non-voice segments are simply reduced regardless of whether there is ambient noise on a near-end side where a listener is located. However, in a case where a two-way communication is performed in a situation in which a listener is in a noisy environment (in which there is ambient noise), the ambient noise may make it difficult to hear a voice.
In view of the above, the inventors have contemplated factors that may make it difficult to hear voices in two-way communications in an environment in which there is noise at a receiving side where a near-end signal is generated, as described below. As illustrated in
First Embodiment
The receiving unit 2 is realized, for example, by a wired logic hardware circuit. Alternatively, the receiving unit 2 may be a function module realized by a computer program executed in the voice processing device 1. The receiving unit 2 acquires, from the outside, a near-end signal transmitted from a receiving side (a user of the voice processing device 1) and a first remote-end signal including an uttered voice transmitted from a transmitting side (a person communicating with the user of the voice processing device 1). The receiving unit 2 may receive the near-end signal, for example, from a microphone (not illustrated) connected to or disposed in the voice processing device 1. The receiving unit 2 may receive the first remote-end signal via a wired or wireless circuit, and may decode the first remote-end signal using a decoder unit (not illustrated) connected to or disposed in the voice processing device 1. The receiving unit 2 outputs the received first remote-end signal to the detection unit 3 and the control unit 5. The receiving unit 2 outputs the received near-end signal to the calculation unit 4. Here, it is assumed by way of example that the first remote-end signal and the near-end signal are input to the receiving unit 2 in units of frames each having a length of about 10 to 20 milliseconds and each including a particular number of voice samples (or ambient noise samples). The near-end signal may include ambient noise at the receiving side.
The detection unit 3 is realized, for example, by a wired logic hardware circuit. Alternatively, the detection unit 3 may be a function module realized by a computer program executed in the voice processing device 1. The detection unit 3 receives the first remote-end signal from the receiving unit 2. The detection unit 3 detects a non-voice segment length and a voice segment length included in the first remote-end signal. The detection unit 3 may detect a non-voice segment length and a voice segment length, for example, by determining whether each frame in the first remote-end signal is in a voice segment or a non-voice segment. An example of a method of determining whether a given frame is a voice segment or a non-voice segment is to subtract an average power of input voice sample calculated for past frames from a voice sample power of the current frame thereby determining a difference in power, and compare the difference in power with a threshold value. When the difference is equal to or greater than the threshold value, the current frame is determined as a voice segment, but when the difference is smaller than the threshold value, the current frame is determined as a non-voice segment. The detection unit 3 may add associated information to the detected voice segment length and the non-voice segment length in the first remote-end signal. More specifically, for example, the detection unit 3 may add associated information to the detected voice segment length in the first remote-end signal such that a frame number f(i) of a frame included in the voice segment length and a flag of voice activity detection (hereinafter referred to as flag vad) set to 1 (flag vad=1) to indicate that the frame is in the voice segment are added to the voice segment length. 
The detection unit 3 may add associated information to the detected non-voice segment length in the first remote-end signal such that a frame number f(i) of a frame included in the non-voice segment length and a flag vad set to 0 (flag vad=0) to indicate that the frame is in the non-voice segment are added to the non-voice segment length. As for the method of detecting a voice segment and a non-voice segment in a given frame, various known methods may be used. For example, a method disclosed in Japanese Patent No. 4460580 may be employed. The detection unit 3 outputs the detected voice segment length and the non-voice segment length in the first remote-end signal to the control unit 5.
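By way of illustration only, the frame-by-frame determination described above may be sketched as follows. The function names, the use of mean-square power expressed in dB, the threshold of 6 dB, and the history of 20 past frames are all assumptions introduced for this sketch and are not taken from the embodiment.

```python
import math


def frame_power_db(frame):
    """Mean-square power of one frame in dB (a small floor avoids log(0))."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)


def detect_vad_flags(frames, threshold_db=6.0, history=20):
    """Return (frame number, flag vad) pairs for a sequence of frames.

    threshold_db and history are hypothetical values chosen for illustration.
    """
    past_powers = []
    flags = []
    for i, frame in enumerate(frames):
        power = frame_power_db(frame)
        # Average power over the available past frames. The first frame has
        # no history, so it is compared with itself and marked non-voice.
        recent = past_powers[-history:]
        average = sum(recent) / len(recent) if recent else power
        # flag vad = 1 when the power difference reaches the threshold.
        vad = 1 if power - average >= threshold_db else 0
        flags.append((i, vad))
        past_powers.append(power)
    return flags
```

In this sketch, runs of consecutive frames with flag vad=1 would correspond to a voice segment length, and runs with flag vad=0 to a non-voice segment length.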
The calculation unit 4 is realized, for example, by a wired logic hardware circuit. Alternatively, the calculation unit 4 may be a function module realized by a computer program executed in the voice processing device 1. The calculation unit 4 receives the near-end signal from the receiving unit 2. The calculation unit 4 calculates a noise characteristic value of ambient noise included in the near-end signal. The calculation unit 4 outputs the calculated noise characteristic value of the ambient noise to the control unit 5.
An example of a method of calculating the noise characteristic value of ambient noise by the calculation unit 4 is described below. First, the calculation unit 4 calculates near-end signal power (S(i)) from the near-end signal (Sin). For example, in a case where each frame of the near-end signal (Sin) includes 160 samples (with a sampling rate of 8 kHz), the calculation unit 4 calculates the near-end signal power (S(i)) according to a formula (1) described below.
Next, the calculation unit 4 calculates the average near-end signal power (S_ave(i)) from the near-end signal power (S(i)) of the current frame (i-th frame). For example, the calculation unit 4 calculates the average near-end signal power (S_ave(i)) for past 20 frames according to a formula (2) described below.
The calculation unit 4 then compares the difference near-end signal power (S_dif(i)) defined by the difference between the near-end signal power (S(i)) and the average near-end signal power (S_ave(i)) with an ambient noise level threshold value (TH_noise). When the difference near-end signal power (S_dif(i)) is equal to or greater than the ambient noise level threshold value (TH_noise), the calculation unit 4 determines that the near-end signal power (S(i)) indicates an ambient noise value (N). Herein, the ambient noise value (N) may be referred to as a noise characteristic value of the ambient noise. The ambient noise level threshold value (TH_noise) may be set to an arbitrary value in advance such that, for example, TH_noise=3 dB.
In a case where the difference near-end signal power (S_dif(i)) is equal to or greater than the ambient noise level threshold value (TH_noise), the calculation unit 4 may update the ambient noise value (N) using a formula (3) described below.
N(i)=N(i−1) (3)
On the other hand, in a case where the difference near-end signal power (S_dif(i)) is smaller than the ambient noise level threshold value (TH_noise), the calculation unit 4 may update the ambient noise value (N) using a formula (4) described below.
N(i)=α×S(i)+(1−α)×N(i−1) (4)
where α is an arbitrarily defined particular value in a range from 0 to 1. For example, α=0.1. An initial value N(0) of the ambient noise value (N) may also be set arbitrarily to a particular value, such as, for example, N(0)=0.
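The update of the ambient noise value (N) by formulas (3) and (4) may be sketched, for example, as follows. Because formulas (1) and (2) are not reproduced in this text, the sketch assumes that the near-end signal power is the mean-square power of a frame expressed in dB and that the average is taken over up to 20 past frames excluding the current frame; these choices and all function names are assumptions made only for illustration.

```python
import math
from collections import deque

TH_NOISE_DB = 3.0  # example value of TH_noise given in the text
ALPHA = 0.1        # example value of alpha given in the text
HISTORY = 20       # number of past frames averaged, as in the text


def frame_power_db(frame):
    """Assumed stand-in for formula (1): mean-square power in dB."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)


def track_ambient_noise(frames, n0=0.0):
    """Return the ambient noise value N(i) after each frame."""
    history = deque(maxlen=HISTORY)
    noise = n0  # initial value N(0)
    estimates = []
    for frame in frames:
        s = frame_power_db(frame)
        # Assumed stand-in for formula (2): average over past frames.
        s_ave = sum(history) / len(history) if history else s
        s_dif = s - s_ave
        if s_dif < TH_NOISE_DB:
            # Formula (4): N(i) = alpha * S(i) + (1 - alpha) * N(i-1)
            noise = ALPHA * s + (1.0 - ALPHA) * noise
        # Otherwise formula (3) applies: N(i) = N(i-1), so noise is held.
        estimates.append(noise)
        history.append(s)
    return estimates
```

In this sketch, a frame whose power rises sharply above the recent average leaves the estimate unchanged, whereas steady frames pull the estimate gradually toward the current power.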
The control unit 5 illustrated in
The process of controlling the first remote-end signal by the control unit 5 is described in further detail below.
In
As illustrated in the relationship diagram in
In
The generation unit 8 generates control information #1 (ctrl-1) based on the voice segment length, the non-voice segment length, the control amount (non_sp) of the non-voice segment length, and the delay, and the generation unit 8 outputs the generated control information #1 (ctrl-1), the voice segment length, and the non-voice segment length to the processing unit 9. Next, the process of producing the control information #1 (ctrl-1) by the generation unit 8 is described below. For the voice segment length, the generation unit 8 generates the control information #1 (ctrl-1) as ctrl-1=0. Note that when ctrl-1=0, the control processing including the extension or the reduction is not performed on the first remote-end signal. On the other hand, for the non-voice segment length, the generation unit 8 generates the control information #1 (ctrl-1) based on the control amount (non_sp) received from the determination unit 7, for example, such that ctrl-1=non_sp. In a case where, in the non-voice segment length, the delay is greater than an upper limit (delay_max) that may be arbitrarily determined in advance, the generation unit 8 may set the control information #1 (ctrl-1) such that ctrl-1=0 so that the delay is not further increased. The upper limit (delay_max) may be set to a value that is subjectively regarded as allowable in the two-way voice communication. For example, the upper limit (delay_max) may be set to 1 second.
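The rule for producing the control information #1 (ctrl-1) may be sketched, for example, as follows. The function name and argument names are hypothetical, and the control amount (non_sp) is treated as an opaque value whose sign convention is not assumed here.

```python
DELAY_MAX_SECONDS = 1.0  # example upper limit (delay_max) from the text


def generate_ctrl_1(vad, non_sp, delay, delay_max=DELAY_MAX_SECONDS):
    """Return ctrl-1 for one segment.

    vad    -- 1 for a voice segment, 0 for a non-voice segment
    non_sp -- control amount received from the determination unit
    delay  -- current delay in seconds reported by the processing unit
    """
    if vad == 1:
        # Voice segment: no extension or reduction is performed.
        return 0.0
    if delay > delay_max:
        # Delay already exceeds the upper limit: do not increase it further.
        return 0.0
    # Non-voice segment within the delay budget: ctrl-1 = non_sp.
    return non_sp
```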
The processing unit 9 receives the control information #1 (ctrl-1), the voice segment length, and the non-voice segment length from the generation unit 8. The processing unit 9 also receives the first remote-end signal that is input to the control unit 5 from the receiving unit 2. The processing unit 9 outputs the above-described delay to the generation unit 8. The processing unit 9 controls the first remote-end signal where the control includes reducing or increasing of the non-voice segment.
If the processing unit 9 inserts a non-voice segment in the first remote-end signal, part of the original first remote-end signal is delayed before being output. In view of this, the processing unit 9 may store a frame whose output is to be delayed in a buffer (not illustrated) or a memory (not illustrated) in the processing unit 9. In a case where the delay is estimated to be greater than a predetermined upper limit (delay_max), the extending of the non-voice segment may not be performed. On the other hand, in a case where there is a continuous non-voice segment length equal to or greater than a particular value (for example, 10 seconds), the processing unit 9 may perform a process of reducing the non-voice segment (described later) to reduce the non-voice segment length, which may reduce the generated delay.
The reducing of the non-voice segment length by the processing unit 9 results in a partial removal of the first remote-end signal, which provides an advantageous effect that the delay is reduced. However, there is a possibility that when the removed non-voice segment is equal to or greater than a particular value, a top or an end of a voice segment is lost. To handle such a situation, the processing unit 9 may calculate a time length of the continuous non-voice state since the beginning thereof to the current point of time, and store the calculated value in a buffer (not illustrated) or a memory (not illustrated) in the processing unit 9. Based on the calculated value, the processing unit 9 may control the reduction of the non-voice segment length such that the continuous non-voice time is not smaller than a particular value (for example, 0.1 seconds). Note that the processing unit 9 may vary the reduction ratio or the extension ratio of the non-voice segment depending on the age and/or the hearing ability of a user at the near-end side.
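The limit on the reduction described above may be sketched, for example, as follows. The function name and arguments are hypothetical; the sketch only shows one way of keeping the continuous non-voice time at or above a minimum value such as 0.1 seconds.

```python
def removable_non_voice_seconds(continuous_non_voice, requested_reduction,
                                min_non_voice=0.1):
    """Return how many seconds may be removed from a non-voice run.

    continuous_non_voice -- length in seconds of the continuous non-voice state
    requested_reduction  -- reduction in seconds that would otherwise be applied
    min_non_voice        -- non-voice time that must remain (example: 0.1 s)
    """
    # Only the part of the run exceeding the minimum may be removed.
    spare = max(0.0, continuous_non_voice - min_non_voice)
    return min(max(0.0, requested_reduction), spare)
```

Under these assumptions, a request to remove 0.3 seconds from a 0.25-second non-voice run would be limited to about 0.15 seconds, so that roughly 0.1 seconds of non-voice time remains before the next voice segment.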
In
When the detection unit 3 receives the first remote-end signal from the receiving unit 2, the detection unit 3 detects a non-voice segment length and a voice segment length in the first remote-end signal (step S802). The detection unit 3 outputs the detected non-voice segment length and voice segment length in the first remote-end signal to the control unit 5.
When the calculation unit 4 receives the near-end signal from the receiving unit 2, the calculation unit 4 calculates a noise characteristic value of ambient noise included in the near-end signal (step S803). The calculation unit 4 outputs the calculated noise characteristic value of the ambient noise to the control unit 5. Hereinafter, the near-end signal will also be referred to as a third signal.
The control unit 5 receives the first remote-end signal from the receiving unit 2, the voice segment length and the non-voice segment length in the first remote-end signal from the detection unit 3, and the noise characteristic value from the calculation unit 4. The control unit 5 controls the first remote-end signal based on the voice segment length, the non-voice segment length, and the noise characteristic value, and the control unit 5 outputs a resultant signal as a second remote-end signal to the output unit 6 (step S804).
The output unit 6 receives the second remote-end signal from the control unit 5, and the output unit 6 outputs the second remote-end signal as an output signal to the outside (step S805).
The receiving unit 2 determines whether the receiving of the first remote-end signal is still being continuously performed (step S806). In a case where the receiving unit 2 is no longer continuously receiving the first remote-end signal (No, in step S806), the voice processing device 1 ends the voice processing illustrated in the flow chart of the
Thus, the voice processing device according to the first embodiment is capable of improving the easiness for a listener to hear a voice.
Second Embodiment
In
In the two-way voice communication, the greater the noise in the first remote-end signal, the more the easiness of hearing at the receiving side may be reduced. In the voice processing device 1 according to the second embodiment, the adjustment amount is controlled in the above-described manner thereby improving the easiness for a listener to hear a voice.
Third Embodiment
In
Note that when ctrl-2=0, the control processing including the extension or the reduction is not performed on the voice segment of the first remote-end signal. For the voice segment length, the generation unit 8 generates the control information #2 (ctrl-2) such that, for example, ctrl-2=er where er indicates the extension ratio of the voice segment. Note that even for the voice segment length, the generation unit 8 may generate the control information #2 (ctrl-2) such that ctrl-2=0 depending on the delay. The generation unit 8 outputs the resultant control information #2 (ctrl-2) to the processing unit 9. Next, a process of determining the extension ratio of the voice segment length is described below.
As described above, when the speech rate is high (that is, the number of moras per unit time is large), this may cause a reduction in easiness for aged people to hear a speech. When there is ambient noise, a received voice may be masked by the ambient noise, which may cause a reduction in listening easiness for listeners regardless of whether the listeners are old or not old. In particular, in a situation in which a speech is made at a high speech rate in a circumstance where there is ambient noise, the high speech rate and the ambient noise lead to a synergetic effect that causes a great reduction in the listening easiness for aged people. On the other hand, in the two-way voice communication, if voice segments are increased without limitation, an increase in delay occurs which makes it difficult to communicate. In view of the above, the relationship diagram in
In
In the voice processing device according to the third embodiment, in addition to controlling non-voice segment lengths, voice segment lengths are controlled depending on ambient noise thereby improving the easiness for a listener to hear a voice.
Fourth Embodiment
In the voice processing device 1 illustrated in
The detection unit 3 receives the first remote-end signal from the receiving unit 2, and detects a non-voice segment length and a voice segment length in the first remote-end signal. The detection unit 3 may detect the non-voice segment length and the voice segment length in a similar manner as in the first embodiment, and thus a further description thereof is omitted. The detection unit 3 outputs the detected voice segment length and non-voice segment length in the first remote-end signal to the control unit 5.
The control unit 5 receives the first remote-end signal from the receiving unit 2, and the voice segment length and the non-voice segment length in the first remote-end signal from the detection unit 3. The control unit 5 controls the first remote-end signal based on the voice segment length and the non-voice segment length and outputs a resultant signal as a second remote-end signal to the output unit 6. More specifically, the control unit 5 determines whether the non-voice segment length is equal to or greater than a first threshold value above which it is possible for the listener at the receiving side to distinguish between words represented by respective voice segments. In a case where the non-voice segment length is smaller than the first threshold value, the control unit 5 controls the non-voice segment length such that the non-voice segment length is equal to or greater than the first threshold value. The first threshold value may be determined experimentally, for example, using a subjective evaluation. More specifically, for example, the first threshold value may be set to 0.2 seconds. Alternatively, the control unit 5 may analyze words in a voice segment using a known technique, and may control a period between words so as to be equal to or greater than the first threshold value, thereby achieving an improvement in listening easiness for the listener.
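The control of the non-voice segment length against the first threshold value may be sketched, for example, as follows. The sketch operates on a sequence of per-frame flag vad values rather than on audio samples, and the function name, the 20-millisecond frame length, and the treatment of leading and trailing non-voice runs are assumptions made only for illustration.

```python
def enforce_min_pause(vad_flags, frame_ms=20, first_threshold_ms=200):
    """Extend each non-voice run lying between two voice segments so that it
    spans at least first_threshold_ms; return the new flag sequence."""
    # Smallest number of frames covering the threshold (integer ceiling).
    min_frames = -(-first_threshold_ms // frame_ms)
    out = []
    i = 0
    n = len(vad_flags)
    while i < n:
        if vad_flags[i] == 1:
            out.append(1)
            i += 1
            continue
        # Find the end of the current non-voice run.
        j = i
        while j < n and vad_flags[j] == 0:
            j += 1
        run = j - i
        # Only a run with voice on both sides is a pause between segments.
        between_voice = i > 0 and j < n
        if between_voice and run < min_frames:
            run = min_frames
        out.extend([0] * run)
        i = j
    return out
```

For example, with 20-millisecond frames, a two-frame pause between two voice segments would be extended to ten frames (0.2 seconds), while non-voice runs at the start or end of the sequence are left unchanged in this sketch.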
As described above, in the voice processing device according to the fourth embodiment, the non-voice segment length is properly controlled to increase the easiness for the listener to hear voices.
Fifth Embodiment
The control unit 21 is a CPU that controls the units in the computer and also performs operations, processing, and the like on data. The control unit 21 also functions as an operation unit that executes a program stored in the main storage unit 22 or the auxiliary storage unit 23. That is, the control unit 21 receives data from the input unit 27 or the storage apparatus and performs an operation or processing on the received data. A result is output to the display unit 28, the storage apparatus, or the like.
The main storage unit 22 is a storage device such as a ROM, a RAM, or the like configured to store or temporarily store an operating system (OS), which is basic software, a program such as application software, and data, for use by the control unit 21.
The auxiliary storage unit 23 is a storage apparatus such as an HDD or the like, configured to store data associated with the application software or the like.
The drive device 24 reads a program from a storage medium 25 such as a flexible disk and installs the program in the auxiliary storage unit 23.
A particular program may be stored in the storage medium 25, and the program stored in the storage medium 25 may be installed in the voice processing device 1 via the drive device 24 such that the installed program may be executed by the voice processing device 1.
The network I/F unit 26 functions as an interface between the voice processing device 1 and a peripheral device having a communication function and connected to the voice processing device 1 via a network such as a local area network (LAN), a wide area network (WAN), or the like built using a wired or wireless data transmission line.
The input unit 27 includes a keyboard having cursor keys, numeric keys, various function keys, and the like, and a mouse or a slide pad for selecting a key on a display screen of the display unit 28. The input unit 27 functions as a user interface that allows a user to input an operation command or data to the control unit 21.
The display unit 28 may include a cathode ray tube (CRT), a liquid crystal display (LCD) or the like and is configured to display information according to display data input from the control unit 21.
The voice processing method described above may be realized by a program executed by a computer. That is, the voice processing method may be realized by installing the program from a server or the like and executing the program by the computer.
The program may be stored in the storage medium 25 and the program stored in the storage medium 25 may be read by a computer, a portable communication device, or the like thereby realizing the voice processing described above. The storage medium 25 may be of various types. Specific examples include a storage medium such as a CD-ROM, a flexible disk, a magneto-optical disk or the like capable of storing information optically, electrically, or magnetically, a semiconductor memory such as a ROM, a flash memory, or the like, capable of electrically storing information, and so on.
Sixth Embodiment
The antenna 31 transmits a wireless transmission signal amplified by a transmission amplifier, and receives a wireless reception signal from a base station. The wireless transmission/reception unit 32 performs a digital-to-analog conversion on a transmission signal spread by the baseband processing unit 33 and converts a resultant signal into a high-frequency signal by orthogonal modulation, and furthermore amplifies the high-frequency signal by a power amplifier. The wireless transmission/reception unit 32 amplifies the received wireless reception signal and performs an analog-to-digital conversion on the amplified signal. A resultant signal is transmitted to the baseband processing unit 33.
The baseband processing unit 33 performs baseband processes including addition of error correction code to the transmission data, data modulation, spread modulation, inverse spread modulation of the received signal, determination of the receiving environment, determination of a threshold value of each channel signal, error correction decoding, and the like.
The control unit 21 controls a wireless transmission/reception process including controlling transmission/reception of a control signal. The control unit 21 also executes a voice processing program stored in the auxiliary storage unit 23 or the like to perform, for example, the voice processing according to the first embodiment.
The main storage unit 22 is a storage device such as a ROM, a RAM, or the like configured to store or temporarily store an operating system (OS), which is basic software, a program such as application software, and data, for use by the control unit 21.
The auxiliary storage unit 23 is a storage device such as an HDD, an SSD, or the like, configured to store data associated with the application software or the like.
The device interface unit 34 performs a process to interface with a data adapter, a handset, an external data terminal, or the like.
The microphone 35 senses an ambient sound including a voice of a talker, and outputs the sensed sound as a microphone signal to the control unit 21. The speaker 36 outputs a signal received from the control unit 21 as an output signal.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2012-270916 | Dec 2012 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
3700820 | Blasbalg | Oct 1972 | A |
4167653 | Araseki et al. | Sep 1979 | A |
5794201 | Nejime et al. | Aug 1998 | A |
6377915 | Sasaki | Apr 2002 | B1 |
8364471 | Yoon | Jan 2013 | B2 |
9142222 | Lee | Sep 2015 | B2 |
20020032571 | Leung | Mar 2002 | A1 |
20050234715 | Ozawa | Oct 2005 | A1 |
20070118363 | Sasaki et al. | May 2007 | A1 |
20090086934 | Thomas | Apr 2009 | A1 |
20090248409 | Endo et al. | Oct 2009 | A1 |
20110264447 | Visser | Oct 2011 | A1 |
20120127343 | Park | May 2012 | A1 |
20130006622 | Khalil et al. | Jan 2013 | A1 |
20140288925 | Sverrisson | Sep 2014 | A1 |
Number | Date | Country |
---|---|---|
4227826 | Feb 1993 | DE |
0 534 410 | Mar 1993 | EP |
1 515 310 | Mar 2005 | EP |
1 840 877 | Oct 2007 | EP |
2000-349893 | Dec 2000 | JP |
2001-211469 | Aug 2001 | JP |
2008-58956 | Mar 2008 | JP |
2009-75280 | Apr 2009 | JP |
4460580 | May 2010 | JP |
WO 02082428 | Oct 2002 | WO |
Entry |
---|
European Search Report issued Feb. 3, 2014 for European Application No. 13192457.3. |
Tomono Miki et al., “Development of Radio and Television Receiver with Speech Rate Conversion Technology”, CASE#10-03, p. 1-29, Institute of Innovation Research, Hitotsubashi University, Apr. 2010. |
Number | Date | Country | |
---|---|---|---|
20140163979 A1 | Jun 2014 | US |