This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2014-126828 filed on Jun. 20, 2014, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a voice processing device, a voice processing method, a voice processing program and a portable terminal apparatus, for example, for estimating an utterance time period.
Recently, with the development of information processing apparatus, conversation through a conversation application installed, for example, in a portable terminal or a personal computer has become increasingly common. When a user and the other party talk, smooth communication can be achieved by proceeding with a dialog while understanding each other's thinking. In this case, in order for the user to understand the thinking of the other party, it is considered important for the user to listen sufficiently to the utterance of the other party rather than unilaterally continuing to speak. A technology for detecting the utterance time periods of the user and the other party with a high degree of accuracy from input voices is therefore demanded in order to grasp whether or not smooth communication is being achieved. For example, by detecting the utterance time periods of the user and the other party, it can be determined whether or not a discussion is being conducted actively by both parties. Further, such detection makes it possible, in learning of a foreign language, to determine whether or not a student understands the foreign language and speaks actively. In such a situation, for example, International Publication Pamphlet No. WO 2009/145192 discloses a technology for evaluating signal quality of an input voice and estimating an utterance temporal segment on the basis of a result of the evaluation.
In accordance with an aspect of the embodiments, a voice processing device includes a memory; and a processor configured to execute a plurality of instructions stored in the memory, the instructions including acquiring a transmitted voice; first detecting a first utterance segment of the transmitted voice; second detecting a response segment from the first utterance segment; determining a frequency of the response segment included in the transmitted voice; and estimating an utterance time period of a received voice on the basis of the frequency.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
These and/or other aspects and advantages will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, of which:
In the following, working examples of a voice processing device, a voice processing method, a voice processing program and a portable terminal apparatus according to one embodiment are described in detail with reference to the drawings. It is to be noted that the working examples do not restrict the technology disclosed herein.
The acquisition unit 2 is, for example, a hardware circuit configured by hard-wired logic. The acquisition unit 2 may otherwise be a functional module implemented by a computer program executed by the voice processing device 1. The acquisition unit 2 acquires a transmission voice (in other words, a transmitted voice) that is an example of an input voice, for example, through an external apparatus. It is to be noted that the process just described corresponds to step S201 of the flow chart depicted in
The detection unit 3 is, for example, a hardware circuit configured by hard-wired logic. The detection unit 3 may otherwise be a functional module implemented by a computer program executed by the voice processing device 1. The detection unit 3 receives a transmission voice from the acquisition unit 2. The detection unit 3 detects a breath temporal segment indicative of an utterance temporal segment (which may be referred to as first utterance temporal segment or voiced temporal segment) included in the transmission voice. It is to be noted that the process just described corresponds to step S202 of the flow chart depicted in
Here, details of the detection process of an utterance temporal segment and an unvoiced temporal segment by a detection unit are described.
Referring to
S(n) = Σ_{t=n×M}^{(n+1)×M−1} c(t)² (Expression 1)
Here, n is a frame number applied successively to each of the frames from the start of input of the acoustic frames included in the transmission voice (n is an integer equal to or greater than zero); M is the time length of one frame; t is time; and c(t) is the amplitude (power) of the transmission voice.
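Purely as an illustration, a minimal Python sketch of the per-frame sound volume of (Expression 1) could look as follows; the function name, the placeholder waveform and the frame length of 160 samples are hypothetical and are not part of the disclosure.

```python
import numpy as np

def frame_volume(c, M):
    """Per-frame sound volume S(n) following (Expression 1):
    S(n) = sum of c(t)^2 for t in [n*M, (n+1)*M - 1]."""
    num_frames = len(c) // M
    return np.array([np.sum(c[n * M:(n + 1) * M] ** 2) for n in range(num_frames)])

# Usage example: an 8 kHz signal split into 20 ms frames (M = 160 samples).
signal = np.random.randn(8000)  # placeholder waveform, not real speech
S = frame_volume(signal, M=160)
```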
The noise estimation portion 10 receives the sound volume S(n) of each frame from the sound volume calculation portion 9. The noise estimation portion 10 estimates noise in each frame and outputs a result of the noise estimation to the average SNR calculation portion 11. Here, the noise estimation of each frame by the noise estimation portion 10 can be performed using, for example, a (noise estimation method 1) or a (noise estimation method 2) described below.
(Noise Estimation Method 1)
The noise estimation portion 10 can estimate the magnitude (power) N(n) of noise in a frame n using the expression given below on the basis of the sound volume S(n) in the frame n, the sound volume S(n−1) in the preceding frame (n−1) and the magnitude N(n−1) of noise.
Here, α and β are constants, which may be determined experimentally. For example, α and β may be α=0.9 and β=2.0, respectively. Also the initial value N(−1) of the noise power may be determined experimentally. In the (Expression 2) given above, the noise power N(n) of the frame n is updated when the sound volume S(n) of the frame n does not exhibit a variation equal to or greater than the fixed value β from the sound volume S(n−1) of the immediately preceding frame n−1. On the other hand, when the sound volume S(n) of the frame n exhibits a variation equal to or greater than the fixed value β from the sound volume S(n−1) of the immediately preceding frame n−1, the noise power N(n−1) of the immediately preceding frame n−1 is set as the noise power N(n) of the frame n. It is to be noted that the noise power N(n) may be referred to as the noise estimation result described above.
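Because (Expression 2) itself is not reproduced above, the sketch below only illustrates one conventional reading of the description: a first-order recursive update of the noise power that is applied while the frame-to-frame change of the sound volume stays below β, and frozen otherwise. The update rule, the function name and the choice of initial value are assumptions.

```python
def estimate_noise_v1(S, alpha=0.9, beta=2.0, n_init=None):
    """Noise power tracking in the spirit of (noise estimation method 1).

    Assumed update: N(n) = alpha * N(n-1) + (1 - alpha) * S(n) while
    |S(n) - S(n-1)| < beta; otherwise N(n) = N(n-1)."""
    noise = []
    prev_noise = S[0] if n_init is None else n_init  # assumed initial value N(-1)
    prev_s = S[0]
    for s in S:
        if abs(s - prev_s) < beta:   # no large jump in volume -> refresh the estimate
            prev_noise = alpha * prev_noise + (1.0 - alpha) * s
        # else: keep N(n) = N(n-1) unchanged
        noise.append(prev_noise)
        prev_s = s
    return noise
```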
(Noise Estimation Method 2)
The noise estimation portion 10 may perform updating of the magnitude of noise on the basis of the ratio between the sound volume S(n) of the frame n and the noise power N(n−1) of the immediately preceding frame n−1 using the expression (3) given below:
Here, γ is a constant, which may be determined experimentally. For example, γ may be γ=2.0. Also the initial value N(−1) of the noise power may be determined experimentally. If, in the (Expression 3) given above, the sound volume S(n) of the frame n is smaller than γ times the noise power N(n−1) of the immediately preceding frame n−1, then the noise power N(n) of the frame n is updated. On the other hand, if the sound volume S(n) of the frame n is equal to or greater than γ times the noise power N(n−1) of the immediately preceding frame n−1, then the noise power N(n−1) of the immediately preceding frame n−1 is set as the noise power N(n) of the frame n.
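(Expression 3) is likewise not reproduced above; the sketch below assumes the estimate is refreshed only while the sound volume S(n) stays below γ times the noise power N(n−1), which matches the ratio-based description. The function name, smoothing constant and initial value are assumptions.

```python
def estimate_noise_v2(S, alpha=0.9, gamma=2.0, n_init=None):
    """Noise power tracking in the spirit of (noise estimation method 2).

    Assumed update: N(n) = alpha * N(n-1) + (1 - alpha) * S(n) while
    S(n) < gamma * N(n-1); otherwise N(n) = N(n-1)."""
    noise = []
    prev_noise = S[0] if n_init is None else n_init  # assumed initial value N(-1)
    for s in S:
        if s < gamma * prev_noise:   # frame looks like noise -> refresh the estimate
            prev_noise = alpha * prev_noise + (1.0 - alpha) * s
        # else: carry N(n-1) forward unchanged
        noise.append(prev_noise)
    return noise
```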
Referring to
Here, L may be set to a value greater than the general length of an assimilated sound, and may be set, for example, to the number of frames corresponding to 0.5 msec.
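(Expression 4) for the average SNR is not reproduced above. As a rough sketch under the assumption that the average SNR of frame n is the mean, over the current and the preceding L−1 frames, of the per-frame SNR in decibels computed from the sound volume S(n) and the noise power N(n), the calculation might look as follows; the dB form and the window handling at the start of the signal are assumptions.

```python
import numpy as np

def average_snr(S, N, L=25, eps=1e-12):
    """Average SNR(n) over the current and preceding L-1 frames (assumed form
    of (Expression 4)).  S and N are per-frame sound volume and noise power."""
    snr_db = 10.0 * np.log10((np.asarray(S, dtype=float) + eps) /
                             (np.asarray(N, dtype=float) + eps))
    out = np.empty_like(snr_db)
    for n in range(len(snr_db)):
        lo = max(0, n - L + 1)       # shorter window near the start of the input
        out[n] = snr_db[lo:n + 1].mean()
    return out
```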
The temporal segment determination portion 12 receives the average SNR from the average SNR calculation portion 11. The temporal segment determination portion 12 includes a buffer or a cache (not depicted) and retains a flag n_breath indicative of whether or not the frame processed immediately before by the temporal segment determination portion 12 is within an utterance temporal segment (in other words, within a breath temporal segment). On the basis of the average SNR and the flag n_breath, the temporal segment determination portion 12 detects a start point Ts(n) of an utterance temporal segment using the expression (5) given below and an end point Te(n) of the utterance temporal segment using the expression (6) given below:
Ts(n)=n×M (Expression 5)
(if n_breath=no utterance temporal segment and SNR(n)>THSNR)
Te(n)=n×M−1 (Expression 6)
(if n_breath=utterance temporal segment and SNR(n)<THSNR)
Here, THSNR is an arbitrary threshold value for regarding that the frame n processed by the temporal segment determination portion 12 does not correspond to noise (the threshold value may be referred to as a fifth threshold value (for example, the fifth threshold value = 12 dB)), and may be set experimentally. It is to be noted that the start point Ts(n) of the utterance temporal segment can be regarded as a sample number at the start point of the utterance temporal segment, and the end point Te(n) can be regarded as a sample number at the end point of the utterance temporal segment. Further, the temporal segment determination portion 12 can detect a temporal segment other than the utterance temporal segments in the transmission voice as an unvoiced temporal segment.
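The start/end logic of (Expression 5) and (Expression 6) can be summarized in a short sketch such as the following; the threshold of 12 dB is the example value given above, and the sample-number bookkeeping assumes M samples per frame.

```python
def detect_utterance_segments(avg_snr, M=160, th_snr=12.0):
    """Detect utterance (breath) temporal segments from the average SNR.

    A segment starts at sample n*M when SNR(n) rises above TH_SNR while no
    segment is open (Expression 5), and ends at sample n*M - 1 when SNR(n)
    falls below TH_SNR while a segment is open (Expression 6)."""
    segments = []
    in_breath = False            # corresponds to the flag n_breath
    start = 0
    for n, snr in enumerate(avg_snr):
        if not in_breath and snr > th_snr:
            start = n * M                         # Ts(n) = n * M
            in_breath = True
        elif in_breath and snr < th_snr:
            segments.append((start, n * M - 1))   # Te(n) = n * M - 1
            in_breath = False
    return segments
```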
Referring to
The calculation unit 4 calculates the temporal segment length L(n) of an utterance temporal segment, which is an example of the first feature value, from a start point and an end point of the utterance temporal segment using the following expression:
L(n)=Te(n)−Ts(n) (Expression 7)
It is to be noted that, in the (Expression 7) above, Ts(n) is a sample number at the start point of the utterance temporal segment, and Te(n) is a sample number at an end point of the utterance temporal segment. It is to be noted that Ts(n) and Te(n) can be calculated using the (Expression 5) and the (Expression 6) given hereinabove, respectively. Further, the calculation unit 4 detects the number of vowels within an utterance temporal segment, which is an example of the first feature value, for example, from a Formant distribution. The calculation unit 4 can use, as the detection method of the number of vowels based on a Formant distribution, the method disclosed, for example, in Japanese Laid-open Patent Publication No. 2009-258366. The calculation unit 4 outputs the calculated first feature value to the determination unit 5.
The determination unit 5 is, for example, a hardware circuit configured by hard-wired logic. In addition, the determination unit 5 may be a functional module implemented by a computer program executed by the voice processing device 1. The determination unit 5 receives a first feature value from the calculation unit 4. The determination unit 5 determines the frequency of appearance, in a transmission voice, of a second feature value with which the first feature value is smaller than a given first threshold value. In other words, the determination unit 5 determines the frequency with which a second feature value appears in the transmission voice as a response (back-channel feedback) to an utterance of a reception voice (in other words, a received voice). In still other words, on the basis of the first feature value, the determination unit 5 determines the frequency with which a second feature value, appearing in the transmission voice as a response indicating understanding of the reception voice, appears in the transmission voice within an utterance temporal segment of the reception voice (the utterance temporal segment may be referred to as a second utterance temporal segment). It is to be noted that the process just described corresponds to step S204 of the flow chart depicted in
Further, the determination unit 5 may recognize the transmission voice as a character string and determine, from the character string, the number of times a given word corresponding to the second feature value appears as the frequency of appearance of the second feature value. The determination unit 5 can apply, as the method for recognizing a transmission voice as a character string, the method disclosed, for example, in Japanese Laid-open Patent Publication No. 04-255900. Such given words are words that correspond to back-channel feedbacks stored in a word list (table) written in a cache or a memory (not depicted) provided in the determination unit 5. The given words may be words that generally correspond to back-channel feedbacks such as, for example, “yes,” “no,” “yeah,” “really?” and “that's right.”
Then, the determination unit 5 determines the number of times of appearance of the second feature value per unit time period as the frequency. The determination unit 5 can calculate, as the frequency freq(t), the number of times of appearance of the second feature value corresponding to a back-channel feedback, for example, per minute, using the following expression:
It is to be noted that, in the (Expression 8) above, L(n) is a temporal segment length of the utterance temporal segment; Ts(n) is a sample number at the start point of the utterance temporal segment; TH2 is the second threshold value; and TH3 is the third threshold value.
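(Expression 8) itself is not reproduced above. One plausible reading, consistent with the variables listed, is to count per minute the utterance temporal segments whose length L(n) lies between the second and third threshold values, i.e. segments short enough to be a back-channel response; the sketch below follows that reading and assumes the times are expressed in seconds.

```python
def backchannel_count_per_minute(seg_starts, seg_lengths, t, th2, th3):
    """Assumed form of freq(t): the number of utterance temporal segments that
    start within (t, t + 60) and whose length L(n) lies between TH2 and TH3."""
    return sum(1 for ts, ln in zip(seg_starts, seg_lengths)
               if t < ts < t + 60 and th2 < ln < th3)
```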
When the determination unit 5 recognizes the above-described transmission voice as a character string and determines, from the character string, the number of times a given word corresponding to the second feature value appears, the determination unit 5 may utilize the appearance interval of the second feature value per unit time period as the frequency. The determination unit 5 can calculate, as the frequency freq′(t), the average time interval at which the second feature value corresponding to a back-channel feedback appears, for example, per minute, using the following expression:
It is to be noted that, in the (Expression 9) above, Ts′(n) is a sample number at the start point of a second feature value temporal segment, and Te′(n) is a sample number at the end point of the second feature value temporal segment.
Furthermore, the determination unit 5 may determine, as the frequency, the ratio of the number of times of appearance of the second feature value to the number of utterance temporal segments. In other words, the determination unit 5 can calculate the frequency freq″(t) with which the second feature value appears in accordance with the following expression, using the number of utterance temporal segments and the number of times of appearance of the second feature value corresponding to a back-channel feedback, for example, per minute:
It is to be noted that, in the (Expression 10) above, L(n) is a temporal segment length of the utterance temporal segment; Ts(n) is the sample number at the start point of the utterance temporal segment; NV(n) is the second feature value; TH2 is the second threshold value; and TH3 is the third threshold value. The determination unit 5 outputs the determined frequency to the estimation unit 6.
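For completeness, the two alternative frequency measures described above can be sketched in the same style; the exact forms of (Expression 9) and (Expression 10) are not reproduced above, so these are plausible readings only, again with times assumed to be in seconds.

```python
def backchannel_mean_interval(bc_starts, t):
    """Assumed form of freq'(t): the average interval, within the minute
    starting at t, between consecutive appearances of the second feature value
    (word-based back-channels)."""
    starts = sorted(s for s in bc_starts if t < s < t + 60)
    if len(starts) < 2:
        return None                  # not enough appearances to form an interval
    gaps = [b - a for a, b in zip(starts[:-1], starts[1:])]
    return sum(gaps) / len(gaps)

def backchannel_ratio(num_backchannels, num_utterance_segments):
    """Assumed form of freq''(t): the ratio of the number of back-channel
    appearances to the number of utterance temporal segments in the same minute."""
    if num_utterance_segments == 0:
        return 0.0
    return num_backchannels / num_utterance_segments
```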
The estimation unit 6 is, for example, a hardware circuit configured by hard-wired logic. Besides, the estimation unit 6 may be a functional module implemented by a computer program executed by the voice processing device 1. The estimation unit 6 receives a frequency from the determination unit 5. The estimation unit 6 estimates an utterance time period of the reception voice (second user) on the basis of the frequency. It is to be noted that the process just described corresponds to step S205 of the flow chart depicted in
Here, the technological significance of estimating an utterance time period of a reception voice on the basis of the frequency in the working example 1 is described. As a result of intensive verification by the inventors of the present technology, the following technological matters became apparent. The inventors paid attention to the fact that, while the second user (the other party) is talking, the first user (oneself) performs back-channel feedback behavior, and newly carried out intensive verification of the possibility that the utterance time period of the other party (which may be referred to as the utterance time period of a reception voice) can be estimated by making use of the frequency of the back-channel feedback of the first user.
As depicted in
The estimation unit 6 estimates the utterance time period of a reception voice on the basis of a first correlation between the frequency and the utterance time period determined in advance. It is to be noted that the first correlation can be suitably set experimentally on the basis of, for example, the correlation depicted in
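The first correlation itself is determined experimentally and is not given numerically in the text; purely to illustrate how the estimation step could be wired up, the sketch below uses a hypothetical linear mapping with made-up coefficients a and b, capped at one minute.

```python
def estimate_reception_utterance_time(freq, a=4.0, b=5.0, t_max=60.0):
    """Estimate the utterance time period (seconds per minute) of the reception
    voice from the back-channel frequency via a hypothetical first correlation.
    The coefficients a and b are illustrative only."""
    return min(t_max, a * freq + b)
```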
Besides, when the total value of the temporal segment lengths of the utterance temporal segments is lower than a fourth threshold value (for example, the fourth threshold value = 15 sec), the estimation unit 6 may estimate the utterance time period of the reception voice on the basis of the frequency and a second correlation, with which the utterance time period of the reception voice is set shorter, for the same frequency, than with the first correlation described hereinabove. The estimation unit 6 calculates the total value TL1(t) of the temporal segment lengths of the utterance temporal segments per unit time period (for example, per minute) using the following expression:
TL1(t) = Σ_{t<Ts(n)<t+60} L(n) (Expression 11)
It is to be noted that, in the (Expression 11) above, L(n) is a temporal segment length of an utterance temporal segment, and Ts(n) is a sample number at the start point of the utterance temporal segment.
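The total segment length of (Expression 11) and the switch to the second correlation can be sketched as follows; the two linear correlations are hypothetical and only illustrate that the second one yields a shorter estimate for the same frequency.

```python
def total_utterance_length(seg_starts, seg_lengths, t):
    """(Expression 11): TL1(t) = sum of L(n) over segments with t < Ts(n) < t + 60.
    Times are assumed to be expressed in seconds."""
    return sum(ln for ts, ln in zip(seg_starts, seg_lengths) if t < ts < t + 60)

def estimate_with_correlation_switch(freq, tl1, th4=15.0):
    """Use a (hypothetical) second correlation, which gives a shorter estimate,
    when the total transmission-utterance length TL1(t) is below the fourth
    threshold value; otherwise use the (hypothetical) first correlation."""
    first = min(60.0, 4.0 * freq + 5.0)    # hypothetical first correlation
    second = min(60.0, 3.0 * freq + 2.0)   # hypothetical, deliberately shorter
    return second if tl1 < th4 else first
```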
The estimation unit 6 outputs an estimated utterance time period of a reception voice to an external apparatus. It is to be noted that the process just described corresponds to step S206 of the flow chart depicted in
R(t)=TL2(t)/TL1(t) (Expression 12)
It is to be noted that, in the (Expression 12) above, TL1(t) can be calculated using the (Expression 11) given hereinabove and TL2(t) can be calculated using a method similar to the method for TL1(t), and therefore, detailed descriptions of TL1(t) and TL2(t) are omitted herein.
The estimation unit 6 originates a control signal on the basis of comparison represented by the following expression between the ratio R(t) calculated using the (Expression 12) given above and a given sixth threshold value (for example, the sixth threshold value=0.5):
If R(t) < TH5, CS(t) = 1 (control signal originated)
else CS(t) = 0 (control signal not originated) (Expression 13)
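A compact sketch of (Expression 12) and (Expression 13): the control signal is originated when the ratio R(t) of the reception-side total TL2(t) to the transmission-side total TL1(t) falls below the threshold (0.5 in the example above); the handling of TL1(t) = 0 is an assumption.

```python
def control_signal(tl1, tl2, th=0.5):
    """CS(t) = 1 (control signal originated) if R(t) = TL2(t)/TL1(t) < threshold,
    otherwise CS(t) = 0."""
    if tl1 == 0:
        return 0                 # assumption: no transmission utterance, no signal
    return 1 if tl2 / tl1 < th else 0
```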
With the voice processing device according to the working example 1, the utterance time period of a reception voice can be estimated without relying upon ambient noise.
The reception unit 7 is, for example, a hardware circuit configured by hard-wired logic. Besides, the reception unit 7 may be a functional module implemented by a computer program executed by the voice processing device 20. The reception unit 7 receives a reception voice, which is an example of an input voice, for example, through a wired circuit or a wireless circuit. The reception unit 7 outputs the received reception voice to the evaluation unit 8.
The evaluation unit 8 receives a reception voice from the reception unit 7. The evaluation unit 8 evaluates a second signal-to-noise ratio of the reception voice. The evaluation unit 8 can apply, as an evaluation method of a second signal-to-noise ratio, a technique similar to the technique for detection of the first signal-to-noise ratio by the detection unit 3 in the working example 1. The evaluation unit 8 evaluates an average SNR that is an example of the second signal-to-noise ratio, for example, using the (Expression 4) given hereinabove. If the average SNR that is an example of the second signal-to-noise ratio is lower than a given seventh threshold value (for example, the seventh threshold value=10 dB), then the evaluation unit 8 issues an instruction to carry out a voice processing method on the basis of the working example 1 to the acquisition unit 2. In other words, the acquisition unit 2 determines whether or not a transmission voice is to be acquired on the basis of the second signal-to-noise ratio. On the other hand, if the average SNR that is an example of the second signal-to-noise ratio is equal to or higher than the seventh threshold value, then the evaluation unit 8 outputs the reception voice to the detection unit 3 so that the detection unit 3 detects the utterance temporal segment of the reception voice (the utterance temporal segment may be referred to as second utterance temporal segment). It is to be noted that, as the detection method for an utterance temporal segment of the reception voice, the detection method of a first utterance temporal segment disclosed through the working example 1 can be used similarly, and therefore, detailed description of the detection method is omitted herein. The detection unit 3 outputs the detected utterance temporal segment of the reception voice (second utterance temporal segment) to the estimation unit 6.
The estimation unit 6 uses the utterance time period L of the reception voice, estimated by the method disclosed in the working example 1, to estimate, as the utterance temporal segment of the reception voice, a central temporal segment [Ts2, Te2] within the temporal segment [Ts1, Te1] in which the second feature value appears per unit time period. It is to be noted that the central temporal segment [Ts2, Te2] can be calculated using the following expression:
Ts2=(Ts1+Te1)/2−L/2 (Expression 14)
Te2=(Ts1+Te1)/2+L/2
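A direct transcription of (Expression 14) as a small helper, assuming Ts1, Te1 and the estimated utterance time period L are all expressed in the same units (for example, seconds):

```python
def central_segment(ts1, te1, L):
    """(Expression 14): place a segment of length L at the center of [Ts1, Te1]
    and regard it as the utterance temporal segment of the reception voice."""
    mid = (ts1 + te1) / 2.0
    return mid - L / 2.0, mid + L / 2.0   # (Ts2, Te2)
```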
With the voice processing device according to the working example 2, it is possible to estimate an utterance time period of a reception voice in accordance with signal quality of the reception voice without relying upon ambient noise. Further, with the voice processing device according to the working example 2, it is possible to estimate an utterance temporal segment of a reception voice.
The antenna 31 transmits a wireless signal amplified by a transmission amplifier and receives a wireless signal from a base station. The wireless unit 32 digital-to-analog converts a transmission signal spread by the baseband processing unit 33, converts the resulting analog transmission signal into a high frequency signal by quadrature modulation and amplifies the high frequency signal by a power amplifier. The wireless unit 32 also amplifies a received wireless signal, analog-to-digital converts the amplified signal and transmits the resulting digital signal to the baseband processing unit 33.
The baseband processing unit 33 performs baseband processes such as error correction coding of transmission data, data modulation, determination of a reception signal and a reception environment, threshold value determination for channel signals and error correction decoding.
The control unit 37 is, for example, a Central Processing Unit (CPU), a Micro Processing Unit (MPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC) or a Programmable Logic Device (PLD). The control unit 37 performs wireless control such as transmission and reception of a control signal. Further, the control unit 37 executes a voice processing program stored in the auxiliary storage unit 39 or the like and performs, for example, the voice processes in the working example 1 or the working example 2. In other words, the control unit 37 can execute processing of the functional blocks such as, for example, the acquisition unit 2, the detection unit 3, the calculation unit 4, the determination unit 5, the estimation unit 6, the reception unit 7 and the evaluation unit 8 depicted in
The main storage unit 38 is a Read Only Memory (ROM), a Random Access Memory (RAM) or the like and is a storage device that stores or temporarily retains data and programs such as an Operating System (OS), which is basic software, and application software that are executed by the control unit 37.
The auxiliary storage unit 39 is a Hard Disk Drive (HDD), a Solid State Drive (SSD) or the like and is a storage device for storing data relating to application software or the like.
The terminal interface unit 34 performs adapter processing for data and interface processing with a handset and an external data terminal.
The microphone 35 receives a voice of an utterer (for example, a first user) as an input thereto and outputs the voice as a microphone signal to the control unit 37. The speaker 36 outputs a signal outputted from the control unit 37 as an output voice or a control signal.
The computer 100 is controlled entirely by a processor 101. To the processor 101, a RAM 102 and a plurality of peripheral apparatuses are coupled through a bus 109. It is to be noted that the processor 101 may be a multiprocessor. Further, the processor 101 is, for example, a CPU, an MPU, a DSP, an ASIC or a PLD. Further, the processor 101 may be a combination of two or more of a CPU, an MPU, a DSP, an ASIC and a PLD. It is to be noted that, for example, the processor 101 may execute processes of functional blocks such as the acquisition unit 2, the detection unit 3, the calculation unit 4, the determination unit 5, the estimation unit 6, the reception unit 7 and the evaluation unit 8 depicted in
The RAM 102 is used as a main memory of the computer 100. The RAM 102 temporarily stores at least part of a program of an OS and application programs to be executed by the processor 101. Further, the RAM 102 stores various data to be used for processing by the processor 101. The peripheral apparatuses coupled to the bus 109 include an HDD 103, a graphic processing device 104, an input interface 105, an optical drive unit 106, an apparatus coupling interface 107 and a network interface 108.
The HDD 103 performs writing and reading out of data magnetically on and from a disk built in the HDD 103. The HDD 103 is used, for example, as an auxiliary storage device of the computer 100. The HDD 103 stores a program of an OS, application programs and various data. It is to be noted that also a semiconductor storage device such as a flash memory can be used as an auxiliary storage device.
A monitor 110 is coupled to the graphic processing device 104. The graphic processing device 104 controls the monitor 110 to display various images on a screen in accordance with an instruction from the processor 101. The monitor 110 may be a display unit that uses a Cathode Ray Tube (CRT), a liquid crystal display unit or the like.
To the input interface 105, a keyboard 111 and a mouse 112 are coupled. The input interface 105 transmits a signal sent thereto from the keyboard 111 or the mouse 112 to the processor 101. It is to be noted that the mouse 112 is an example of a pointing device and also it is possible to use a different pointing device. As the different pointing device, a touch panel, a tablet, a touch pad, a track ball and so forth are available.
The optical drive unit 106 performs reading out of data recorded on an optical disc 113 utilizing a laser beam or the like. The optical disc 113 is a portable recording medium on which data are recorded so as to be read by reflection of light. As the optical disc 113, a Digital Versatile Disc (DVD), a DVD-RAM, a Compact Disc Read Only Memory (CD-ROM), a CD-R (Recordable)/RW (ReWritable) and so forth are available. A program stored on the optical disc 113 serving as a portable recording medium is installed into the voice processing device 1 through the optical drive unit 106. The given program installed in this manner is enabled for execution by the voice processing device 1.
The apparatus coupling interface 107 is a communication interface for coupling a peripheral apparatus to the computer 100. For example, a memory device 114 or a memory reader-writer 115 can be coupled to the apparatus coupling interface 107. The memory device 114 is a recording medium that incorporates a communication function with the apparatus coupling interface 107. The memory reader-writer 115 is an apparatus that performs writing of data into a memory card 116 and reading out of data from the memory card 116. The memory card 116 is a card type recording medium. To the apparatus coupling interface 107, a microphone 35 and a speaker 36 can be coupled further.
The network interface 108 is coupled to a network 117. The network interface 108 performs transmission and reception of data to and from a different computer or a communication apparatus through the network 117.
The computer 100 implements the voice processing function described hereinabove by executing a program recorded, for example, on a computer-readable recording medium. A program that describes the contents of processing to be executed by the computer 100 can be recorded on various recording media. The program can be configured from one or a plurality of functional modules. For example, the program can be configured from functional modules that implement the processes of the acquisition unit 2, the detection unit 3, the calculation unit 4, the determination unit 5, the estimation unit 6, the reception unit 7, the evaluation unit 8 and so forth depicted in
The components of the devices and the apparatus depicted in the figures need not necessarily be configured physically in such a manner as depicted in the figures. In particular, the particular form of integration or disintegration of the devices and apparatus is not limited to that depicted in the figures, and all or part of the devices and apparatus can be configured in a functionally or physically integrated or disintegrated manner in an arbitrary unit in accordance with loads, use situations and so forth of the devices and apparatus. Further, the various processes described in the foregoing description of the working examples can be implemented by execution of a program prepared in advance by a computer such as a personal computer or a work station.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.