This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2015-171274, filed on Aug. 31, 2015, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an utterance condition determination apparatus.
As a technology to estimate an emotional condition of each speaker in a voice call, a technology has been known such that whether or not a speaker (an opposing speaker) is in a state of anger is determined by using the number of backchannel feedback of the speaker (see Patent Document 1 as an example).
As a technology to detect an emotional condition of a speaker (an opposing speaker) during a voice call, a technology has been known such that whether or not the speaker is in a state of excitement is detected by using intervals of backchannel utterance etc. (see Patent Document 2 as an example).
In addition, as a technology to detect backchannel feedbacks from voice signals, a technology has been known such that an utterance section of a voice signal is compared with backchannel data registered in a backchannel feedback dictionary and a section in the utterance section that matches the backchannel data is detected as a backchannel section (see Patent Document 3 as an example).
Moreover, as a technology to record a conversation between two people by a voice call etc., and to reproduce recorded data of the conversation (the voice call) after the conversation is ended, a technology has been known such that a reproduction speed is changed in accordance with a speech rate of a speaker (see Patent Document 4 as an example).
Furthermore, it has been known that vowels can be used as a feature amount of a voice of a speaker (see Non-Patent Document 1 as an example).
Non-Patent Document 1: “Onsei (voice) 1”, [online], [searched on Aug. 29, 2015], the Internet <URL:http://media.sys.wakayama-u.ac.jp/kawahara-lab/LOCAL/diss/diss7/S3_6.htm>
According to an aspect of the embodiment, an utterance condition determination device includes a memory configured to store a voice signal of a first speaker and a voice signal of a second speaker, and a processor configured to estimate an average backchannel frequency that represents a backchannel frequency of the second speaker in a period of time from a voice start time of the voice signal of the second speaker to a predetermined time based on the voice signal of the first speaker and the voice signal of the second speaker, to calculate the backchannel frequency of the second speaker for each unit of time based on the voice signal of the first speaker and the voice signal of the second speaker, and to determine a satisfaction level of the second speaker based on the estimated average backchannel frequency and the calculated backchannel frequency.
The object and advantages of the embodiment will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the embodiment.
Preferred embodiments of the present invention will be explained with reference to accompanying drawings.
Estimation (determination) of whether or not a speaker is in a state of anger or in a state of dissatisfaction uses a relationship between an emotional condition of the speaker and the way in which the speaker gives backchannel feedback. More specifically, a speaker gives backchannel feedback fewer times when angry or dissatisfied than when in a normal condition. Therefore, the emotional condition of the opposing speaker can be determined on the basis of, for example, the number of backchannel feedbacks and a certain threshold prepared in advance.
However, because of individual variation in the number and interval of backchannel feedback, it is difficult to determine the emotional condition of a speaker based on a certain threshold. For example, in a case of a determination target speaker who infrequently gives backchannel feedback by nature, the number of times of backchannel feedbacks may be fewer than the threshold even though the speaker gives backchannel feedback more frequently than in his/her normal condition, and in such a case it is likely that the speaker is determined to be in a state of anger. In another example, in a case of a speaker who frequently gives backchannel feedback by nature, even though the speaker is in a state of anger and the number of times of backchannel feedbacks is fewer than his/her normal condition, it is likely that the speaker is determined to be in a normal condition. In the following description, backchannel feedback may be referred to as simply “backchannel”.
The first phone set 2 includes a microphone 201, a voice call processor 202, a receiver (speaker) 203, a display unit 204, and an utterance condition determination device 5. The utterance condition determination device 5 of the first phone set 2 is connected to the display device 6. Note that the number of the first phone set 2 is not limited to only one, but plural sets can be included.
The second phone set 3 is a phone set that can be connected to the first phone set 2 via the IP network 4. The second phone set 3 includes a microphone 301, a voice call processor 302, and a receiver (speaker) 303.
In this voice call system 100, a voice call with the use of the first and second phone sets 2 and 3 becomes available by making a call connection between the first phone set 2 and the second phone set 3 in accordance with the Session Initiation Protocol (SIP) through the IP network 4.
The first phone set 2 converts a voice signal of a first speaker collected by the microphone 201 into a signal for transmission in the voice call processor 202 and transmits the converted signal to the second phone set 3. The first phone set 2 also converts a signal received from the second phone set 3 into a voice signal that can be output from the receiver 203 in the voice call processor 202 and outputs the converted signal to the receiver 203.
The second phone set 3 converts a voice signal of the second speaker (the opposing speaker of the first speaker) collected by the microphone 301 into a signal for transmission in the voice call processor 302 and transmits the converted signal to the first phone set 2. The second phone set 3 also converts a signal received from the first phone set 2 into a voice signal that can be output from the receiver 303 in the voice call processor 302 and outputs the converted signal to the receiver 303.
The voice call processors 202 and 302 in the first phone set 2 and the second phone set 3, respectively, include an encoder, a decoder, and a transceiver unit, although these units are omitted in
The first phone set 2 in the voice call system 100 according to the present embodiment includes the utterance condition determination device 5 and the display unit 204 as described above. In addition, the utterance condition determination device 5 in the first phone set 2 is connected with the display device 6. The display device 6 is used by a person other than the first speaker using the first phone set 2, and this other person may be, for example, a supervisor who supervises the responses of the first speaker.
The utterance condition determination device 5 determines whether or not the utterance condition of the second speaker meets the satisfactory condition (i.e., the satisfaction level of the second speaker) based on the voice signals of the first speaker and the voice signals of the second speaker. The utterance condition determination device 5 also warns the first speaker through the display unit 204 or the display device 6 when the utterance condition of the second speaker does not meet the satisfactory condition. The display unit 204 displays the determination result of the utterance condition determination device 5 (the satisfaction level of the second speaker) and warning etc. Moreover, the display device 6 connected to the first phone set 2 (the utterance condition determination device 5) displays a warning to the first speaker that the utterance condition determination device 5 issues.
The voice section detection unit 501 detects a voice section in voice signals of the first speaker. The voice section detection unit 501 detects a section in which the power obtained from a voice signal is at or above a certain threshold TH from among the voice signals of the first speaker as a voice section.
The backchannel section detection unit 502 detects a backchannel section in voice signals of the second speaker. The backchannel section detection unit 502 performs morphological analysis of the voice signals of the second speaker and detects a section that matches any piece of backchannel data registered in a backchannel dictionary that is not illustrated in
The backchannel frequency calculation unit 503 calculates the number of times of backchannel feedbacks of the second speaker per speech duration of the first speaker as a backchannel frequency of the second speaker. The backchannel frequency calculation unit 503 sets a certain unit of time to be one frame and calculates the backchannel frequency based on the speech duration calculated from the voice section of the first speaker within a frame and the number of times of backchannel feedbacks calculated from the backchannel section of the second speaker.
The average backchannel frequency estimation unit 504 estimates an average backchannel frequency of the second speaker based on the voice signals of the first and second speakers. The average backchannel frequency estimation unit 504 according to the present embodiment calculates an average of the backchannel frequency in a time period in which a prescribed number of frames have elapsed from the voice start time of the voice signals of the second speaker as an estimated value of an average backchannel frequency of the second speaker.
The determination unit 505 determines the satisfaction level of the second speaker, which is in other words, whether or not the second speaker is satisfied, based on the backchannel frequency calculated in the backchannel frequency calculation unit 503 and the average backchannel frequency calculated (estimated) in the average backchannel frequency estimation unit 504.
The warning output unit 506 has the display unit 204 of the first phone set 2 and the display device 6 connected to the utterance condition determination device 5 display a warning when the determinations that the second speaker is not satisfied (i.e., in a state of dissatisfaction) are made a prescribed number of times or more consecutively in the determination unit 505.
In the detection of a voice section and the detection of a backchannel section in the utterance condition determination device 5, for example, processing for each sample n in the voice signal, sectional processing for every time t1, and frame processing for every time t2 are performed as illustrated in
The voice section detection unit 501 uses amplitude s1(n) of each sample in the voice signal of the first speaker and calculates power p1(L) of the voice signal within the section L by using the following formula (1).
In the formula (1), N is the number of samples within the section L.
Next, the voice section detection unit 501 compares the power p1(L) with a predetermined threshold TH and detects each section L for which p1(L)≧TH as a voice section. The voice section detection unit 501 outputs u1(L) provided from the following formula (2) as a detection result.
The backchannel section detection unit 502 extracts an utterance section by performing morphological analysis using amplitude s2(n) of each sample in the voice signal of the second speaker. Next, the backchannel section detection unit 502 compares the extracted utterance section with the backchannel data registered in the backchannel dictionary and detects a section in the utterance section that matches the backchannel data as a backchannel section. The backchannel section detection unit 502 outputs u2(L) provided from the following formula (3) as a detection result.
Based on the detection result of the voice section and the detection result of the backchannel section within the mth frame, the backchannel frequency calculation unit 503 calculates a backchannel frequency IA(m) provided from the following formula (4).
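As a non-limiting illustration, the power-threshold detection of a voice section described above may be sketched as follows. The sketch assumes that the power of a section is the mean square of its sample amplitudes expressed in decibels, and that the threshold TH is given in the same unit; the function name detect_voice_sections and its parameters are hypothetical and do not appear in the embodiment.

```python
import math


def detect_voice_sections(samples, section_len, threshold_db):
    """Return one flag per section: 1 for a voice section, 0 otherwise.

    Assumption: section power is the mean square amplitude in dB.
    """
    flags = []
    # Walk over complete, non-overlapping sections of section_len samples.
    for start in range(0, len(samples) - section_len + 1, section_len):
        section = samples[start:start + section_len]
        mean_square = sum(s * s for s in section) / section_len
        # A silent (all-zero) section has no finite power in dB.
        if mean_square > 0:
            power_db = 10.0 * math.log10(mean_square)
        else:
            power_db = float("-inf")
        flags.append(1 if power_db >= threshold_db else 0)
    return flags
```

For example, a silent section followed by a full-scale section yields the flags 0 and 1 when the threshold is -10 dB.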
In the formula (4), startj and endj are the start time and the end time, respectively, of a section in the voice section in which the detection result u1(L) is 1. In other words, startj is a point in time at which the detection result u1(n) for each sample rises from 0 to 1, and endj is a point in time at which the detection result u1(n) for each sample falls from 1 to 0. In the formula (4), cntA(m) is the number of sections in which the detection result u2(L) in the backchannel section is 1. In other words, cntA(m) is the number of times that the detection result u2(n) for each sample rises from 0 to 1.
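As a non-limiting illustration, the quantities defined above may be combined as in the following sketch, which assumes that the backchannel frequency of a frame is the count cntA(m) divided by the total speech duration of the first speaker, i.e., the sum of (endj - startj) over the voice sections in the frame. The function names, the section_seconds parameter, and the choice to return 0.0 for a frame without first-speaker speech are hypothetical.

```python
def rising_edges(flags):
    """Count 0 -> 1 transitions; the state before the frame is taken as 0."""
    count, prev = 0, 0
    for flag in flags:
        if flag == 1 and prev == 0:
            count += 1
        prev = flag
    return count


def voiced_runs(flags):
    """Return (start, end) index pairs of runs of 1s, end exclusive."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag == 1 and start is None:
            start = i
        elif flag == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs


def backchannel_frequency(u1, u2, section_seconds):
    """Backchannels of the second speaker per second of first-speaker speech."""
    speech = sum(end - start for start, end in voiced_runs(u1)) * section_seconds
    if speech == 0:
        return 0.0  # assumption: no speech in the frame gives frequency 0
    return rising_edges(u2) / speech
```

With u1 = [1, 1, 0, 1, 1, 1, 0, 0], u2 = [0, 0, 1, 0, 0, 0, 1, 1], and sections of 0.5 seconds, the speech duration is 2.5 seconds and two backchannels are counted.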
The average backchannel frequency estimation unit 504 calculates an average JA of the backchannel frequency per time unit (one frame) provided from the following formula (5) as an average backchannel frequency by using the backchannel frequency IA(m) in a prescribed number of frames F1 from the voice start time of the second speaker.
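As a non-limiting illustration, the estimation of the average backchannel frequency may be sketched as follows, assuming that the average JA is the arithmetic mean of the backchannel frequency IA(m) over the first F1 frames. The function name and its arguments are hypothetical.

```python
def estimate_average_frequency(frequencies, f1):
    """Arithmetic mean of the backchannel frequency over the first f1 frames."""
    if f1 <= 0:
        raise ValueError("f1 must be a positive number of frames")
    if len(frequencies) < f1:
        raise ValueError("fewer than f1 frames have been calculated")
    # Only the first f1 frames after the voice start time contribute.
    return sum(frequencies[:f1]) / f1
```

Frames after the first F1 frames are ignored, so later changes in the backchannel frequency do not alter the estimate.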
The determination unit 505 outputs a determination result v(m) based on the criterion formula provided in the following formula (6).
In the formula (6), v(m)=1 indicates that a person at the other end of the line is satisfied, and v(m)=0 indicates that a person at the other end of the line is dissatisfied. In addition, β in the formula (6) represents a correction coefficient (e.g., β=0.7).
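As a non-limiting illustration, the determination may be sketched as a comparison of the backchannel frequency of the current frame with the average backchannel frequency scaled by β. The form of the inequality (satisfied when IA(m) is at or above β·JA) is an assumption made for this sketch, and the function name is hypothetical.

```python
def determine_satisfaction(i_a, j_a, beta=0.7):
    """Return 1 (satisfied) or 0 (dissatisfied) for one frame.

    Assumption: the speaker is treated as satisfied when the frame's
    backchannel frequency is at least beta times the speaker's average.
    """
    return 1 if i_a >= beta * j_a else 0
```

With JA=0.5 and β=0.7, the comparison value is 0.35, so a frame with IA(m)=0.5 is determined as satisfied and a frame with IA(m)=0.2 as dissatisfied. Because the comparison value scales with the average of the individual speaker, a speaker who gives little backchannel feedback by nature is not determined as dissatisfied merely for that reason.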
The warning output unit 506 obtains the determination result v(m) of the determination unit 505 and outputs a warning signal when the results v(m)=0 are obtained in two or more consecutive frames. The warning output unit 506 outputs the second determination result e(m) provided from the following formula (7) as an example of the warning signal.
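As a non-limiting illustration, the warning decision may be sketched as follows, assuming that the second determination result e(m) is 1 when both v(m) and v(m−1) are 0 and is 0 otherwise. The function name and the initial value used before the first frame (1, i.e., not dissatisfied) are hypothetical.

```python
def warning_signals(results, initial=1):
    """Map per-frame results v(m) to warning flags e(m).

    A warning is raised for a frame when that frame and the frame
    immediately before it were both determined as dissatisfied (0).
    """
    signals, prev = [], initial
    for v in results:
        signals.append(1 if v == 0 and prev == 0 else 0)
        prev = v
    return signals
```

For the sequence v = 1, 0, 0, 0, 1, 0, warnings are raised only in the third and fourth frames; an isolated determination of dissatisfaction does not trigger a warning.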
The utterance condition determination device 5 according to the present embodiment performs the processing illustrated in
The utterance condition determination device 5 starts monitoring the voice signals between the first and second speakers (step S100). Step S100 is performed by a monitoring unit (not illustrated) provided in the utterance condition determination device 5. The monitoring unit monitors the voice signal of the first speaker transmitted from the microphone 201 to the voice call processor 202 and the voice signal of the second speaker transmitted from the voice call processor 202 to the receiver 203. The monitoring unit outputs the voice signal of the first speaker to the voice section detection unit 501 and the average backchannel frequency estimation unit 504 and also outputs the voice signal of the second speaker to the backchannel section detection unit 502 and the average backchannel frequency estimation unit 504.
Next, the utterance condition determination device 5 performs the average backchannel frequency estimation processing (step S101). Step S101 is performed by the average backchannel frequency estimation unit 504. The average backchannel frequency estimation unit 504 calculates a backchannel frequency IA(m) in two frames (60 seconds) from the voice start time of the voice signal of the second speaker by using the formulae (1) to (4) as an example. Afterwards, the average backchannel frequency estimation unit 504 outputs to the determination unit 505 an average JA of the backchannel frequency per one frame calculated by using the formula (5) as an average backchannel frequency.
After calculating the average backchannel frequency JA, the utterance condition determination device 5 performs processing to detect a voice section from the voice signal of the first speaker (step S102) and processing to detect a backchannel section from the voice signal of the second speaker (step S103). Step S102 is performed by the voice section detection unit 501. The voice section detection unit 501 calculates the detection result u1(L) of a voice section in the voice signal of the first speaker by using the formulae (1) and (2). The voice section detection unit 501 outputs the detection result u1(L) of the voice section to the backchannel frequency calculation unit 503. On the other hand, step S103 is performed by the backchannel section detection unit 502. The backchannel section detection unit 502, after detecting a backchannel section by the above-described morphological analysis etc., calculates the detection result u2(L) of the backchannel section by using the formula (3). The backchannel section detection unit 502 outputs the detection result u2(L) of the backchannel section to the backchannel frequency calculation unit 503.
Note that in the flowchart in
The utterance condition determination device 5, next, calculates the backchannel frequency of the second speaker based on the voice section of the first speaker and the backchannel section of the second speaker (step S104). Step S104 is performed by the backchannel frequency calculation unit 503. The backchannel frequency calculation unit 503 calculates the backchannel frequency IA(m) of the second speaker in the mth frame by using the formula (4). The backchannel frequency calculation unit 503 outputs the calculated backchannel frequency IA(m) to the determination unit 505.
The utterance condition determination device 5 determines the satisfaction level of the second speaker based on the average backchannel frequency JA and the backchannel frequency IA(m) of the second speaker and outputs the determination result to the display unit and the warning output unit (step S105). Step S105 is performed by the determination unit 505. The determination unit 505 calculates a determination result v(m) by using the formula (6) and outputs the determination result v(m) to the display unit 204 and the warning output unit 506.
The utterance condition determination device 5 decides whether or not the determinations that the second speaker is dissatisfied (determinations of dissatisfaction) were consecutively made in the determination unit 505 (step S106). Step S106 is performed by the warning output unit 506. The warning output unit 506 stores a value of the determination result v(m−1) in the m−1th frame and calculates the second determination result e(m) provided from the formula (7) based on v(m) and v(m−1). When e(m)=1, the warning output unit 506 decides that the determinations of dissatisfaction were consecutively made in the determination unit 505.
When the determinations of dissatisfaction were consecutively made in the determination unit 505 (step S106; YES), the warning output unit 506 outputs a warning signal to the display unit 204 and the display device 6 (step S107). On the other hand, when the determinations of dissatisfaction were not consecutively made in the determination unit 505 (step S106; NO), the warning output unit 506 skips the processing in step S107.
Afterwards, the utterance condition determination device 5 decides whether or not the processing is continued (step S108). When the processing is continued (step S108; YES), the utterance condition determination device 5 repeats the processing in step S102 and the subsequent steps. When the processing is not continued (step S108; NO), the utterance condition determination device 5 ends the monitoring of the voice signals of the first and second speakers and ends the processing.
Note that while the utterance condition determination device 5 performs the above-described processing, the display unit 204 of the first phone set 2 and the display device 6 display the satisfaction level of the second speaker and other matters. At the time of starting a voice call, the display unit 204 of the first phone set 2 and the display device 6 display that the second speaker does not feel dissatisfied, and the displays in accordance with the determination result v(m) of the determination unit 505 are provided afterward. When the warning signal is output from the warning output unit 506, the display unit 204 of the first phone set 2 and the display device 6 switch the display related to the satisfaction level of the second speaker to a display in accordance with the warning signal.
The average backchannel frequency estimation unit 504 of the utterance condition determination device 5 according to the present embodiment performs the processing illustrated in
The average backchannel frequency estimation unit 504 performs processing to detect a voice section from a voice signal of the first speaker (step S101a) and processing to detect a backchannel section from a voice signal of the second speaker (step S101b). In the processing in step S101a, the average backchannel frequency estimation unit 504 calculates a detection result u1(L) of a voice section in the voice signal of the first speaker by using the formulae (1) and (2). In the processing in step S101b, the average backchannel frequency estimation unit 504, after detecting a backchannel section by the above-described morphological analysis etc., calculates a detection result u2(L) of the backchannel section by using the formula (3).
Note that in the flowchart in
The average backchannel frequency estimation unit 504, next, calculates a backchannel frequency IA(m) of the second speaker based on the voice section of the first speaker and the backchannel section of the second speaker (step S101c). In the processing in step S101c, the average backchannel frequency estimation unit 504 calculates a backchannel frequency IA(m) of the second speaker in the mth frame by using the formula (4).
Afterwards, the average backchannel frequency estimation unit 504 checks whether or not the backchannel frequency in a prescribed number of frames F1 from the voice start time of the second speaker is calculated (step S101d). When the backchannel frequency in the prescribed number of frames (e.g., F1=2) is not calculated (step S101d; NO), the average backchannel frequency estimation unit 504 repeats the processing in steps S101a to S101c. When the backchannel frequency in the prescribed number of frames is calculated (step S101d; YES), the average backchannel frequency estimation unit 504 calculates an average JA of the backchannel frequency of the second speaker from the backchannel frequency in a prescribed number of frames (step S101e). In the processing in step S101e, the average backchannel frequency estimation unit 504 calculates an average JA of the backchannel frequency per one frame by using the formula (5). After calculating the average JA of the backchannel frequency, the average backchannel frequency estimation unit 504 outputs the average JA of the backchannel frequency to the determination unit 505 as an average backchannel frequency and ends the average backchannel frequency estimation processing.
As described above, Embodiment 1 calculates an average JA of the backchannel frequency in voice signals in a prescribed number of frames (e.g., 60 seconds) from the voice start time of the second speaker as an average backchannel frequency and determines whether or not the second speaker is satisfied on the basis of this average backchannel frequency. During a prescribed number of frames from the voice start time, i.e., immediately after the voice call is started, the second speaker is estimated to be in a normal condition. Therefore, the backchannel frequency of the second speaker during a prescribed number of frames from the voice start time can be regarded as a backchannel frequency of the second speaker in a normal condition. As a result, according to Embodiment 1, it is possible to determine whether or not the second speaker is satisfied in consideration of an average backchannel frequency that is unique to the second speaker and it is therefore also possible to improve accuracy in determination of emotional conditions of a speaker based on a way of giving backchannel feedback.
Note that the utterance condition determination device 5 according to the present embodiment may be applied not only to the voice call system 100 that uses the IP network 4 as illustrated in
In addition, the average backchannel frequency estimation unit 504 in the utterance condition determination device 5 illustrated in
The first phone set 2 includes a microphone 201, a voice call processor 202, and a receiver 203. Note that the number of the first phone set 2 is not limited to only one, but plural sets can be included. The second phone set 3 is a phone set that can be connected to the first phone set 2 via the IP network 4. The second phone set 3 includes a microphone 301, a voice call processor 302, and a receiver 303.
The splitter 8 splits the voice signal of the first speaker transmitted from the voice call processor 202 of the first phone set 2 to the second phone set 3 and the voice signal of the second speaker transmitted from the second phone set 3 to the voice call processor 202 of the first phone set 2 and inputs the split signal to the response evaluation device 9. The splitter 8 is provided on a transmission path between the first phone set 2 and the IP network 4.
The response evaluation device 9 is a device that determines the satisfaction level of the second speaker (the opposing speaker of the first speaker) by using an utterance condition determination device 5. The response evaluation device 9 includes a receiver unit 901, a decoder 902, a display unit 903, and the utterance condition determination device 5.
The receiver unit 901 receives voice signals of the first and second speakers split by the splitter 8. The decoder 902 decodes the received voice signals of the first and second speakers to analog signals. The utterance condition determination device 5 determines the utterance conditions of the second speaker, i.e., whether or not the second speaker is satisfied, based on the decoded voice signals of the first and second speakers. The display unit 903 displays a determination result etc. of the utterance condition determination device 5.
In this voice call system 110, similarly to the voice call system 100 according to Embodiment 1, a voice call using the phone sets 2 and 3 becomes available by making a call connection between the first phone set 2 and the second phone set 3 in accordance with SIP.
The voice section detection unit 511 detects a voice section in voice signals of the first speaker. Similarly to the voice section detection unit 501 of the utterance condition determination device 5 according to Embodiment 1, the voice section detection unit 511 detects a section in which the power obtained from a voice signal is at or above a certain threshold TH from among the voice signals of the first speaker as a voice section.
The backchannel section detection unit 512 detects a backchannel section in voice signals of the second speaker. Similarly to the backchannel section detection unit 502 of the utterance condition determination device 5 according to Embodiment 1, the backchannel section detection unit 512 performs morphological analysis of the voice signals of the second speaker and detects a section that matches any piece of backchannel data registered in a backchannel dictionary as a backchannel section.
The backchannel frequency calculation unit 513 calculates the number of times of backchannel feedbacks of the second speaker per speech duration of the first speaker as a backchannel frequency of the second speaker. The backchannel frequency calculation unit 513 sets a certain unit of time to be one frame and calculates the backchannel frequency based on the speech duration calculated from the voice section of the first speaker within a frame and the number of times of backchannel feedbacks calculated from the backchannel section of the second speaker. Note that the backchannel frequency calculation unit 513 in the utterance condition determination device 5 according to the present embodiment calculates a backchannel frequency IB(m) provided from the following formula (8) by using the detection result of the voice section and the detection result of the backchannel section within the mth frame.
In the formula (8), similarly to the formula (4), startj and endj are the start time and the end time, respectively, of a section in the voice section in which the detection result u1(L) is 1. In other words, the start time startj is a point in time at which the detection result u1(n) for each sample rises from 0 to 1, and the end time endj is a point in time at which the detection result u1(n) for each sample falls from 1 to 0. In the formula (8), cntB(m) is the number of times of the backchannel feedbacks calculated from the number of backchannel sections of the second speaker detected between the start time startj and the end time endj in the voice section of the first speaker in the mth frame.
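As a non-limiting illustration, the count cntB(m) may be sketched as follows. The sketch assumes that a backchannel section is counted when its onset (the rise of u2 from 0 to 1) falls at a time at which the first speaker is speaking (u1 is 1); this reading of "detected between the start time and the end time" and the function name are assumptions made for this sketch.

```python
def count_backchannels_in_speech(u1, u2):
    """Count backchannel onsets of the second speaker that occur while
    the first speaker is speaking.

    u1 and u2 are equally long 0/1 sequences for one frame.
    """
    count, prev = 0, 0
    for speaking, backchannel in zip(u1, u2):
        # An onset is a 0 -> 1 transition of the backchannel flag.
        if backchannel == 1 and prev == 0 and speaking == 1:
            count += 1
        prev = backchannel
    return count
```

With u1 = [1, 1, 1, 0, 0, 1, 1] and u2 = [0, 1, 0, 0, 1, 0, 1], three backchannel onsets occur, but the one at the fifth position falls outside the speech of the first speaker and is not counted.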
The average backchannel frequency estimation unit 514 estimates an average backchannel frequency of the second speaker. Note that the average backchannel frequency estimation unit 514 according to the present embodiment calculates an average JB of the backchannel frequency provided from an update equation of the following formula (9) as an estimated value of the average backchannel frequency of the second speaker.
JB(m)=ε·JB(m−1)+(1−ε)·IB(m) (9)
In the formula (9), ε represents an update coefficient and can be any value of 0<ε<1 (e.g., ε=0.9). Additionally, JB(0)=0.1 is given.
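As a non-limiting illustration, the update equation of the formula (9) may be sketched as follows, using ε=0.9 and JB(0)=0.1 as the default values. The function names are hypothetical.

```python
def update_average(prev_avg, freq, eps=0.9):
    """One step of formula (9): JB(m) = eps * JB(m-1) + (1 - eps) * IB(m)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must satisfy 0 < eps < 1")
    return eps * prev_avg + (1.0 - eps) * freq


def running_averages(frequencies, eps=0.9, initial=0.1):
    """Apply formula (9) frame by frame, starting from JB(0) = initial."""
    averages, current = [], initial
    for freq in frequencies:
        current = update_average(current, freq, eps)
        averages.append(current)
    return averages
```

For example, with JB(m−1)=0.1 and IB(m)=1.1, the update gives JB(m)=0.9×0.1+0.1×1.1=0.2. A value of ε close to 1 makes the average follow the backchannel frequency slowly, so that a short drop in the backchannel frequency changes the estimate only slightly.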
The determination unit 515 determines the satisfaction level of the second speaker, i.e., whether or not the second speaker is satisfied, based on the backchannel frequency IB(m) calculated in the backchannel frequency calculation unit 513 and the average backchannel frequency JB(m) calculated (estimated) in the average backchannel frequency estimation unit 514. The determination unit 515 outputs a determination result v(m) based on the criterion formula provided in the following formula (10).
The sentence output unit 516 reads out a sentence corresponding to the determination result v(m) of the satisfaction level in the determination unit 515 from the storage unit 517 and has the display unit 903 display the sentence.
The determination result v(m) of the satisfaction level according to the present embodiment is either one of two values 0 and 1, as provided in the formula (10). Therefore, the storage unit 517 stores two types of sentences w(m) including a sentence displayed when v(m)=0 and a sentence displayed when v(m)=1, as illustrated in
The utterance condition determination device 5 according to the present embodiment performs the processing illustrated in
The utterance condition determination device 5 starts acquiring the voice signals of the first and second speakers (step S200). Step S200 is performed by an acquisition unit (not illustrated) provided in the utterance condition determination device 5. The acquisition unit acquires the voice signal of the first speaker and the voice signal of the second speaker input to the utterance condition determination device 5 from the splitter 8. The acquisition unit outputs the voice signal of the first speaker to the voice section detection unit 511 and the average backchannel frequency estimation unit 514 and also outputs the voice signal of the second speaker to the backchannel section detection unit 512 and the average backchannel frequency estimation unit 514.
Next, the utterance condition determination device 5 performs the average backchannel frequency estimation processing (step S201). Step S201 is performed by the average backchannel frequency estimation unit 514. The average backchannel frequency estimation unit 514 calculates a backchannel frequency IB(m) of the voice signal of the second speaker by using the formulae (1) to (3) and (8) as an example. Afterwards, the average backchannel frequency estimation unit 514 calculates an average JB(m) of the backchannel frequency by using the formula (9) and outputs to the determination unit 515 the calculated average JB(m) of the backchannel frequency as an average backchannel frequency.
After calculating the average backchannel frequency JB(m), the utterance condition determination device 5 performs processing to detect a voice section from the voice signal of the first speaker (step S202) and processing to detect a backchannel section from the voice signal of the second speaker (step S203). Step S202 is performed by the voice section detection unit 511. The voice section detection unit 511 calculates the detection result u1(L) of a voice section in the voice signal of the first speaker by using the formulae (1) and (2). The voice section detection unit 511 outputs the detection result u1(L) of the voice section to the backchannel frequency calculation unit 513. On the other hand, step S203 is performed by the backchannel section detection unit 512. The backchannel section detection unit 512, after detecting a backchannel section by the above-described morphological analysis etc., calculates the detection result u2(L) of the backchannel section by using the formula (3). The backchannel section detection unit 512 outputs the detection result u2(L) of the backchannel section to the backchannel frequency calculation unit 513.
When the processing in steps S202 and S203 ends, the utterance condition determination device 5 next calculates the backchannel frequency of the second speaker based on the voice section of the first speaker and the backchannel section of the second speaker (step S204). Step S204 is performed by the backchannel frequency calculation unit 513. The backchannel frequency calculation unit 513 calculates the backchannel frequency IB(m) of the second speaker in the mth frame by using the formula (8).
Note that in the flowchart in
When the processing in steps S201 to S204 ends, the utterance condition determination device 5 determines the satisfaction level of the second speaker based on the average backchannel frequency JB(m) and the backchannel frequency IB(m) of the second speaker and outputs a determination result to the display unit and the sentence output unit (step S205). Step S205 is performed by the determination unit 515. The determination unit 515 calculates a determination result v(m) by using the formula (10) and outputs the determination result v(m) to the display unit 903 and the sentence output unit 516.
The utterance condition determination device 5 extracts a sentence corresponding to the determination result v(m) and has the display unit 903 display the sentence (step S206). Step S206 is performed by the sentence output unit 516. The sentence output unit 516 extracts a sentence w(m) corresponding to the determination result v(m) by referencing a sentence table (see
Afterwards, the utterance condition determination device 5 decides whether or not to continue the processing (step S207). When the processing is continued (step S207; YES), the utterance condition determination device 5 repeats the processing in step S201 and subsequent steps. When the processing is not continued (step S207; NO), the utterance condition determination device 5 ends the acquisition of the voice signal of the first and second speakers and ends the processing.
The average backchannel frequency estimation unit 514 of the utterance condition determination device 5 according to the present embodiment performs the processing illustrated in
The average backchannel frequency estimation unit 514 performs processing to detect a voice section from a voice signal of the first speaker (step S201a) and processing to detect a backchannel section from a voice signal of the second speaker (step S201b). In the processing in step S201a, the average backchannel frequency estimation unit 514 calculates a detection result u1(L) of a voice section in the voice signal of the first speaker by using the formulae (1) and (2). In the processing in step S201b, the average backchannel frequency estimation unit 514, after detecting a backchannel section by the above-described morphological analysis etc., calculates a detection result u2(L) of the backchannel section by using the formula (3).
Note that in the flowchart in
After the processing in steps S201a and S201b ends, the average backchannel frequency estimation unit 514 next calculates a backchannel frequency IB(m) of the second speaker based on the voice section of the first speaker and the backchannel section of the second speaker (step S201c). In the processing in step S201c, the average backchannel frequency estimation unit 514 calculates a backchannel frequency IB(m) of the second speaker in the mth frame by using the formula (8).
Next, the average backchannel frequency estimation unit 514 calculates an average JB(m) of the backchannel frequency of the second speaker in the current frame by using a backchannel frequency IB(m) of the current frame and an average JB(m−1) of the backchannel frequency of the second speaker in the frame before the current frame (step S201d). In the processing in step S201d, the average backchannel frequency estimation unit 514 calculates an average backchannel frequency JB(m) in the current frame (the mth frame) by using the formula (9).
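The recursive update in step S201d can be sketched as follows. The formula (9) is not reproduced in this passage, so the exponential-smoothing form used here, the coefficient name `smoothing`, its value 0.9, and the initialization with the first frame are all assumptions made for illustration; the sketch only shows how JB(m) can be obtained from IB(m) and JB(m−1) without storing earlier frames.

```python
def running_average(frequencies, smoothing=0.9):
    """Return JB(m) for every frame m from the per-frame frequencies IB(m).

    Assumed update (not taken from the formula (9)):
        JB(m) = smoothing * JB(m-1) + (1 - smoothing) * IB(m)
    The first frame initializes the average with IB(0).
    """
    averages = []
    for m, ib in enumerate(frequencies):
        if m == 0:
            jb = ib
        else:
            jb = smoothing * averages[-1] + (1.0 - smoothing) * ib
        averages.append(jb)
    return averages


# Illustrative per-frame backchannel frequencies IB(m).
jb_values = running_average([0.2, 0.2, 0.4])
```

Because only the previous average is needed, this kind of update suits processing that runs while the voice call is still in progress.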
Afterwards, the average backchannel frequency estimation unit 514 outputs the average JB(m) of the backchannel frequency calculated in step S201d to the determination unit 515 as an average backchannel frequency and stores the average JB(m) of the backchannel frequency (step S201e), and the average backchannel frequency estimation unit 514 ends the average backchannel frequency estimation processing.
As described above, also in Embodiment 2, the satisfaction level of the second speaker is determined on the basis of the average backchannel frequency JB(m) and the backchannel frequency IB(m) calculated from the voice signal of the second speaker. Therefore, similarly to Embodiment 1, it is possible to determine whether or not the second speaker is satisfied in consideration of an average backchannel frequency that is unique to the second speaker and it is therefore also possible to improve accuracy in determination of emotional conditions of a speaker based on a way of giving backchannel feedback.
Note that the utterance condition determination device 5 according to the present embodiment may be applied not only to the voice call system 110 that uses the IP network 4 as illustrated in
In addition, the average backchannel frequency estimation unit 514 in the utterance condition determination device 5 illustrated in
Moreover, the utterance condition determination device 5 according to the present embodiment determines the satisfaction level of the second speaker based on the backchannel frequency IB(m) calculated by using the formulae (1) to (3) and (8) and the average backchannel frequency JB(m) calculated by using the backchannel frequency IB(m). However, the configuration of the utterance condition determination device 5 in the response evaluation device 9 illustrated in
The first phone set 2 includes a microphone 201, a voice call processor 202, and a receiver 203. The second phone set 3 is a phone set that can be connected to the first phone set 2 via the IP network 4. The second phone set 3 includes a microphone 301, a voice call processor 302, and a receiver 303.
The splitter 8 splits the voice signal of the first speaker transmitted from the voice call processor 202 of the first phone set 2 to the second phone set 3 and the voice signal of the second speaker transmitted from the second phone set 3 to the voice call processor 202 of the first phone set 2 and inputs the split signals to the server 10. The splitter 8 is provided on a transmission path between the first phone set 2 and the IP network 4.
The server 10 is a device that makes the voice signals of the first and second speakers that are input via the splitter 8 into a voice file, stores the file, and determines the satisfaction level of the second speaker (the opposing speaker of the first speaker) when necessary. The server 10 includes a voice processor unit 1001, a storage unit 1002, and the utterance condition determination device 5. The voice processor unit 1001 performs processing of generating a voice file from the voice signals of the first and second speakers. The storage unit 1002 stores the generated voice file of the first and second speakers. The utterance condition determination device 5 determines the satisfaction level of the second speaker by reading out the voice file of the first and second speakers.
The reproduction device 11 is a device to read out and reproduce a voice file of the first and second speakers stored in the storage unit 1002 of the server 10 and to display the determination result of the utterance condition determination device 5.
As illustrated in
The receiver unit 1001a receives voice signals of the first and second speakers split by the splitter 8. The decoder 1001b decodes the received voice signals of the first and second speakers to analog signals. The voice filing processor unit 1001c generates electronic files (voice files) of the voice signals of the first and second speakers decoded in the decoder 1001b, respectively, associates the two voice files with each other, and stores the files in the storage unit 1002.
The storage unit 1002 stores the voice files of the first and second speakers associated with each other for each voice call. The voice files stored in the storage unit 1002 are transferred to the reproduction device 11 in response to a read request from the reproduction device 11. In the following descriptions, the voice files of the first and second speakers may be referred to as voice signals.
The utterance condition determination device 5 reads out the voice files of the first and second speakers stored in the storage unit 1002, determines the utterance condition of the second speaker, i.e., whether or not the second speaker is satisfied, and outputs the determination to the reproduction device 11. As illustrated in
The voice section detection unit 521 detects a voice section in voice signals of the first speaker. Similarly to the voice section detection unit 501 of the utterance condition determination device 5 according to Embodiment 1, the voice section detection unit 521 detects a section in which the power obtained from a voice signal is at or above a certain threshold TH from among the voice signals of the first speaker as a voice section.
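The threshold test described above can be sketched as follows. The formulae (1) and (2) are not reproduced in this passage, so the way the power is computed here (mean square of the samples in a section, expressed in dB), the function name, and the parameter values are assumptions for illustration only.

```python
import math


def detect_voice_sections(samples, section_len, threshold_db):
    """Return u1(L): 1 if the power of section L is at or above the threshold, else 0."""
    flags = []
    for start in range(0, len(samples) - section_len + 1, section_len):
        section = samples[start:start + section_len]
        mean_square = sum(x * x for x in section) / section_len
        # Power in dB; the small floor avoids log10(0) for silent sections.
        power_db = 10.0 * math.log10(max(mean_square, 1e-12))
        flags.append(1 if power_db >= threshold_db else 0)
    return flags


# Illustrative input: one silent section followed by one section containing a tone.
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
u1 = detect_voice_sections(silence + tone, section_len=160, threshold_db=-20.0)
```

In this example the silent section falls below the assumed threshold and the tone section exceeds it, so only the second section is flagged as a voice section.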
The backchannel section detection unit 522 detects a backchannel section in voice signals of the second speaker. Similarly to the backchannel section detection unit 502 of the utterance condition determination device 5 according to Embodiment 1, the backchannel section detection unit 522 performs morphological analysis of the voice signals of the second speaker and detects a section that matches any piece of backchannel data registered in a backchannel dictionary as a backchannel section.
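The dictionary matching described above can be sketched as follows. The sketch assumes that speech recognition and morphological analysis have already produced tokens with start and end times in milliseconds; that token format, the dictionary entries, and the function name are hypothetical and are not taken from the embodiment.

```python
# Illustrative entries only; the registered backchannel data is not listed in this passage.
BACKCHANNEL_DICTIONARY = {"yes", "uh-huh", "i see", "right"}


def detect_backchannel_sections(tokens, num_sections, section_ms):
    """Return u2(L): 1 for every section overlapped by a token found in the dictionary.

    tokens: list of (start_ms, end_ms, text) with end_ms exclusive (assumed format).
    """
    u2 = [0] * num_sections
    for start_ms, end_ms, text in tokens:
        if end_ms <= start_ms or text.lower() not in BACKCHANNEL_DICTIONARY:
            continue
        first = start_ms // section_ms
        last = min((end_ms - 1) // section_ms, num_sections - 1)
        for section in range(first, last + 1):
            u2[section] = 1
    return u2


tokens = [(0, 300, "Yes"), (500, 900, "hello"), (1000, 1200, "I see")]
u2 = detect_backchannel_sections(tokens, num_sections=8, section_ms=200)
```

Integer millisecond timestamps are used so that the section index is computed without floating-point rounding.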
The backchannel frequency calculation unit 523 calculates the number of times of backchannel feedbacks of the second speaker per speech duration of the first speaker as a backchannel frequency of the second speaker. The backchannel frequency calculation unit 523 sets a certain unit of time to be one frame and calculates the backchannel frequency based on the speech duration calculated from the voice section of the first speaker within a frame and the number of times of backchannel feedbacks calculated from the backchannel section of the second speaker. Note that the backchannel frequency calculation unit 523 in the utterance condition determination device 5 according to the present embodiment calculates a backchannel frequency IC(m) provided from the following formula (11) by using the detection result of the voice section and the detection result of the backchannel section within the mth frame.
In the formula (11), similarly to the formula (4), startj and endj are the start time and the end time, respectively, of a section in the voice section in which the detection result u1(L) is 1. In other words, the start time startj is a point in time at which the detection result u1(n) for each sample rises from 0 to 1, and the end time endj is a point in time at which the detection result u1(n) for each sample falls from 1 to 0. Furthermore, cntC(m) is the number of times of the backchannel feedbacks of the second speaker in a time period between the start time startj and the end time endj of the voice section of the first speaker and a time period within a certain period of time t immediately after the end time endj in the mth frame. The number of times of the backchannel feedbacks cntC(m) is calculated from the number of times that the detection result u2(n) of the backchannel section rises from 0 to 1 in the above time periods.
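The counting described above can be sketched as follows. The formula (11) is not reproduced in this passage, so the ratio used here (count of backchannel onsets divided by the total voice-section length of the first speaker within one frame) follows the verbal description only and should be read as an assumption; the function names and the parameter `tail` (the period t, expressed in samples) are hypothetical.

```python
def find_sections(u):
    """Return (start, end) index pairs where a 0/1 sequence is 1; end is exclusive."""
    sections, start = [], None
    for i, value in enumerate(u):
        if value == 1 and start is None:
            start = i                      # rise from 0 to 1
        elif value == 0 and start is not None:
            sections.append((start, i))    # fall from 1 to 0
            start = None
    if start is not None:                  # section still open at the end of the frame
        sections.append((start, len(u)))
    return sections


def backchannel_frequency(u1, u2, tail):
    """Backchannel count per unit of first-speaker speech duration within one frame."""
    voice = find_sections(u1)
    duration = sum(end - start for start, end in voice)
    if duration == 0:
        return 0.0
    # Onsets of backchannel sections: indices where u2 rises from 0 to 1.
    # A section already active at index 0 is counted as an onset (assumption).
    onsets = [start for start, _ in find_sections(u2)]
    count = sum(1 for onset in onsets
                if any(start <= onset < end + tail for start, end in voice))
    return count / duration


u1 = [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
u2 = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
ic_m = backchannel_frequency(u1, u2, tail=2)
```

Each onset is counted at most once even when the extended windows of two voice sections overlap.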
The average backchannel frequency estimation unit 524 estimates an average backchannel frequency of the second speaker. The average backchannel frequency estimation unit 524 according to the present embodiment calculates an average JC of the backchannel frequency provided from the following formula (12) as an estimated value of the average backchannel frequency of the second speaker.
In the formula (12), M is the frame number of the last (end time) frame in the voice signal of the second speaker. In other words, the average backchannel frequency JC is an average of the backchannel frequencies from the voice start time to the end time of the second speaker in units of frames.
The determination unit 525 determines the satisfaction level of the second speaker, i.e., whether or not the second speaker is satisfied, based on the backchannel frequency IC(m) calculated in the backchannel frequency calculation unit 523 and the average backchannel frequency JC calculated (estimated) in the average backchannel frequency estimation unit 524. The determination unit 525 outputs a determination result v(m) based on the criterion formula provided from the following formula (13).
In the formula (13), each of β1 and β2 is a correction coefficient, and β1=0.2 and β2=1.5 are given.
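The estimation and determination steps above can be sketched as follows. The formulae (12) and (13) are not reproduced in this passage. The average is taken here as the plain mean of IC(m) over all frames, following the verbal description; the three-level comparison against β1·JC and β2·JC is an assumption about how the two correction coefficients are used, not the criterion of the embodiment.

```python
BETA1 = 0.2  # correction coefficients given in the text
BETA2 = 1.5


def average_backchannel_frequency(ic):
    """JC: mean of the per-frame backchannel frequencies IC(m) over all frames."""
    return sum(ic) / len(ic)


def determine(ic_m, jc, beta1=BETA1, beta2=BETA2):
    """Assumed rule: 0 below beta1*JC, 1 between the two bounds, 2 at or above beta2*JC."""
    if ic_m < beta1 * jc:
        return 0
    if ic_m < beta2 * jc:
        return 1
    return 2


# Illustrative per-frame frequencies IC(m).
ic = [0.10, 0.02, 0.30, 0.18]
jc = average_backchannel_frequency(ic)
v = [determine(ic_m, jc) for ic_m in ic]
```

Scaling both bounds by JC is what makes the decision relative to the habitual backchannel frequency of the individual speaker rather than to a fixed threshold.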
The overall satisfaction level calculation unit 526 calculates the overall satisfaction level V of the second speaker in a voice call between the first speaker and the second speaker. The overall satisfaction level calculation unit 526 calculates the overall satisfaction level V by using the following formula (14).
In the formula (14), c0, c1, and c2 are the number of frames in which v(m)=0, the number of frames in which v(m)=1, and the number of frames in which v(m)=2, respectively.
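A minimal sketch of the overall satisfaction level follows. The formula (14) is not reproduced in this passage, so the weighting used here, which maps the mean of v(m) from the range 0 to 2 onto the range 0 to 100, is an assumption chosen only because it is bounded by 0 and 100 and grows with c2; the function name is hypothetical.

```python
def overall_satisfaction(v):
    """Assumed form: V = 100 * (c1 + 2*c2) / (2 * (c0 + c1 + c2))."""
    c0 = v.count(0)
    c1 = v.count(1)
    c2 = v.count(2)
    total = c0 + c1 + c2
    if total == 0:
        return 0.0
    return 100.0 * (c1 + 2 * c2) / (2 * total)


# Illustrative per-frame determination results v(m).
V = overall_satisfaction([1, 0, 2, 1])
```

With one frame at level 0, two at level 1, and one at level 2, the sketch places V at the midpoint of the scale.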
The sentence output unit 527 reads out a sentence corresponding to the overall satisfaction level V calculated in the overall satisfaction level calculation unit 526 from the storage unit 528 and outputs the sentence to the reproduction device 11.
When detection of a voice section and detection of a backchannel section are performed in the utterance condition determination device 5 according to the present embodiment, for example, processing for every sample n of the voice signal, sectional processing for every time t1, and frame processing for every time t2 are performed as illustrated in
The sentence output unit 527 in the utterance condition determination device 5 according to the present embodiment reads out a sentence corresponding to the overall satisfaction level V from the storage unit 528 and outputs the sentence to the reproduction device 11 as described above. The overall satisfaction level V is calculated by using the formula (14) and takes a value from 0 to 100. The overall satisfaction level V also becomes larger as the value of c2, i.e., the number of frames in which v(m)=2, becomes larger. As a result, the overall satisfaction level V approaches 100 as the satisfaction level of the second speaker becomes higher. Therefore, from among the sentences stored in the storage unit 528, a sentence indicating that the second speaker feels dissatisfied is read out when the overall satisfaction level V is low, and a sentence indicating that the second speaker is satisfied is read out when the overall satisfaction level V is high. In the storage unit 528, five types of sentences w(m) that correspond to the levels of the overall satisfaction level V are stored as illustrated in
The operation unit 1101 is an input device such as a keyboard device and a mouse device that an operator of the reproduction device 11 operates and is used for an operation to select a voice call record to be reproduced and other operations.
The data acquisition unit 1102 acquires a voice file of the first and second speakers corresponding to the voice call record selected by the operation of the operation unit 1101 and also acquires a sentence etc. corresponding to the determination result of the satisfaction level or the overall satisfaction level in the utterance condition determination device 5 in relation to the acquired voice file. The data acquisition unit 1102 acquires a voice file of the first and second speakers from the storage unit 1002 of the server 10. The data acquisition unit 1102 also acquires the determination results etc. from the determination unit 525, the overall satisfaction level calculation unit 526, and the sentence output unit 527 of the utterance condition determination device 5.
The voice reproduction unit 1103 performs processing to convert the voice files (electronic files) of the first and second speakers acquired in the data acquisition unit 1102 into analog signals that can be output from the speaker 1104.
The display unit 1105 displays the sentence corresponding to the determination result of the satisfaction level or the overall satisfaction level V acquired in the data acquisition unit 1102.
The utterance condition determination device 5 according to the present embodiment performs the processing provided in
The utterance condition determination device 5 reads out a voice file of the first and second speakers from the storage unit 1002 of the server 10 (step S300). Step S300 is performed by an acquisition unit (not illustrated) provided in the utterance condition determination device 5. The acquisition unit acquires voice files of the first and second speakers that correspond to a voice call record requested by the reproduction device 11. The acquisition unit outputs a voice file of the first speaker to the voice section detection unit 521 and the average backchannel frequency estimation unit 524 and outputs a voice file of the second speaker to the backchannel section detection unit 522 and the average backchannel frequency estimation unit 524.
Next, the utterance condition determination device 5 performs the average backchannel frequency estimation processing (step S301). Step S301 is performed by the average backchannel frequency estimation unit 524. The average backchannel frequency estimation unit 524 calculates a backchannel frequency IC(m) of the second speaker by using the formulae (1) to (3) and (11) as an example. Afterwards, the average backchannel frequency estimation unit 524 calculates an average JC of the backchannel frequency by using the formula (12) and outputs to the determination unit 525 the calculated average JC of the backchannel frequency as an average backchannel frequency.
After calculating the average backchannel frequency JC, the utterance condition determination device 5 performs processing to detect a voice section from the voice signal of the first speaker (step S302) and processing to detect a backchannel section from the voice signal of the second speaker (step S303). Step S302 is performed by the voice section detection unit 521. The voice section detection unit 521 calculates the detection result u1(L) of a voice section in the voice signal of the first speaker by using the formulae (1) and (2). The voice section detection unit 521 outputs the detection result u1(L) of the voice section to the backchannel frequency calculation unit 523. On the other hand, step S303 is performed by the backchannel section detection unit 522. The backchannel section detection unit 522, after detecting a backchannel section by the above-described morphological analysis etc., calculates the detection result u2(L) of the backchannel section by using the formula (3). The backchannel section detection unit 522 outputs the detection result u2(L) of the backchannel section to the backchannel frequency calculation unit 523.
Note that in the flowchart in
When the processing in steps S302 and S303 ends, the utterance condition determination device 5 next calculates the backchannel frequency of the second speaker based on the voice section of the first speaker and the backchannel section of the second speaker (step S304). Step S304 is performed by the backchannel frequency calculation unit 523. The backchannel frequency calculation unit 523 calculates the backchannel frequency IC(m) of the second speaker in the mth frame by using the formula (11).
The utterance condition determination device 5 next determines the satisfaction level of the second speaker in the frame m based on the average backchannel frequency JC and the backchannel frequency IC(m) of the second speaker and outputs a determination result to the reproduction device 11 (step S305). Step S305 is performed by the determination unit 525. The determination unit 525 calculates a determination result v(m) by using the formula (13) and outputs the determination result v(m) to the reproduction device 11 and the overall satisfaction level calculation unit 526.
The utterance condition determination device 5 calculates the overall satisfaction level V by using the value of the determination result v(m) of the satisfaction level in each frame and outputs the overall satisfaction level V to the reproduction device 11 and the sentence output unit 527 (step S306). Step S306 is performed by the overall satisfaction level calculation unit 526. The overall satisfaction level calculation unit 526 calculates the overall satisfaction level V of the second speaker by using the formula (14).
The utterance condition determination device 5 reads out a sentence w(m) corresponding to the overall satisfaction level V from the storage unit 528 and outputs the sentence to the reproduction device 11 (step S307). Step S307 is performed by the sentence output unit 527. The sentence output unit 527 extracts a sentence w(m) corresponding to the overall satisfaction level V by referencing a sentence table (see
Afterwards, the utterance condition determination device 5 decides whether or not to continue the processing (step S308). When the processing is continued (step S308; YES), the utterance condition determination device 5 repeats the processing in step S302 and subsequent steps. When the processing is not continued (step S308; NO), the utterance condition determination device 5 ends the processing.
The average backchannel frequency estimation unit 524 of the utterance condition determination device 5 according to the present embodiment performs the processing illustrated in
The average backchannel frequency estimation unit 524 performs processing to detect a voice section from a voice signal of the first speaker (step S301a) and processing to detect a backchannel section from a voice signal of the second speaker (step S301b). In the processing in step S301a, the average backchannel frequency estimation unit 524 calculates a detection result u1(L) of a voice section in the voice signal of the first speaker by using the formulae (1) and (2). In the processing in step S301b, the average backchannel frequency estimation unit 524, after detecting a backchannel section by the above-described morphological analysis etc., calculates a detection result u2(L) of the backchannel section by using the formula (3).
Note that in the flowchart in
The average backchannel frequency estimation unit 524, next, calculates a backchannel frequency IC(m) of the second speaker based on the voice section of the first speaker and the backchannel section of the second speaker (step S301c). In the processing in step S301c, the average backchannel frequency estimation unit 524 calculates a backchannel frequency IC(m) of the second speaker in the mth frame by using the formula (11).
Next, the average backchannel frequency estimation unit 524 checks whether or not the backchannel frequency from the voice start time of the second speaker to the end time is calculated (step S301d). When the backchannel frequency from the voice start time to the end time is not calculated (step S301d; NO), the average backchannel frequency estimation unit 524 repeats the processing in steps S301a to S301c. When the backchannel frequency from the voice start time to the end time is calculated (step S301d; YES), the average backchannel frequency estimation unit 524, next, calculates an average JC of the backchannel frequency of the second speaker from the backchannel frequency from the voice start time to the end time (step S301e). In the processing in step S301e, the average backchannel frequency estimation unit 524 calculates an average JC of the backchannel frequency by using the formula (12). After calculating the average JC of the backchannel frequency, the average backchannel frequency estimation unit 524 outputs the calculated average JC of the backchannel frequency to the determination unit 525 as an average backchannel frequency and ends the average backchannel frequency estimation processing.
As described above, also in Embodiment 3, the satisfaction level of the second speaker is determined on the basis of the average backchannel frequency JC and the backchannel frequency IC(m) calculated from the voice signal of the second speaker. Therefore, similarly to Embodiment 1, it is possible to determine whether or not the second speaker is satisfied in consideration of an average backchannel frequency that is unique to the second speaker and it is therefore also possible to improve accuracy in determination of emotional conditions of a speaker based on a way of giving backchannel feedback.
Moreover, in Embodiment 3, because a voice call of the first and second speakers by using the first and second phone sets 2 and 3 is stored in the storage unit 1002 of the server 10 as a voice file (an electronic file), the voice file can be reproduced and listened to after the voice call ends. In Embodiment 3, the overall satisfaction level V of the second speaker is calculated during voice file reproduction, and a sentence corresponding to the overall satisfaction level V is output to the reproduction device 11. It is therefore possible to check the overall satisfaction level of the voice call and a sentence corresponding to the overall satisfaction level, in addition to the satisfaction level of the second speaker in each frame (section), in the display unit 1105 of the reproduction device 11 while the voice file is reproduced after the voice call ends.
Note that the server 10 in the voice call system provided as an example in the present embodiment may be installed in any place that is not limited to a facility in which the first phone set 2 is installed and may be connected to the first phone set 2 or the reproduction device 11 via a communication network such as the Internet.
The first AD converter unit 1201 converts a voice signal collected by the first microphone 13A from an analog signal to a digital signal. The second AD converter unit 1202 converts a voice signal collected by the second microphone 13B from an analog signal to a digital signal. In the following descriptions, the voice signal collected by the first microphone 13A is a voice signal of the first speaker and the voice signal collected by the second microphone 13B is a voice signal of the second speaker.
The voice filing processor unit 1203 generates an electronic file (a voice file) of the voice signal of the first speaker converted by the first AD converter unit 1201 and the voice signal of the second speaker converted by the second AD converter unit 1202, associates these voice files with each other, and stores the files in the storage device 1206.
The utterance condition determination device 5 determines the utterance condition (the satisfaction level) of the second speaker by using, for example, the voice signal of the first speaker converted by the first AD converter unit 1201 and the voice signal of the second speaker converted by the second AD converter unit 1202. The utterance condition determination device 5 also associates the determination result with a voice file generated by the voice filing processor unit 1203 and stores the determination result in the storage device 1206.
The operation unit 1204 is a button switch etc. used for operating the recording device 12. For example, when an operator of the recording device 12 starts recording by operating the operation unit 1204, a start command of prescribed processing is input from the operation unit 1204 to each of the voice filing processor unit 1203 and the utterance condition determination device 5.
The display unit 1205 displays the determination result (the satisfaction level of the second speaker) etc. of the utterance condition determination device 5.
The storage device 1206 is a device to store voice files of the first and second speakers, the satisfaction level of the second speaker and so forth. Note that the storage device 1206 may be constructed from a portable recording medium such as a memory card and a recording medium drive unit that can read data from and write data in the recording medium.
The voice section detection unit 531 detects a voice section in the voice signals of the first speaker (voice signals of a speaker collected by the first microphone 13A). Similarly to the voice section detection unit 501 of the utterance condition determination device 5 according to Embodiment 1, the voice section detection unit 531 detects a section in which the power obtained from a voice signal is at or above a certain threshold TH from among the voice signals of the first speaker as a voice section.
The backchannel section detection unit 532 detects a backchannel section in voice signals of the second speaker (voice signals of a speaker collected by the second microphone 13B). Similarly to the backchannel section detection unit 502 of the utterance condition determination device 5 according to Embodiment 1, the backchannel section detection unit 532 performs morphological analysis of the voice signals of the second speaker and detects a section that matches any piece of backchannel data registered in a backchannel dictionary as a backchannel section.
The feature amount calculation unit 533 calculates a vowel type h(L) and an amount of pitch shift df(L) based on the voice signals of the second speaker and the backchannel section detected by the backchannel section detection unit 532. The vowel type h(L) is calculated, for example, by a method described in Non-Patent Document 1. The amount of pitch shift df(L) is calculated, for example, by the following formula (15).
df(L)=f(L)−f(L−1) (15)
In the formula (15), f(L) is a pitch within a section L and can be calculated by a known method such as pitch detection by autocorrelation or cepstrum analysis of the section.
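The formula (15) can be illustrated with a simple autocorrelation pitch estimate. This is a minimal sketch of one known method, not the procedure of the embodiment; the search range of 80 Hz to 400 Hz, the sampling rate, the section length, and the function names are assumptions chosen for illustration.

```python
import math


def pitch_autocorr(section, sample_rate, f_min=80.0, f_max=400.0):
    """Estimate the pitch f(L) in Hz from the lag that maximizes the autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(section[n] * section[n + lag] for n in range(len(section) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


def pitch_shift(sections, sample_rate):
    """df(L) = f(L) - f(L-1) for every section after the first one."""
    pitches = [pitch_autocorr(s, sample_rate) for s in sections]
    return [pitches[i] - pitches[i - 1] for i in range(1, len(pitches))]


# Illustrative input: a 100 Hz tone followed by a 200 Hz tone, sampled at 8 kHz.
RATE = 8000
low = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(800)]
high = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(800)]
df = pitch_shift([low, high], RATE)
```

A positive df(L) indicates a rising pitch between consecutive sections and a negative value a falling pitch. Each section must be longer than the largest lag searched, which is 100 samples under the assumed settings.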
The backchannel frequency calculation unit 534 sorts backchannel feedbacks into two conditions, affirmative and negative, based on the vowel type h(L) and the amount of pitch shift df(L) and calculates the backchannel frequency ID(m) provided by the following formula (16).
In the formula (16), startj and endj are the start time and the end time, respectively, of a voice section of the first speaker explained in Embodiment 1. In the formula (16), cnt0(m) and cnt1(m) are the number of times of backchannel feedbacks calculated by using backchannel sections in an affirmative condition and the number of times of backchannel feedbacks calculated by using backchannel sections in a negative condition, respectively. In addition, in the formula (16), μ0 and μ1 are weighting coefficients and μ0=0.8 and μ1=1.2 are given. Note backchannel feedbacks are sorted into affirmative or negative by referencing backchannel intension determination information stored in the first storage unit 535.
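The weighting described above can be sketched as follows. The formula (16) is not reproduced in this passage, so the form used here (the weighted sum of the affirmative and negative counts divided by the total speech duration of the first speaker in the frame) is an assumption based on the verbal description; the function name and the input values are hypothetical, and the sorting of backchannel feedbacks into the two conditions is taken as already done.

```python
MU0 = 0.8  # weighting coefficient for affirmative backchannel feedbacks (from the text)
MU1 = 1.2  # weighting coefficient for negative backchannel feedbacks (from the text)


def weighted_backchannel_frequency(cnt0, cnt1, voice_sections, mu0=MU0, mu1=MU1):
    """Assumed form: ID(m) = (mu0*cnt0 + mu1*cnt1) / sum_j (end_j - start_j)."""
    duration = sum(end - start for start, end in voice_sections)
    if duration <= 0:
        return 0.0
    return (mu0 * cnt0 + mu1 * cnt1) / duration


# Illustrative frame: 3 affirmative and 2 negative backchannel feedbacks
# during two voice sections of the first speaker (times in seconds).
id_m = weighted_backchannel_frequency(3, 2, [(0.0, 6.0), (8.0, 12.0)])
```

Because μ1 is larger than μ0, a negative backchannel feedback contributes more to ID(m) than an affirmative one under this assumed form.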
The average backchannel frequency estimation unit 536 estimates an average backchannel frequency of the second speaker. The average backchannel frequency estimation unit 536 according to the present embodiment calculates a value JD corresponding to a speech rate r in a time period in which a prescribed number of frames have elapsed from the voice start time of the second speaker as an estimation value of the average backchannel frequency of the second speaker. The speech rate r is calculated by using a known method (e.g., a method described in Patent Document 4). After calculating the speech rate r, the average backchannel frequency estimation unit 536 calculates an average backchannel frequency JD of the second speaker by referencing a correspondence table of the speech rate r and the average backchannel frequency JD stored in the second storage unit 537. The average backchannel frequency estimation unit 536 calculates the average backchannel frequency JD every time a change is made to speaker information info2(n) of the second speaker. The speaker information info2(n) is input from the operation unit 1204 as an example.
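As an example, the reference to the correspondence table of the speech rate r and the average backchannel frequency JD may be sketched as a range lookup. The table entries and the unit of r below are hypothetical placeholders; the actual values stored in the second storage unit 537 are not given in this text.

```python
from bisect import bisect_right

# HYPOTHETICAL table: (lower bound of speech rate r, average backchannel frequency JD).
# A higher speech rate maps to a higher JD, in line with the tendency described
# for fast speakers; the numbers themselves are placeholders.
SPEECH_RATE_TABLE = [(0.0, 0.10), (4.0, 0.15), (6.0, 0.20), (8.0, 0.25)]

def lookup_jd(r, table=SPEECH_RATE_TABLE):
    """Return the JD of the row whose range contains the speech rate r."""
    bounds = [lower for lower, _ in table]
    idx = bisect_right(bounds, r) - 1
    return table[max(idx, 0)][1]  # clamp rates below the first bound to the first row
```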
The determination unit 538 determines the satisfaction level of the second speaker, i.e., whether or not the second speaker is satisfied, based on the backchannel frequency ID(m) calculated in the backchannel frequency calculation unit 534 and the average backchannel frequency JD calculated (estimated) in the average backchannel frequency estimation unit 536. The determination unit 538 outputs a determination result v(m) based on the criterion formula provided in the following formula (17).
In the formula (17), β1 and β2 are correction coefficients and β1=0.2 and β2=1.5 are provided as an example.
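The criterion formula (17) is not reproduced in this text. The following sketch assumes a three-level comparison of ID(m) against the thresholds β1·JD and β2·JD, which yields the values 0, 1, and 2 that the determination result v(m) takes elsewhere in this description. The comparison form, the handling of the boundary values, and the function name are assumptions for illustration only.

```python
def determine_level(freq, avg_freq, beta1=0.2, beta2=1.5):
    """ASSUMED three-level criterion; the actual formula (17) is not shown in the text.

    freq     -- backchannel frequency of the current frame, e.g. ID(m)
    avg_freq -- estimated average backchannel frequency, e.g. JD
    """
    low = beta1 * avg_freq
    high = beta2 * avg_freq
    if freq < low:
        return 0
    if freq < high:
        return 1
    return 2
```

With β1=0.2 and β2=1.5, a frame is placed in the middle level as long as its backchannel frequency stays between 20% and 150% of the speaker's estimated average.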
The response score output unit 539 calculates a response score v′(m) in each frame by using the following formula (18).
The response score output unit 539 outputs the calculated response score v′(m) to the display unit 1205 and has the storage device 1206 store the response score in association with the voice file generated in the voice filing processor unit 1203.
While Embodiment 1 through Embodiment 3 calculate the average backchannel frequency based on the backchannel frequency, the present embodiment calculates the average backchannel frequency JD based on the speech rate r as described above.
A speaker of a high speech rate (i.e., a fast speaker) tends to have shorter intervals of backchannel feedbacks and therefore makes backchannel feedbacks more frequently compared with a speaker of a low speech rate. For that reason, as in the correspondence table provided in
The utterance condition determination device 5 according to the present embodiment performs the processing provided in
The utterance condition determination device 5 starts monitoring voice signals of the first and second speakers (step S400). Step S400 is performed by a monitoring unit (not illustrated) provided in the utterance condition determination device 5. The monitoring unit monitors the voice signals of the first speaker and the voice signals of the second speaker transmitted from the first AD converter 1201 and the second AD converter 1202, respectively, to the voice filing processor unit 1203. The monitoring unit outputs the voice signals of the first speaker to the voice section detection unit 531 and the average backchannel frequency estimation unit 536. The monitoring unit also outputs the voice signals of the second speaker to the backchannel section detection unit 532, the feature amount calculation unit 533, and the average backchannel frequency estimation unit 536.
The utterance condition determination device 5, next, performs the average backchannel frequency estimation processing (step S401). Step S401 is performed by the average backchannel frequency estimation unit 536. The average backchannel frequency estimation unit 536 calculates a speech rate r of the second speaker based on the voice signals for two frames (60 seconds) from the voice start time of the second speaker as an example. The speech rate r is calculated by any known calculation method (e.g., a method described in Patent Document 4). Afterwards, the average backchannel frequency estimation unit 536 references the correspondence table stored in the second storage unit 537 and outputs the average backchannel frequency JD corresponding to the speech rate r to the determination unit 538 as an average backchannel frequency of the second speaker.
After calculating the average backchannel frequency JD, the utterance condition determination device 5, next, performs processing to detect a voice section from the voice file of the first speaker (step S402) and processing to detect a backchannel section from the voice file of the second speaker (step S403). Step S402 is performed by the voice section detection unit 531. The voice section detection unit 531 calculates a detection result u1(L) of a voice section in the voice signal of the first speaker by using the formulae (1) and (2) and outputs the detection result u1(L) of the voice section to the backchannel frequency calculation unit 534. Step S403 is performed by the backchannel section detection unit 532. The backchannel section detection unit 532, after detecting a backchannel section by the above-described morphological analysis etc., calculates the detection result u2(L) of the backchannel section by using the formula (3) and outputs the detection result u2(L) of the backchannel section to the backchannel frequency calculation unit 534.
After the detection of a backchannel section, the utterance condition determination device 5, next, calculates a feature amount of the backchannel section in the voice file of the second speaker (step S404). Step S404 is performed by the feature amount calculation unit 533. The feature amount calculation unit 533 calculates the vowel type h(L) and the amount of pitch shift df(L) as a feature amount of the backchannel section. The vowel type h(L) is calculated by any known calculation method (e.g., a method described in Non-Patent Document 1) by using the detection result u2(L) of the backchannel section of the backchannel section detection unit 532. The amount of pitch shift df(L) is calculated by using the formula (15). The feature amount calculation unit 533 outputs the calculated feature amount, i.e., the vowel type h(L) and the amount of pitch shift df(L), to the backchannel frequency calculation unit 534.
Note that in the flowchart in
After the processing in steps S402 through S404, the utterance condition determination device 5, next, calculates a backchannel frequency of the second speaker based on the voice section of the first speaker and the backchannel section and the feature amount of the second speaker (step S405). Step S405 is performed by the backchannel frequency calculation unit 534. In step S405, the backchannel frequency calculation unit 534 obtains the number of times of affirmative backchannel feedbacks cnt0(m) and the number of times of negative backchannel feedbacks cnt1(m) based on the backchannel intention determination information in the first storage unit 535 and the feature amount calculated in step S404. Afterwards, the backchannel frequency calculation unit 534 calculates the backchannel frequency ID(m) of the second speaker in the mth frame by using the formula (16) and outputs the backchannel frequency ID(m) to the determination unit 538.
Next, the utterance condition determination device 5 determines the satisfaction level of the second speaker based on the average backchannel frequency JD and the backchannel frequency ID(m) of the second speaker (step S406). Step S406 is performed by the determination unit 538. The determination unit 538 calculates the determination result v(m) by using the formula (17). The determination unit 538 outputs the determination result v(m) to the response score output unit 539 as the satisfaction level of the second speaker.
Next, the utterance condition determination device 5 calculates the response score of the first speaker based on the determination result of the satisfaction level of the second speaker and outputs the calculated response score (step S407). Step S407 is performed by the response score output unit 539. The response score output unit 539 calculates a response score v′(m) by using the determination result v(m) of the determination unit 538 and the formula (18). The response score output unit 539 has the display unit 1205 display the calculated response score v′(m) and also has the storage device 1206 store the response score.
After outputting the response score v′(m), the utterance condition determination device 5 determines whether or not to continue the processing (step S408). When the processing is not continued (step S408; NO), the utterance condition determination device 5 ends the monitoring of the voice signals of the first and second speakers and ends the processing.
On the other hand, when the processing is continued (step S408; YES), the utterance condition determination device 5, next, checks whether or not a change has been made to speaker information of the second speaker (step S409). When no change has been made to the speaker information info2(n) (step S409; NO), the utterance condition determination device 5 repeats the processing in step S402 and subsequent steps. When a change has been made to the speaker information info2(n) (step S409; YES), the utterance condition determination device 5 brings the processing back to step S401, calculates the average backchannel frequency JD for the changed second speaker and performs the processing in step S402 and subsequent steps.
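The control flow of steps S401 through S409 may be sketched as a loop that re-estimates the average backchannel frequency only when the speaker information of the second speaker changes. The function names and the two callbacks are hypothetical, and the sketch simplifies step S401, which in the description uses the first two frames of the voice signals of the second speaker.

```python
def run_monitoring(frames, estimate_average, process_frame):
    """Sketch of the loop over steps S401 to S409 (all names are hypothetical).

    frames           -- iterable of (speaker_info, frame_data) pairs
    estimate_average -- callback standing in for step S401; returns JD
    process_frame    -- callback standing in for steps S402 to S407
    """
    results = []
    current_info = None
    avg = None
    for info, frame in frames:
        if info != current_info:
            # First pass, or step S409; YES: return to step S401.
            avg = estimate_average(info)
            current_info = info
        # Step S409; NO: repeat step S402 and subsequent steps with the same JD.
        results.append(process_frame(frame, avg))
    return results
```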
As described above, in Embodiment 4, the satisfaction level of the second speaker can be indirectly obtained by calculating the response score v′(m) of the first speaker based on the average backchannel frequency JD and the backchannel frequency ID(m) calculated from the voice signals of the second speaker.
In addition, because the average backchannel frequency JD is calculated in accordance with the speech rate r of the second speaker in Embodiment 4, the average backchannel frequency can be calculated appropriately even though the second speaker is, for example, a speaker who infrequently gives backchannel feedback by nature.
Moreover, in Embodiment 4, backchannel feedbacks are sorted into affirmative backchannel feedbacks and negative backchannel feedbacks in accordance with the vowel type h(L) and the amount of pitch shift df(L) calculated in the feature amount calculation unit 533 and the backchannel frequency ID(m) is calculated on the basis of the sorting. For that reason, the backchannel frequency ID(m) in Embodiment 4 changes its value in response to the number of times of the affirmative backchannel feedbacks even though the number of times of the backchannel feedbacks in one frame is the same. It is therefore possible to determine whether or not the second speaker is satisfied on the basis of whether the backchannel feedbacks are affirmative or negative even though the second speaker is a speaker who infrequently gives backchannel feedback by nature.
Note that the utterance condition determination device 5 according to the present embodiment can be applied not only to the recording device 12 illustrated in
The recording device 15 includes a first AD converter unit 1501, a second AD converter unit 1502, a voice filing processor unit 1503, an operation unit 1504, and a display unit 1505.
The first AD converter unit 1501 converts a voice signal collected by the first microphone 13A from an analog signal to a digital signal. The second AD converter unit 1502 converts a voice signal collected by the second microphone 13B from an analog signal to a digital signal. In the following descriptions, the voice signal collected by the first microphone 13A is a voice signal of the first speaker and the voice signal collected by the second microphone 13B is a voice signal of the second speaker.
The voice filing processor unit 1503 generates an electronic file (a voice file) of the voice signal of the first speaker converted by the first AD converter unit 1501 and the voice signal of the second speaker converted by the second AD converter unit 1502. The voice filing processor unit 1503 stores the generated voice file in the storage device 1601 of the server 16.
The operation unit 1504 is a button switch etc. used for operating the recording device 15. For example, when an operator of the recording device 15 starts recording by operating the operation unit 1504, a start command of prescribed processing is input from the operation unit 1504 to the voice filing processor unit 1503. When the operator of the recording device 15 performs an operation to reproduce the recorded voice (a voice file stored in the storage device 1601), the recording device 15 reproduces the voice file read out from the storage device 1601 with a speaker that is not illustrated in the drawing. The recording device 15 also has the utterance condition determination device 5 determine the utterance condition of the second speaker at the time of reproducing the voice file.
The display unit 1505 displays the determination result (the satisfaction level of the second speaker) etc. of the utterance condition determination device 5.
Meanwhile, the server 16 includes a storage device 1601 and the utterance condition determination device 5. The storage device 1601 stores various data files including voice files generated in the voice filing processor unit 1503 of the recording device 15. The utterance condition determination device 5 determines the utterance condition (the satisfaction level) of the second speaker at the time of reproducing a voice file (a record of conversation between the first speaker and the second speaker) stored in the storage device 1601.
The voice section detection unit 541 detects a voice section in voice signals of the first speaker (voice signals collected by the first microphone 13A). Similarly to the voice section detection unit 501 of the utterance condition determination device 5 according to Embodiment 1, the voice section detection unit 541 detects a section in which the power obtained from a voice signal is at or above a certain threshold TH from among the voice signals of the first speaker as a voice section.
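As an example, the power-based detection of a voice section may be sketched as follows. The formulae (1) and (2) are not reproduced in this part of the text, so the use of the mean squared amplitude as the power, the fixed section length, and the function name are assumptions for this sketch.

```python
def detect_voice_sections(samples, section_len, threshold):
    """Return a flag u1(L) per section: 1 when the power is at or above TH, else 0.

    The power is ASSUMED to be the mean squared amplitude of the section; the
    specification's own definition in formulae (1) and (2) may differ.
    """
    flags = []
    for start in range(0, len(samples) - section_len + 1, section_len):
        section = samples[start:start + section_len]
        power = sum(x * x for x in section) / section_len
        flags.append(1 if power >= threshold else 0)
    return flags
```

Consecutive sections flagged 1 form one voice section, whose first and last sections give the start time and the end time used when calculating the speech duration.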
The backchannel section detection unit 542 detects a backchannel section in voice signals of the second speaker (voice signals collected by the second microphone 13B). Similarly to the backchannel section detection unit 502 of the utterance condition determination device 5 according to Embodiment 1, the backchannel section detection unit 542 performs morphological analysis of the voice signals of the second speaker and detects a section that matches any piece of backchannel data registered in a backchannel dictionary as a backchannel section.
The backchannel frequency calculation unit 543 calculates the number of times of backchannel feedbacks of the second speaker per speech duration of the first speaker as a backchannel frequency of the second speaker. The backchannel frequency calculation unit 543 sets a certain unit of time to be one frame and calculates the backchannel frequency based on the speech duration calculated from the voice section of the first speaker within a frame and the number of times of backchannel feedbacks calculated from the backchannel section of the second speaker. Similarly to Embodiment 1, the backchannel frequency calculation unit 543 in the utterance condition determination device 5 according to the present embodiment calculates a backchannel frequency IA(m) provided from the formula (4).
The average backchannel frequency estimation unit 544 estimates an average backchannel frequency of the second speaker. The average backchannel frequency estimation unit 544 calculates (estimates) an average of the backchannel frequency of the second speaker based on a voice section of the second speaker in a time period in which a prescribed number of frames have elapsed from the voice start time of the second speaker. The average backchannel frequency estimation unit 544 performs processing similar to the voice section detection unit 541 and detects a voice section in the voice signals of a prescribed number of frames (e.g., two frames) from the voice start time of the second speaker. The average backchannel frequency estimation unit 544 calculates a continuous speech duration Tj and a cumulative speech duration Tall of the second speaker from the start time startj′ to the end time endj′ of the detected voice section. The continuous speech duration Tj and the cumulative speech duration Tall are calculated from the following formulae (19) and (20), respectively.
Furthermore, the average backchannel frequency estimation unit 544 calculates a time Tsum provided from the following formula (21) by using the continuous speech duration Tj and the cumulative speech duration Tall.
Tsum=ξ1·Tj+ξ2·Tall (21)
In the formula (21), ξ1 and ξ2 are weighting coefficients and ξ1=ξ2=0.5 is given as an example.
Afterwards, the average backchannel frequency estimation unit 544 calculates an average backchannel frequency JE corresponding to the calculated time Tsum by referencing the correspondence table 545a of average backchannel frequency stored in the storage unit 545. Additionally, when a change is made to the speaker information info2(n) of the second speaker, the average backchannel frequency estimation unit 544 stores info2(n−1) and the average backchannel frequency JE in the speaker information list 545b of the storage unit 545. When a change is made to the speaker information info2(n) of the second speaker, the average backchannel frequency estimation unit 544 references the speaker information list 545b of the storage unit 545. When the changed speaker information info2(n) is on the speaker information list 545b, the average backchannel frequency estimation unit 544 reads out an average backchannel frequency JE corresponding to the changed speaker information info2(n) from the speaker information list 545b and outputs the average backchannel frequency JE to the determination unit 546. On the other hand, when the changed speaker information info2(n) is not on the speaker information list 545b, the average backchannel frequency estimation unit 544 uses a prescribed initial value JE0 as an average backchannel frequency JE until a prescribed number of frames has elapsed and calculates an average backchannel frequency JE in the above-described manner when a prescribed number of frames has elapsed.
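The estimation described above, i.e., the formula (21), the reference to the correspondence table 545a, and the handling of the speaker information list 545b, may be sketched as follows. The table entries, the initial value JE0, and all function and class names are hypothetical placeholders; Tj and Tall are taken as inputs because the formulae (19) and (20) are not reproduced in this text.

```python
from bisect import bisect_right

# HYPOTHETICAL stand-in for the correspondence table 545a:
# (lower bound of Tsum in seconds, average backchannel frequency JE).
TABLE_545A = [(0.0, 0.10), (10.0, 0.15), (20.0, 0.20)]
JE0 = 0.15  # hypothetical prescribed initial value

def t_sum(t_j, t_all, xi1=0.5, xi2=0.5):
    """Formula (21): Tsum = xi1*Tj + xi2*Tall."""
    return xi1 * t_j + xi2 * t_all

def lookup_je(tsum, table=TABLE_545A):
    """Return the JE of the row whose range contains Tsum."""
    bounds = [lower for lower, _ in table]
    return table[max(bisect_right(bounds, tsum) - 1, 0)][1]

class SpeakerList:
    """Stand-in for the speaker information list 545b."""

    def __init__(self):
        self._je_by_info = {}

    def on_speaker_change(self, previous_info, previous_je, new_info):
        """Store the outgoing speaker's JE, then return (JE, found) for the new one."""
        if previous_info is not None:
            self._je_by_info[previous_info] = previous_je
        if new_info in self._je_by_info:
            return self._je_by_info[new_info], True
        # Not on the list: JE0 is used until enough frames have elapsed.
        return JE0, False
```

Keeping the previously estimated JE per speaker means that a returning speaker does not need to wait for the prescribed number of frames before a determination can be made.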
The determination unit 546 determines the satisfaction level of the second speaker, i.e., whether or not the second speaker is satisfied, based on the backchannel frequency IA(m) calculated in the backchannel frequency calculation unit 543 and the average backchannel frequency JE calculated (estimated) in the average backchannel frequency estimation unit 544. The determination unit 546 outputs a determination result v(m) based on the criterion formula provided in the following formula (22).
In the formula (22), β1 and β2 are correction coefficients and β1=0.2 and β2=1.5 are given as an example.
The determination unit 546 transmits the calculated determination result v(m) to the recording device 15, has the display unit 1505 of the recording device 15 display the determination result and outputs the determination result to the response score calculation unit 547.
The response score calculation unit 547 calculates a satisfaction level V of the second speaker throughout a conversation between the first and second speakers. This satisfaction level V is calculated by using the formula (14) provided in Embodiment 3 as an example. The response score calculation unit 547 transmits this overall satisfaction level V to the recording device 15 and has the display unit 1505 of the recording device 15 display the overall satisfaction level V.
Although Embodiments 1 to 3 calculate an average backchannel frequency based on a backchannel frequency of the second speaker, the present embodiment calculates (estimates) an average backchannel frequency based on the speech duration (voice section) of the second speaker as described above. A speaker who has a longer speech duration tends to make backchannel feedbacks more frequently than a speaker who has a shorter speech duration. For that reason, as in a correspondence table 545a illustrated in
The utterance condition determination device 5 according to the present embodiment performs the processing provided in
The utterance condition determination device 5 reads out voice files of the first and second speakers (step S500). Step S500 is performed by a readout unit (not illustrated) provided in the utterance condition determination device 5. The readout unit in the utterance condition determination device 5 reads out voice files of the first and second speakers corresponding to a conversation record designated through the operation unit 1504 of the recording device 15 from the storage device 1601. The readout unit outputs a voice file of the first speaker to the voice section detection unit 541 and the average backchannel frequency estimation unit 544. The readout unit also outputs a voice file of the second speaker to the backchannel section detection unit 542 and the average backchannel frequency estimation unit 544.
Next, the utterance condition determination device 5 performs the average backchannel frequency estimation processing (step S501). Step S501 is performed by the average backchannel frequency estimation unit 544. After detecting a voice section in the voice signals of two frames (60 seconds) from the voice start time of the second speaker, the average backchannel frequency estimation unit 544 calculates a time Tsum by using the formulae (19) to (21). Afterwards, the average backchannel frequency estimation unit 544 references a correspondence table 545a of average backchannel frequency stored in the storage unit 545 and outputs to the determination unit 546 an average backchannel frequency JE corresponding to the calculated time Tsum as an average backchannel frequency of the second speaker.
Next, the utterance condition determination device 5 performs processing to detect a voice section from the voice file of the first speaker (step S502) and processing to detect a backchannel section from the voice file of the second speaker (step S503). Step S502 is performed by the voice section detection unit 541. The voice section detection unit 541 calculates a detection result u1(L) of a voice section in the voice file of the first speaker by using the formulae (1) and (2). The voice section detection unit 541 outputs the voice section detection result u1(L) to the backchannel frequency calculation unit 543. Step S503 is performed by the backchannel section detection unit 542. The backchannel section detection unit 542, after detecting a backchannel section by the above-described morphological analysis etc., calculates the detection result u2(L) of the backchannel section by using the formula (3). The backchannel section detection unit 542 outputs the detection result u2(L) of the backchannel section to the backchannel frequency calculation unit 543.
Note that in the flowchart in
When the processing in steps S502 and S503 is ended, the utterance condition determination device 5, next, calculates a backchannel frequency of the second speaker based on the voice section of the first speaker and the backchannel section of the second speaker (step S504). Step S504 is performed by the backchannel frequency calculation unit 543. The backchannel frequency calculation unit 543 calculates the backchannel frequency IA(m) provided from the formula (4) by using the detection result of the voice section and the detection result of the backchannel section in the mth frame as explained in Embodiment 1.
The utterance condition determination device 5, next, determines the satisfaction level of the second speaker based on the average backchannel frequency JE and the backchannel frequency IA(m) of the second speaker and outputs a determination result (step S505). Step S505 is performed by the determination unit 546. The determination unit 546 calculates a determination result v(m) by using the formula (22).
Next, the utterance condition determination device 5 adds 1 to the number of frames of the satisfaction level corresponding to the value of the calculated determination result v(m) (step S506). Step S506 is performed by the response score output unit 547. Here, the number of frames of the satisfaction level is c0, c1, and c2 used in the formula (14). When the determination result v(m) is 0 as an example, 1 is added to a value of c0 in step S506. When the determination result v(m) is 1 or 2, 1 is added to a value of c1 or a value of c2, respectively, in step S506.
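The counting in step S506 may be sketched as follows. The function name is hypothetical, and the sketch stops at the counts c0, c1, and c2 because the formula (14) that turns them into the satisfaction level V is not reproduced in this part of the text.

```python
from collections import Counter

def count_levels(determinations):
    """Return (c0, c1, c2): the number of frames whose v(m) is 0, 1, and 2."""
    counts = Counter(determinations)
    return counts[0], counts[1], counts[2]
```

For example, the determination results 0, 1, 1, 2, 1 over five frames give c0=1, c1=3, and c2=1.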
The utterance condition determination device 5, next, calculates a response score of the first speaker based on the number of frames of the satisfaction level and outputs the calculated response score (step S507). Step S507 is performed by the response score output unit 547. In step S507, the response score output unit 547 calculates the satisfaction level V of the second speaker by using the formula (14), and this satisfaction level V becomes a response score of the first speaker. The response score output unit 547 also outputs the calculated satisfaction level V (a response score) to a speaker (not illustrated) of the recording device 15.
After calculating the response score, the utterance condition determination device 5 decides whether or not to continue the processing (step S508). When the processing is not continued (step S508; NO), the utterance condition determination device 5 ends the readout of the voice files of the first and second speakers and ends the processing.
On the other hand, when the processing is continued (step S508; YES), the utterance condition determination device 5, next, checks whether or not a change has been made to speaker information of the second speaker (step S509). When no change has been made to speaker information info2(n) of the second speaker (step S509; NO), the utterance condition determination device 5 repeats the processing in step S502 and subsequent steps. When a change has been made to the speaker information info2(n) of the second speaker (step S509; YES), the utterance condition determination device 5 brings the processing back to step S501, calculates the average backchannel frequency JE for the changed second speaker and performs the processing in step S502 and subsequent steps.
As described above, Embodiment 5 uses an average JE of backchannel frequency calculated on the basis of a continuous speech duration Tj and a cumulative speech duration Tall of the second speaker as an average backchannel frequency. For that reason, even though the second speaker is, for example, a speaker who infrequently gives backchannel feedback by nature, the average backchannel frequency can be calculated appropriately and therefore whether or not the second speaker is satisfied can be determined.
Note that the utterance condition determination device 5 according to the present embodiment can be applied not only to the recording system 14 illustrated in
In addition, the configuration of the utterance condition determination device 5 and the processing performed by the utterance condition determination device 5 are not limited to the configurations or the processing provided as an example in Embodiments 1 to 5.
The utterance condition determination device 5 provided as an example in Embodiments 1 to 5 can be realized by, for example, a computer and a program executed by the computer.
The processor 1701 is a processing unit such as a Central Processing Unit (CPU) and controls the entire operations of the computer 17 by executing various programs including an operating system.
The main storage device 1702 includes a Read Only Memory (ROM) and a Random Access Memory (RAM). ROM in the main storage device 1702 records in advance prescribed basic control programs etc. that are read out by the processor 1701 at the time of startup of the computer 17, for example. RAM in the main storage device 1702 is used as a working storage area when necessary when the processor 1701 executes various programs. RAM in the main storage device 1702 can be used, for example, for temporary storage (retaining) of an average backchannel frequency that is an average of backchannel frequency etc., a voice section of the first speaker, and a backchannel section of the second speaker.
The auxiliary storage device 1703 is a high-capacity storage device, such as a Hard Disk Drive (HDD) or a Solid State Drive (SSD), whose capacity is larger than that of the main storage device 1702. The auxiliary storage device 1703 stores various programs executed by the processor 1701, various pieces of data and so forth. The programs stored in the auxiliary storage device 1703 include a program that causes the computer 17 to execute the processing illustrated in
The input device 1704 is, for example, a keyboard device or a mouse device, and when an operator of the computer 17 operates the input device 1704, input information associated with the content of the operation is transmitted to the processor 1701.
The display device 1705 is a liquid crystal display as an example. The liquid crystal display displays various texts, images, etc. in accordance with display data transmitted from the processor 1701 and so forth.
The interface device 1706 is, for example, an input/output device to connect electronic devices such as a microphone 201 and a receiver (speaker) 203 to the computer 17.
The recording medium driver unit 1707 is a device to read out programs and data recorded in a portable recording medium that is not illustrated in the drawing and to write data etc. stored in the auxiliary storage device 1703 in the portable recording medium. A flash memory having a Universal Serial Bus (USB) connector, for example, can be used as the portable recording medium. Additionally, optical discs such as Compact Disc (CD), Digital Versatile Disc (DVD), and Blu-ray Disc (Blu-ray is a trademark) can be used as the portable recording medium.
The communication device 1708 is a device that can communicate with the computer 17 and other computers etc. or that can connect the computer 17 and other computers etc. so as to be able to communicate with each other through a communication network such as the Internet.
The computer 17 can work as the voice call processor unit 202 and the display unit 204 in the first phone set 3 and the utterance condition determination device 5, for example, illustrated in
Furthermore, it is possible to cause the computer 17 to execute the processing to generate voice files from the voice signals of the first and second speakers for each voice call, as an example. The generated voice files may be stored in the auxiliary storage device 1703 or may be stored in the portable recording medium through the recording medium driver unit 1707. Moreover, the generated voice files can be transmitted to other computers connected through the communication device 1708 and the communication network.
Note that the computer 17 operated as the utterance condition determination device 5 does not need to include all of the elements illustrated in
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2015-171274 | Aug 2015 | JP | national |