The present invention relates to a technology for assisting a conference.
In recent years, devices have been proposed that sense the state of a conference from the voices in the conference to facilitate speeches and make the conference more efficient. Such devices are called conference assistance devices. Japanese Unexamined Patent Application Publication No. 2011-223092 discloses an example of such devices. In that publication, in teleconferencing using a network, to provide speaking opportunities to all conference participants, a next-speaking recommendation value is automatically determined from the voice input histories of the participants and the durations of silence, and a speaking voice volume is adjusted in response to the value.
It is difficult to know when to speak in a conference. The difficulty increases particularly when the conference is a teleconference, when the social standings, positions, and views of the participants differ, or when the participants do not know each other well. With the conventional technology, it is difficult to know a suitable speaking timing, and it is also difficult to take the willingness of a participant to speak into consideration.
It is thus desirable to efficiently facilitate speeches of conference participants.
A preferable aspect of the present invention is a conference assistance system that indicates, based on information inputted from an interface, a score recommending a speech of a participant in a conference.
Another preferable aspect of the present invention is a conference assistance method executed by an information processing device, in which a score recommending a speech of a participant in a conference is calculated based on information inputted from an interface.
As a further specific aspect, at least one of a voice and an image of a current speaker is inputted. Alertness of the current speaker is estimated based on at least one of the voice and the image of the current speaker, and a first timing score is estimated based on the alertness.
As a further specific aspect, speech recommendations from other participants are inputted, and a second timing score is estimated based on a total of the speech recommendations from the other participants. The value of each speech recommendation decreases as time passes after the speech recommendation is made.
As a further specific aspect, a text of the speech content of a current speaker and a text of a past speech of a score calculation subject are inputted, and a third timing score is estimated based on a relationship between the speech content of the current speaker and the past speech of the score calculation subject.
Speeches of conference participants can be efficiently facilitated.
Hereafter, embodiments are described with reference to the drawings. The present invention is not limited to the descriptions of the following embodiments. Those skilled in the art will readily understand that the specific configuration of the invention can be modified without departing from the spirit and scope of the present invention.
In the configurations of the invention described below, the same parts or parts having a similar function are denoted by the same reference sign across different drawings, and duplicative description may be omitted.
Multiple components having the same or a similar function may be denoted by the same reference sign with different suffixes. When the components do not need to be distinguished, the suffixes may be omitted.
The terms "first," "second," and "third" are attached to identify components and do not necessarily limit the number, order, or contents of the components. Numbers for identifying components are used per context, and a number used in one context does not necessarily denote the same component in another context. A component identified by a certain number is not prevented from also having a function of a component identified by another number.
The position, size, shape, and range of each component shown in the drawings may differ from the actual ones to facilitate understanding of the invention. Therefore, the present invention is not necessarily limited to the positions, sizes, shapes, and ranges disclosed in the drawings.
The publications, patents, and patent applications cited in this specification form part of the explanation of this specification as they are.
Components expressed in the singular in this specification also include the plural unless a specific context clearly indicates otherwise.
An example of a system explained in the following embodiments is as follows. A score indicating whether the current timing is appropriate for speaking is indicated to conference participants individually or simultaneously. This score is called a speech timing score. The score is calculated from any one, two, or all three of the following: alertness of a current speaker, recommendations from other participants, and a relationship between a speech of the current speaker and a past speech of a score calculation subject. The score is indicated to the participants as the current speech timing score.
With such a system, conference participants can know an appropriate speech timing. Additionally, a speech opportunity can be efficiently provided to a participant who hesitates to speak.
In the first embodiment, a speech timing score of each participant is calculated from alertness estimated from the voice and face image of a current speaker, and the speech timing score is then presented. In this embodiment, for example, when the alertness of the speaker is not high, the speech timing score is calculated to be high.
Hereafter, the hardware configuration of this embodiment is described with reference to the drawings.
The functions such as calculations and controls are achieved by the CPUs 1001, 1006, and 1015 executing programs stored in the memories 1002, 1007, and 1016 in cooperation with other hardware. A program, a function of the program, or a means of achieving the function may be called a "function," "section," "portion," "unit," or "module."
The flow of processing in this embodiment is as follows.
The alertness estimation portion 102 estimates alertness through a machine learning model based on either or both of the inputted speaker face image 100 and speaker voice 101, or through a rule-based model based on a feature value such as the amplitude or speech speed of the speaker voice 101. The alertness can be used as an evaluation index of how excited or emotional the speaker is.
The alertness estimated in the alertness estimation portion 102 is inputted into the speech timing score estimation portion 103, and a speech timing score 104 is outputted from the speech timing score estimation portion 103. The speech timing score 104 is defined as a function in inverse proportion to the alertness. For example, the timing score is low when the speaker is excited and high when the speaker is calm; speaking may thus be easier when the timing score is high. The speech timing score 104 outputted from the speech timing score estimation portion 103 is displayed on the image output I/F 1012 in the personal terminal 1005.
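The following is a minimal sketch of this inverse mapping from alertness to a speech timing score. The [0, 1] range for alertness and the particular linear mapping are assumptions for illustration; the embodiment only requires that the score decrease as alertness increases.

```python
# A hypothetical mapping from estimated alertness to a speech timing score.
# The [0, 1] alertness range and the linear form are assumptions; the
# embodiment only requires an inverse relationship.

def speech_timing_score(alertness: float) -> float:
    """Return a timing score in [0, 1] that falls as alertness rises."""
    alertness = min(max(alertness, 0.0), 1.0)  # clamp to the assumed range
    return 1.0 - alertness  # calm speaker -> high score, excited -> low score


if __name__ == "__main__":
    for a in (0.1, 0.5, 0.9):  # calm, neutral, and excited speaker
        print(f"alertness={a:.1f} -> timing score={speech_timing_score(a):.1f}")
```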
As above, in this embodiment, a speech timing score of each participant is calculated from the alertness of a current speaker. For example, when a participant with a high social status or an influential participant attends a conference, this embodiment is effective in making the other participants speak more easily.
In this embodiment, the feature value estimated from the voice and face image of a current speaker is alertness. The feature value may also include other emotions of the current speaker.
Based on at least one of the properties of the speaker and the participants, the speech timing score may be weighted. For example, when the status of a current speaker is high, the speech timing score is lowered, and when the status of a participant (the speech timing score calculation subject) is high, the speech timing score is raised. Such information may be acquired from an unillustrated personnel database.
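A minimal sketch of such status-based weighting follows. The normalized status scale and the specific multipliers are assumptions; the text only states the direction of each adjustment.

```python
# A hypothetical status-based weighting of a speech timing score. Statuses
# are assumed normalized to [0, 1]; the multipliers are illustrative only.

def weighted_timing_score(base_score: float, speaker_status: float,
                          subject_status: float) -> float:
    """Lower the score for a high-status speaker, raise it for a high-status subject."""
    score = base_score * (1.0 - 0.5 * speaker_status)  # high-status speaker -> lower
    score *= 1.0 + 0.5 * subject_status                # high-status subject -> higher
    return min(max(score, 0.0), 1.0)


if __name__ == "__main__":
    # an executive is speaking (status 0.9); the subject is a junior member (0.2)
    print(f"{weighted_timing_score(0.6, speaker_status=0.9, subject_status=0.2):.2f}")
```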
In the second embodiment, a speech timing score of each participant is calculated from recommendations from other participants, and is presented. Any participant can recommend a speech of any other participant by using the personal terminals 1005 and 1014 at any timing. A speech recommendation is inputted, for example, from the command input I/F 1022 in the personal terminal 1005.
In Equation 1, γτ is the total value of speech recommendations for the speech timing score calculation subject at a time τ, and f(τ) is zero for τ > t, maximum at τ = t, and monotonically decreases as τ decreases.
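A minimal sketch of this decaying recommendation score follows. Equation 1 itself is not reproduced here; from the stated properties of f(τ), a weighted sum of the form Σ γτ·f(τ) with an exponential decay is assumed, and the decay rate is an illustrative choice.

```python
# A hypothetical realization of the second embodiment's recommendation score.
# The form S_rt = sum over tau of gamma_tau * f(tau), and the exponential
# decay f(tau) = exp(-decay * (t - tau)), are assumptions consistent with
# the stated properties: f is zero for tau > t, maximal at tau = t, and
# monotonically decreasing as tau decreases.
import math

def recommendation_score(recommendations: dict[float, float], t: float,
                         decay: float = 0.1) -> float:
    """Sum the recommendation totals, discounting older recommendations."""
    score = 0.0
    for tau, gamma in recommendations.items():
        if tau > t:
            continue  # f(tau) = 0 for tau > t
        score += gamma * math.exp(-decay * (t - tau))
    return score


if __name__ == "__main__":
    # recommendations totalling 1.0 at tau = 2 s and 2.0 at tau = 8 s
    recs = {2.0: 1.0, 8.0: 2.0}
    print(f"S_rt at t = 10 s: {recommendation_score(recs, t=10.0):.3f}")
```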
A speech timing score 107 outputted from the speech timing score estimation portion is displayed on the image output I/F 1012 in the personal terminal 1005 and on the image output I/F 1021 in the personal terminal 1014.
The method of displaying the speech timing score is the same as that of the first embodiment. As above, in this embodiment, a speech timing score of each participant is calculated from recommendations from other participants. This embodiment is effective, for example, in a conference in which free thinking is expected.
In the third embodiment, a speech timing score of each participant is calculated from a relationship between a speech of a current speaker and a past speech of a score calculation subject, and is presented. Hereafter, this embodiment is described with reference to the drawings.
A speech 108 of a current speaker and a past speech voice 109 of a score calculation subject are inputted into the voice recognition portion 110. The voice recognition portion 110 estimates a speech text of the speech 108 of the current speaker and a speech text of the past speech voice 109 of the score calculation subject through a known speech recognition technique. The estimated speech texts are inputted into the speech timing score estimation portion 111.
The speech timing score estimation portion 111 estimates a speech timing score 112 based on a relationship between the speech text estimated from the speech 108 of the current speaker and the speech text estimated from the past speech voice 109 of the score calculation subject. An example of the estimation may include a function to acquire a high score when the relevance between both texts is high.
The speech timing score estimation portion 111 can use, for example, a supervised machine learning model. Alternatively, the texts are subjected to vector transformation, and the estimation is made based on the number of occurrences or the frequency of the same or similar words, or on the contextual similarity.
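A minimal sketch of the word-count variant follows. The whitespace tokenizer and cosine similarity are illustrative assumptions; the embodiment equally allows a supervised model or contextual similarity.

```python
# A hypothetical relevance score between the current speech and a subject's
# past speech, using bag-of-words vectors and cosine similarity. The
# tokenizer and the similarity measure are assumptions for illustration.
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def relevance_timing_score(current_text: str, past_text: str) -> float:
    """High when the current speech relates to the subject's past speech."""
    return cosine_similarity(Counter(current_text.lower().split()),
                             Counter(past_text.lower().split()))


if __name__ == "__main__":
    current = "we should reduce the latency of the conference system"
    past = "latency in the system was my main concern last meeting"
    print(f"S_ct = {relevance_timing_score(current, past):.3f}")
```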
In this figure, the pooled past speech voices 109 of the score calculation subject are inputted into the voice recognition portion 110. Alternatively, the speech text data estimated from the past speech voices 109 of the score calculation subject through speech recognition may be pooled. The speech 108 of the current speaker may also be transformed into text by a different system and inputted from an interface. The method of displaying the speech timing score is the same as that of the first embodiment and the second embodiment.
As above in this embodiment, a speech timing score of each participant is calculated from a relationship between a speech of a current speaker and a past speech of a score calculation subject. This embodiment is effective, for example, when a speech of a participant who has knowledge about or is interested in a current topic is to be facilitated.
In the fourth embodiment, a speech timing score of each participant is calculated from a combination of two or more of three elements including alertness of a current speaker, recommendations from other participants, and a relationship between a speech of the current speaker and a past speech of a score calculation subject, and presented.
Hereafter, this embodiment is described with reference to the drawings. The hardware configuration in this embodiment is the same as that of the first to third embodiments.
In this embodiment, either or both of a speaker face image 113 and a speaker voice 114 are inputted into the alertness estimation portion 116. As in the first embodiment, alertness is estimated through a machine learning model based on either or both of the speaker face image 113 and the speaker voice 114, or through a rule-based model based on a feature value such as the amplitude or speech speed of the speaker voice 114.
The alertness estimated in the alertness estimation portion 116 is inputted into the Sat estimation portion 117. The Sat estimation portion 117 outputs a speech timing score Sat based on the alertness. As in the first embodiment, Sat is defined as a function in inverse proportion to the alertness.
As in the third embodiment, the speaker voice 114 and the past speech voice 115 of a score calculation subject are inputted into the voice recognition portion 118. The voice recognition portion 118 estimates a speech text of each of the speaker voice 114 and the past speech voice 115 of the score calculation subject through a known speech recognition technique. The estimated speech texts are inputted into the Sct estimation portion 119. As in the third embodiment, the Sct estimation portion 119 estimates Sct based on a relationship between the speech text estimated from the speaker voice 114 and the speech text estimated from the past speech voice 115 of the score calculation subject. An example of the estimation may include a function that acquires a high score when the relevance between both texts is high. In this figure, as in the third embodiment, the pooled past speech voices 115 of the score calculation subject are inputted into the voice recognition portion 118. Alternatively, the speech text data estimated from the past speech voices 115 of the score calculation subject by speech recognition may be pooled.
Speech recommendations 120 from other participants are inputted into the Srt estimation portion 121 as in the second embodiment. The speech recommendations 120 from the other participants are acquired from the command input I/F 1022 in the personal terminal 1005.
In Equation 2, γτ is the total value of speech recommendations for the speech timing score calculation subject at a time τ, and f(τ) is zero for τ > t, maximum at τ = t, and monotonically decreases as τ decreases.
Sat estimated in the Sat estimation portion 117, Sct estimated in the Sct estimation portion 119, and Srt estimated in the Srt estimation portion 121 are inputted into the speech timing score St estimation portion 122, and the speech timing score St is then outputted. The speech timing score St estimation portion 122 calculates the speech timing score St based on the following equation.
St = wa·Sat + wr·Srt + wc·Sct
In this equation, wa, wr, and wc are arbitrary weights and are adjusted to control the contributions of Sat, Srt, and Sct to St. The values of wa, wr, and wc are desirably changed based on the features of a conference, and some preset patterns can be prepared.
Some examples of the preset patterns are described. The first pattern is for a conference in which a person of higher social status and a person of lower social status participate. To give consideration to the person of higher social status in this case, the value of wa is set higher than wr and wc. The value of wa can also be automatically increased only during a speech of a specific speaker.
The second pattern is for a conference that requires free thinking. In this case, to emphasize speech recommendations from other participants, the value of wr is set higher than wa and wc. The third pattern is for a conference in which persons of similar social status participate. In this case, to emphasize the context of the conference, the value of wc is set higher than wa and wr. Before or during a conference, a user (for example, a chairperson) may choose the feature of the conference from the preset patterns, or the values of wa, wr, and wc may be specified directly. A minimal sketch of this weighted combination with such presets is shown below.
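The following sketch assumes illustrative numeric weights; the text states only which weight is set higher in each pattern.

```python
# A hypothetical weighted combination St = wa*Sat + wr*Srt + wc*Sct with
# preset weight patterns loosely following the three examples above. The
# numeric values of the weights are assumptions.

PRESETS = {
    # higher- and lower-status persons in the conference: emphasize wa
    "status_gap":    {"wa": 0.6, "wr": 0.2, "wc": 0.2},
    # free thinking expected: emphasize recommendations, wr
    "free_thinking": {"wa": 0.2, "wr": 0.6, "wc": 0.2},
    # similar-status participants: emphasize conference context, wc
    "peer_meeting":  {"wa": 0.2, "wr": 0.2, "wc": 0.6},
}

def combined_score(s_at: float, s_rt: float, s_ct: float,
                   preset: str = "peer_meeting") -> float:
    """Combine the three element scores using a chosen preset pattern."""
    w = PRESETS[preset]
    return w["wa"] * s_at + w["wr"] * s_rt + w["wc"] * s_ct


if __name__ == "__main__":
    for name in PRESETS:
        print(name, f"{combined_score(0.8, 0.3, 0.5, preset=name):.2f}")
```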
The fifth embodiment provides a simpler system than the first to fourth embodiments. The speech timing scores St of all participants are calculated through any one of the methods of the first to fourth embodiments. When the speech timing scores St of all the participants are equal to or less than a predetermined threshold, a signal illuminates on devices referenceable by all the participants or by a specific participant, indicating that "any participant now has an appropriate speech timing."
Hereafter, this embodiment is described with reference to the drawings.
The speech timing score outputted from the speech timing score estimation portion 901 is inputted into the speech timing signal transmission portion 124. The speech timing signal transmission portion 124 outputs a speech timing signal 125 when the inputted speech timing score is equal to or less than a fixed threshold. The timing signal is indicated to the conference participants by the signal transmitter 1029, the voice output I/Fs 1010 and 1019, or the image output I/Fs 1012 and 1021.
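A minimal sketch of this signal logic follows. The threshold value and the option of triggering on a fraction of the participants (mentioned below) are assumptions for illustration.

```python
# A hypothetical speech timing signal: raised when the timing scores of all
# participants (or at least a given fraction of them) are at or below a
# threshold. The threshold and fraction values are illustrative assumptions.

def speech_timing_signal(scores: list[float], threshold: float = 0.3,
                         fraction: float = 1.0) -> bool:
    """True when at least `fraction` of the scores are <= threshold."""
    if not scores:
        return False
    below = sum(1 for s in scores if s <= threshold)
    return below / len(scores) >= fraction


if __name__ == "__main__":
    print(speech_timing_signal([0.2, 0.1, 0.25]))  # True: signal illuminates
    print(speech_timing_signal([0.2, 0.6, 0.25]))  # False: one score too high
```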
As above, in this embodiment, instead of indicating a speech timing score of each conference participant, when the speech timing scores of all the participants (or a predetermined percentage of the participants) are equal to or less than a predetermined threshold, the signal that "any participant now has an appropriate speech timing" is indicated to an unspecified number of the participants. This embodiment is effective for a simply configured conference assistance system.
The sixth embodiment assumes a device that speaks automatically among participants, not only in a conference but also in a conversation among multiple persons. This automatic speech device is called a speech robot. The speech timing score explained in the first to fourth embodiments is calculated for the speech robot to facilitate or suppress the speech of the speech robot.
Hereafter, this embodiment is described with reference to the drawings.
The speech facilitation suppression control portion 126 determines whether to facilitate or suppress a speech of the robot based on the inputted speech timing score 123, and outputs a speech facilitation suppression coefficient. As a method of determining the speech facilitation suppression coefficient, a threshold for the speech timing score is provided: when the speech timing score is equal to or greater than the threshold, the coefficient indicates facilitation, and when the speech timing score is less than the threshold, the coefficient indicates suppression. Alternatively, the speech timing score may be multiplied by an arbitrary coefficient to determine speech facilitation suppression coefficients of successive values.
The speech facilitation suppression coefficient may be defined through any procedure. The speech facilitation suppression coefficient herein is a value between zero and one: the lower the value, the more a speech is suppressed, and the higher the value, the more a speech is facilitated. A speech text generation portion 127 generates and outputs a speech text of the speech robot through a known rule-based or machine learning technique. The speech facilitation suppression coefficient outputted from the speech facilitation suppression control portion 126 and the speech text outputted from the speech text generation portion 127 are inputted into a speech synthesis portion 128. Based on the inputted value of the speech facilitation suppression coefficient, the speech synthesis portion 128 determines whether to synthesize a speech voice signal from the inputted speech text. Upon determining to do so, the speech synthesis portion 128 synthesizes a speech voice signal 129. The determination may be made through a method in which a threshold is provided for the speech timing score of each speech, or through a combination of this method and another known method. The outputted speech voice signal 129 is converted into a speech waveform by the voice output I/F 1038 in the speech robot 1033.
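A minimal sketch of this control follows. The linear mapping of the timing score to a coefficient in [0, 1] and the 0.5 synthesis threshold are assumptions; the embodiment allows either a binary threshold or successive coefficient values.

```python
# A hypothetical facilitation/suppression control for a speech robot. The
# linear score-to-coefficient mapping and the synthesis threshold are
# illustrative assumptions.

def facilitation_coefficient(timing_score: float, gain: float = 1.0) -> float:
    """Map a speech timing score to a coefficient in [0, 1]; high facilitates."""
    return min(max(gain * timing_score, 0.0), 1.0)

def should_synthesize(coefficient: float, threshold: float = 0.5) -> bool:
    """Synthesize the robot's speech only when facilitation is high enough."""
    return coefficient >= threshold


if __name__ == "__main__":
    for score in (0.2, 0.7):
        c = facilitation_coefficient(score)
        print(f"score={score} -> coefficient={c:.2f}, speak={should_synthesize(c)}")
```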
According to the embodiments described above, speech opportunities for participants can be actively indicated during a conference as a score with which the system recommends a speech. The indication is possible using numerical values, a time series graph, or the lighting of a signal when the score is lower or higher than a threshold. The score may be indicated to all participants or to a specific participant such as a chairperson. A participant who sees the score can numerically recognize that the participant can easily speak, is expected to speak, or can provide a meaningful speech.
This application claims priority to Japanese Patent Application No. 2019-152897, filed in August 2019 (JP, national).