The present application claims priority from Japanese Application JP 2023-212134, the content of which is hereby incorporated by reference into this application.
The disclosure relates to an information processing device and an information processing method.
In a known browsing system, identification information of an utterer and text data of a speech recognition result are stored in association with each other per piece of time information of an utterance. On an utterance list screen of a series of speech recognition results, markers can be assigned to utterances, and an utterance can be searched for according to a specified search condition. Search results are displayed on a search screen.
In the related art, even in a case that a meeting participant himself/herself performs an operation such as a note input on a terminal device or the like, the content of the operation is not reflected in contents of automatically generated minutes, and thus the contents of the minutes may not be accurate.
An object of the disclosure is to provide an information processing device and an information processing method in which accurate minutes data is generated based not only on character data converted from speech data for each utterance during a meeting but also on the content of an operation performed on a terminal device by a participant.
According to the disclosure, an information processing device includes one or more processors. The one or more processors acquire speech information including speech data and an utterance time per utterance issued by a plurality of participants of a meeting, convert the speech data into character data, acquire operation information including operation content and an operation time per operation performed by the plurality of participants, and generate minutes data based on the speech information, the character data, and the operation information.
According to the disclosure, an information processing method is a computer-implemented information processing method including acquiring speech information including speech data and an utterance time per utterance issued by a plurality of participants of a meeting, converting the speech data into character data, acquiring operation information including operation content and an operation time per operation performed by the plurality of participants, and generating minutes data based on the speech information, the character data, and the operation information.
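As a non-limiting illustration of the information handled by this method, the following Python sketch defines minimal records for the speech information Ix and the operation information Iy. The class names, field names, and types are assumptions chosen only to mirror the terms used above and are not part of the disclosure.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SpeechInfo:
    """Speech information Ix for one utterance X (hypothetical record)."""
    speaker: str               # utterer name, when known in advance
    utterance_time: datetime   # utterance time tx
    speech_data: bytes         # speech data Dx (raw audio)
    character_data: str = ""   # character data Dc after speech recognition


@dataclass
class OperationInfo:
    """Operation information Iy for one operation Y (hypothetical record)."""
    operator: str              # participant P who performed the operation
    operation_time: datetime   # operation time ty
    content: str               # operation content Cy (e.g. typed text, "click", "tap")
```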
According to the disclosure, the minutes data is generated based not only on the character data converted from the speech data for each utterance during the meeting but also on the content of the operation performed on the terminal device by the participant. As a result, the content of the operation by the participant is appropriately reflected, enabling minutes to be more accurately generated.
Embodiments of the disclosure will be described below with reference to the drawings. Note that, in the drawings, the same or equivalent components are denoted by the same reference numerals and signs, and description thereof will not be repeated.
First, an overall configuration of a minutes generating system 1 including an information processing device 100 that automatically generates minutes of a meeting will be described with reference to
As illustrated in
The information processing device 100 includes a Central Processing Unit (CPU) 110, a display 120 including a display screen 121, a storage 130, a communication unit 140, and a speech input part 150.
The speech input part 150 receives a sound of an utterance X issued by each participant P and outputs speech data Dx to a speech information acquiring unit 111. The speech input part 150 may output, to the speech information acquiring unit 111, not only directly input speech but also, for example, speech data transmitted from the outside via a network and received by the communication unit 140.
The speech input part 150 includes, but is not limited to, for example, a microphone. The speech input part 150 may be built in the information processing device 100. The speech input part 150 may be a wired or wireless hand microphone, or a microphone of a wireless headset. Two or more of these may be used in combination. In a case that a wireless headset is connected to the terminal device 10, speech data from the microphone is received by the information processing device 100 via a communication unit 16 and the communication unit 140, and transmitted to the speech input part 150. However, the wireless headset may be directly connected to the information processing device 100 wirelessly. In a case that a plurality of wireless headsets are used, each participant P uses a different wireless headset, and thus utterances X of a plurality of participants P are not mixed in one piece of speech data. This facilitates identification of the utterer of the speech data.
The CPU 110 controls each unit of the information processing device 100. The CPU 110 includes the speech information acquiring unit 111, a speech recognizing unit 112, an operation information acquiring unit 113, a minutes generating unit 114, a display controller 115, and a clock unit 116.
The CPU 110 may be, for example, one or more processors, one or more Micro Processing Units (MPUs), or one or more control devices or arithmetic devices, but is not limited thereto. For example, given that the storage 130 stores an Operating System (OS; also referred to as "basic software") operating on the CPU 110 and programs for the functions corresponding to the respective units described above, the CPU 110 executes the OS and the programs to embody those units. Examples of the OS include, but are not limited to, Microsoft Windows (registered trademark), Android (registered trademark), and Linux (registered trademark). Each of the above units may include, but is not limited to, one or more electronic circuits, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), or the like.
The clock unit 116 has a clock function and outputs at least hour, minute, and second data for the current point in time. The clock unit 116 may further include a calendar function and output year, month, and day data.
The speech information acquiring unit 111 acquires speech information Ix including speech data Dx and an utterance time tx for each of the utterances X issued by the plurality of participants P of the meeting. The speech information acquiring unit 111 acquires the speech data Dx from the speech input part 150 and acquires the utterance time tx from the clock unit 116. For example, in a case that each participant P uses an individual wireless headset or the like and thus the utterer of each piece of speech data Dx is known in advance, the speech information Ix may include an utterer name.
The speech recognizing unit 112 converts the speech data Dx into character data Dc. Examples of speech recognition include, but are not limited to, existing technologies in which Artificial Intelligence (AI) analyzes words uttered by a human being or a conversation between human beings and converts them into text data.
The operation information acquiring unit 113 acquires operation information Iy including operation content Cy and an operation time ty per operation Y performed by the participant P. Examples of the operation content Cy include an input of a character, a symbol, or the like, and a specific operation. For example, the input of a character, a symbol, or the like includes, but is not limited to, an input of a predetermined character string related to the progress of a meeting, or an input of a special symbol (for example, ? or ⋆). Examples of the specific operation include, but are not limited to, a click operation, a double click operation, and a tap operation.
The minutes generating unit 114 generates minutes data Dm based on the speech information Ix, the character data Dc, and the operation information Iy.
Therefore, the minutes data is generated based not only on the character data Dc converted from the speech data Dx for each utterance X during the meeting but also on the operation content of an operation such as note input performed by the participant P. As a result, the operation content Cy of the operation by the participant P is appropriately reflected, enabling minutes to be more accurately generated. Note that the minutes data Dm includes, but is not limited to, for example, the content or summary of each utterance X (so-called “transcription”), and an excerpt of the utterance content.
The minutes generating unit 114 may generate the minutes data Dm by regarding the operation content Cy whose operation time ty is included in a predetermined proximity time tn determined per piece of character data Dc as corresponding to that character data Dc. The predetermined proximity time tn is the period from the utterance time tx corresponding to the character data Dc until a predetermined time has elapsed. The predetermined time is assumed to be, for example, several seconds to over ten seconds, but is not limited to such a time. Further, in a case that the operation content Cy includes character input data, the corresponding character data Dc may be identified based on how closely the character data Dc approximates the input data. Combining the proximity determination with this approximation determination based on the input characters allows the corresponding character data Dc to be identified more accurately. Note that the predetermined proximity time tn will be described later with reference to
Therefore, the minutes data Dm is generated by reflecting, in the character data Dc converted from the speech data Dx for each utterance X, the operation content Cy likely to actually correspond to the character data Dc. As a result, the minutes can be generated more accurately.
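A minimal sketch of this proximity determination is shown below, assuming the hypothetical SpeechInfo and OperationInfo records from the earlier sketch and a fixed window of ten seconds for the predetermined time; the window length and the function name are assumptions for illustration only.

```python
from datetime import timedelta

PROXIMITY_WINDOW = timedelta(seconds=10)  # assumed value; "several seconds to over ten seconds"


def match_operations(speech: list[SpeechInfo],
                     operations: list[OperationInfo]) -> dict[int, list[OperationInfo]]:
    """Associate the n-th character data Dc(n) with every operation whose operation
    time ty falls within the proximity time tn starting at the utterance time tx(n)."""
    matches: dict[int, list[OperationInfo]] = {}
    for n, s in enumerate(speech):
        window_end = s.utterance_time + PROXIMITY_WINDOW
        matches[n] = [op for op in operations
                      if s.utterance_time <= op.operation_time <= window_end]
    return matches
```

An approximation check between input characters and the character data Dc could be layered on top of this time-based matching, as the paragraph above suggests.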
The display 120 displays the character data Dc. As will be described later, the operation content Cy may be visibly displayed. Further, the display 120 may display the minutes data Dm generated at regular intervals. As a result, the operation content Cy and how the minutes data Dm is created can be visually checked.
The display 120 may or may not include a memory function. Examples of the display including no memory function include, but are not limited to, liquid crystal and Organic Electro-Luminescence (EL). Examples of the display including a memory function include, but are not limited to, electronic paper.
The display controller 115 may display the character data Dc and the operation information Iy on the display 120. Therefore, the display 120 displays the character data Dc converted from the speech data Dx and the operation information Iy including the operation content Cy and the operation time ty. As a result, what type of operation is performed for what kind of utterance X can be visually checked.
The storage 130 stores the speech data Dx in association with the utterance time tx, and stores the operation content Cy in association with the operation time ty. Therefore, the speech data Dx for each utterance X is stored in association with the utterance time tx, and the operation content Cy is stored in association with the operation time ty. As a result, the minutes data Dm can be generated not only during the meeting but also after the meeting has ended.
The storage 130 also stores information and data necessary for controlling each unit of the information processing device 100. The storage 130 may store or save an OS, programs, and the like executed by the CPU 110. The storage 130 includes a memory, specifically, a volatile memory and a nonvolatile memory. Examples of the volatile memory include, but are not limited to, a Dynamic Random Access Memory (DRAM) and a Static Random Access Memory (SRAM). Examples of the nonvolatile memory include, but are not limited to, a Read-Only Memory (ROM), a flash memory, a Solid State Drive (SSD), and a hard disk.
The communication unit 140 is connected to the communication unit 16 of the terminal device 10 in a wired or wireless manner. As a result, the terminal device 10 and the information processing device 100 can communicate bidirectionally.
Next, an example of determining whether the utterance X and the operation Y by each participant P are temporally proximate to each other will be described with reference to FIG. 2.
As illustrated in
Here, the utterance time tx corresponds to the start time of each utterance X (the left end of the rectangle), but is not limited to such correspondence. The utterance time tx may be, for example, an intermediate time between the start time and the end time.
Each piece of speech data Dx is provided with the character data Dc obtained by conversion by the speech recognizing unit 112. Like the speech data Dx, the character data Dc is also managed with a serial number or time information attached to the character data Dc. The n-th speech data Dx(n) corresponds to the n-th character data Dc(n).
On the other hand, the operation content Cy of an operation performed by the participant P during the meeting is similarly managed with a serial number or time information attached to the operation content Cy. For example, the m-th operation content Cy is represented here as "Cy(m)".
As described above, the minutes generating unit 114 regards the operation content Cy corresponding to the operation time ty included in the predetermined proximity time tn determined for each character data Dc as corresponding to the character data Dc. For example, under the character data Dc(n) in the upper center of
That is, the minutes generating unit 114 determines that the operation content Cy(m) is temporally proximate to the n-th utterance X. This means that the operation content Cy(m) is likely to correspond to the n-th utterance X.
However, for example, when the duration of an utterance X is long, as with the (n+1)-th utterance X, it may be difficult to accurately determine which part of the utterance X an operation Y is temporally proximate to. In such a case, the utterance X is preferably divided into sentences, and the parts obtained by the division are dealt with separately. For example, in a case that the (n+1)-th utterance X is divided into four parts that are dealt with separately, the second part can be determined to be temporally proximate to the operation content Cy(m+1).
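One possible way to implement this division is sketched below. It assumes, purely for illustration, that the duration of the utterance is spread evenly over its sentences and that a simple punctuation-based segmentation is sufficient.

```python
import re
from datetime import datetime


def split_long_utterance(text: str, start: datetime, end: datetime) -> list[tuple[datetime, str]]:
    """Split a long utterance X into sentences and assign each part an estimated
    start time, so that proximity to an operation Y can be judged per sentence."""
    # Naive sentence segmentation on sentence-ending punctuation (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?。])\s*", text) if s]
    step = (end - start) / max(len(sentences), 1)
    return [(start + i * step, sentence) for i, sentence in enumerate(sentences)]
```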
As described above, the terminal device 10 is a device operated by the participant P to create the participant's own minutes, notes, and the like during, before, or after the meeting. Examples of the terminal device 10 include, but are not limited to, a notebook computer, a smartphone, and a tablet.
As illustrated in
The imaging part 12 captures an image of the face of the participant P who is operating the terminal device 10. The participants P can be distinguished from each other by recognition of the face images. The imaging part 12 may be, but is not limited to, for example, a built-in camera.
The operation input part 13 receives an operation input from the user. Examples of the operation input part 13 include, but are not limited to, a button, a keyboard, and a touch panel that enables a touch operation.
Examples of the display 14 include, but are not limited to, liquid crystal and organic electro-luminescence (EL). The display 14 may be a touch panel, and may also be used as the operation input part 13.
Next, examples of display of minutes by the information processing device 100 will be described with reference to
As illustrated in
For example, inputting a simple sentence such as “Description by the other party” causes the simple sentence to be displayed on the display 14. The operation time ty is displayed right-aligned in the same row as that in which the simple sentence is displayed.
For example, at a part that the participant P could not hear, inputting a special symbol (in this case, "?"), as in "the number of employees???", causes the content of the minutes to be supplemented based on the character data Dc, with, for example, "the number of employees is 120" displayed if possible.
In a case that “⋆” is preset as a symbol for designating an important part, and for example, “⋆1000 or more implementing companies” is input, the operation time ty is highlighted. Examples of the highlighting include, but are not limited to, display in a more eye-catching color or in bold.
For example, providing an input including a character string related to the progress of the meeting, such as "conclusion", causes a separation between subjects to be recorded at the position corresponding to the operation time ty. The display 120 may explicitly indicate to the user how the subjects are separated, or the separation may not be displayed and instead be reflected only when the minutes generating unit 114 generates the minutes.
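The display behaviors in these examples could be driven by a simple rule table such as the following sketch. The symbols and the keyword "conclusion" are taken from the examples above, while the action labels and the function name are hypothetical.

```python
def annotate_input(content: str) -> str:
    """Decide how a note input from the terminal device 10 is reflected in the
    displayed minutes (sketch of the example behaviors described above)."""
    if "?" in content:
        # Part the participant could not hear: supplement from the character data Dc if possible.
        return "supplement_from_character_data"
    if content.startswith("*"):
        # Preset symbol designating an important part: highlight the entry.
        return "highlight"
    if "conclusion" in content.lower():
        # Character string related to the progress of the meeting: record a subject separation.
        return "separate_subject"
    return "plain_note"
```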
As illustrated in
For an utterance X for which an operation Y determined to be temporally proximate to the utterance X is present, the operation content Cy of the operation Y is displayed right-aligned in the same row. The operation time ty and the operator name may be displayed together with the operation content Cy. For example, the utterance X of Kawasaki at 14:15 is displayed in such a manner that a reaction made at 14:19 by Yoshimura, who is a participant P, can be recognized at a glance. This display does not indicate the content of the reaction, but may be replaced with a display that allows the content to be known. Note that an application (annotation application) suitable for the disclosure is activated on the terminal device 10 used by Yoshimura, and the display 14 indicates that Yoshimura input "The ayes have it." at 14:19. Note that the display position and the display method are not limited to those described above.
In a case that the operation content Cy includes an input of a character string C related to the progress of the meeting, the minutes generating unit 114 may separate the minutes data Dm for each subject at a position corresponding to the character string C. Therefore, in a case that the operation content Cy of the participant P includes the input of the character string C related to the progress of the meeting, the minutes data Dm is separated for each subject at the position corresponding to the character string. As a result, the minutes data Dm is accurately separated for each subject.
The display 120 of the information processing device 100 indicates that Tanaka said "Thank you. I would like to move on to the next subject." at 14:24. In response to this utterance X, the display 14 of the terminal device 10 indicates that Yoshimura input "----" at 14:28. Here, "-" or a run of "-" characters is a character string related to the progress of the meeting. Thus, the minutes generating unit 114 may separate the minutes data Dm at the position corresponding to "----". A separation line may be displayed at the separation position. In addition, a character string "topic division" may be displayed together with the operation time ty and the operator name in the same row as the separation line. Note that the display position and the display method are not limited to those described above.
In a case that the operation content Cy includes a predetermined operation or an input of a predetermined symbol, the minutes generating unit 114 may generate the minutes data Dm by treating the part corresponding to the operation content Cy as an important part. Therefore, when the operation content Cy includes a predetermined operation or an input of a predetermined symbol, the part corresponding to the operation content Cy is determined to be an important part. As a result, the important part in the minutes data Dm is accurately distinguished.
The display 120 of the information processing device 100 indicates that Yamada said “I have had it summarized including supplements to the previously reported matters” at 15:04. The display 14 of the terminal device 10 indicates that, in response to this utterance X, Yoshimura input “!” at 15:04. Here, “!” is a symbol indicating an important part of the meeting. Thus, the minutes generating unit 114 may generate the minutes data Dm using, as an important part, a part corresponding to “!”. A mark or the like indicating importance may be displayed in the same row as that of the utterance X indicated as an important part, together with the operation time ty and the operator name. Specifically, the mark may be displayed in a more eye-catching color or in reverse video. Note that the display position and the display method are not limited to those described above.
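Taken together, the subject separation and the importance marking might be applied to already-generated minutes lines as in the following sketch. The separator "----", the importance symbol "!", and the markers added to the text are only illustrative, and the function and parameter names are assumptions.

```python
def apply_operation_marks(minutes_lines: list[str],
                          matched_ops: dict[int, list[str]]) -> list[str]:
    """Insert subject separations and importance marks into generated minutes lines.

    minutes_lines : one transcribed line per utterance X, in chronological order
    matched_ops   : index of an utterance -> operation contents Cy judged
                    temporally proximate to it
    """
    result: list[str] = []
    for n, line in enumerate(minutes_lines):
        ops = matched_ops.get(n, [])
        if any("!" in op for op in ops):                   # "!" designates an important part
            line = "[IMPORTANT] " + line
        result.append(line)
        if any(op and op.strip("-") == "" for op in ops):  # "----" marks a topic division
            result.append("---- topic division ----")
    return result
```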
As illustrated in
Based on the important parts, the minutes generating unit 114 may extract main points from the generated minutes data Dm and generate a summary of the minutes data Dm. Therefore, not only is the minutes data Dm generated, but main points are also extracted and a summary is generated based on the parts considered important by the participant P. As a result, more accurate main points and a more accurate summary can be obtained.
Note that the main points and the summary may be collectively generated as minutes after the meeting. Alternatively, they may be generated or corrected in real time in accordance with the progress of the utterances X of the participants P and the operations Y from the terminal devices 10, or in accordance with the quantity of utterances, the elapsed time from the beginning of the meeting, the timing of separation between subjects, or the like, to update the display content of the display screen 121.
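As a rough sketch only, the main-point extraction could start from the lines marked as important in the earlier sketch; how the actual summary is produced (for example, by further condensing these lines) is left open by the disclosure.

```python
def summarize(minutes_lines: list[str], max_points: int = 5) -> dict[str, list[str]]:
    """Use the parts marked important as candidate main points and as seeds for a
    summary (illustrative marker and selection rule; not part of the disclosure)."""
    important = [line.removeprefix("[IMPORTANT] ") for line in minutes_lines
                 if line.startswith("[IMPORTANT]")]
    return {"main_points": important[:max_points],
            "summary": important}  # a real system might condense these further
```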
Next, an outline of an information processing method for generating minutes by the information processing device 100 will be described with reference to
As illustrated in
In step S2, the speech recognizing unit 112 analyzes the speech data Dx and converts the speech data Dx into the character data Dc. Note that step S2 is an example of a “speech recognizing step” of the disclosure.
In step S3, the operation information acquiring unit 113 acquires the operation information Iy including the operation content Cy and the operation time ty for each of the operations Y performed by the participants P. Note that step S3 is an example of an “operation information acquiring step” of the disclosure.
Finally, in step S4, the minutes generating unit 114 generates the minutes data Dm based on the speech information Ix, the character data Dc, and the operation information Iy, and then terminates the series of processing operations. Note that step S4 is an example of a “minutes generating step” of the disclosure.
Therefore, the minutes data is generated based not only on the speech recognition result (character data Dc) obtained by analyzing the speech data Dx for each utterance X during the meeting but also on the operation content Cy such as a note input performed by the participant P. As a result, the operation content Cy of the operation by the participant P is appropriately reflected, enabling minutes to be more accurately generated.
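Putting steps S1 to S4 together, a minimal end-to-end sketch could look as follows, reusing the hypothetical SpeechInfo and OperationInfo records and the match_operations and apply_operation_marks helpers from the earlier sketches, and assuming a speech recognizer is available as a callable `recognize`.

```python
def run_minutes_pipeline(speech: list[SpeechInfo],
                         operations: list[OperationInfo],
                         recognize) -> list[str]:
    # S1: the speech information Ix (speech data Dx + utterance time tx) has been acquired.
    # S2: convert each piece of speech data Dx into character data Dc.
    for s in speech:
        s.character_data = recognize(s.speech_data)
    # S3: the operation information Iy (operation content Cy + operation time ty) has been acquired;
    #     associate it with the character data via the proximity time tn.
    matched = match_operations(speech, operations)
    # S4: generate the minutes data Dm from Ix, Dc, and Iy.
    lines = [f"{s.utterance_time:%H:%M} {s.speaker}: {s.character_data}" for s in speech]
    ops_text = {n: [op.content for op in ops] for n, ops in matched.items()}
    return apply_operation_marks(lines, ops_text)
```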
Next, a case where the plurality of participants P perform operation inputs and the like on the respective terminal devices 10 will be described with reference to
As illustrated in
In a case that the operation information acquiring unit 113 acquires the operation information Iy including the operation content Cy from a plurality of participants P, the minutes generating unit 114 may regard the part of the minutes data Dm corresponding to the operation content Cy from the plurality of participants P as an important part. Therefore, input operations from a plurality of participants P cause the part of the minutes data Dm corresponding to the operation contents Cy of those input operations to be regarded as an important part in the minutes data Dm. As a result, the important part in the minutes data is more accurately distinguished.
On the display 120 of the information processing device 100, each transcribed utterance X is chronologically displayed together with the utterance time tx and the utterer name, and the operation content Cy is also displayed, as in the case of the display illustrated in
In response to the utterance of Yamada "I will adopt plan A. Do you have any objections?" at 14:04, Kawasaki issued an utterance at 14:15, and Kinoshita issued an utterance at 14:19. As seen in the minutes, Takeda made a reaction at 14:05 and Yoshimura made a reaction at 14:19, although neither issued an utterance. The minutes clearly indicate how and why, in response to the utterances and reactions described above, Tanaka determined that agreement had been obtained from the participants P and said "Thank you. I would like to move on to the next subject then." at 14:24.
The occurrence of a plurality of reactions to a specific utterance X indicates that the utterance X can be determined to be likely to correspond to an important part. Note that, with a small number of participants P, the occurrence of a plurality of reactions alone may be enough to determine that the utterance is likely to correspond to an important part, whereas this is not the case with a large number of participants P. For example, a threshold for the number of persons may be provided, and the utterance may be determined to be likely to correspond to an important part only in a case that the number of reacting persons exceeds the threshold. In addition, for example, the weight of the determination may be changed according to the role, the post, or the like of the participant P. For example, while reactions from five or more rank-and-file employees may be required, a reaction from even one official may cause the utterance to be determined to be highly likely to correspond to an important part. Regarding the role, the post, or the like of the participant P, for example, the role, the post, or the like, or the weight of the determination may be stored in the storage 130 in association with the participant name.
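A hedged sketch of such a weighted determination is shown below; the role labels, the per-role weights, and the threshold value are assumptions for illustration only, chosen so that one official or five rank-and-file reactions suffice, as in the example above.

```python
# Assumed weights per role; the disclosure leaves the concrete values open.
ROLE_WEIGHTS = {"official": 5.0, "rank_and_file": 1.0}
IMPORTANCE_THRESHOLD = 5.0  # assumed threshold


def is_likely_important(reacting_roles: list[str]) -> bool:
    """Judge whether an utterance X is likely an important part, based on the roles
    of the participants P who reacted to it (weighted-threshold sketch)."""
    score = sum(ROLE_WEIGHTS.get(role, 1.0) for role in reacting_roles)
    return score >= IMPORTANCE_THRESHOLD
```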
The disclosure may be embodied in other various forms without departing from the spirit or essential characteristics of the disclosure. Thus, the above embodiments are merely examples in all respects and should not be interpreted as limiting. The scope of the disclosure is indicated by the claims and is not limited to the description. Furthermore, all modifications and changes within the range of equivalency of the claims are included in the scope of the disclosure.
The disclosure can be utilized for an information processing device, an information processing method, and the like.
While there have been described what are at present considered to be certain embodiments of the invention, it will be understood that various modifications may be made thereto, and it is intended that the appended claims cover all such modifications as fall within the true spirit and scope of the invention.
Number | Date | Country | Kind
--- | --- | --- | ---
2023-212134 | Dec 2023 | JP | national