The present case relates to a storage medium, an editing support method, and an editing support device.
It is known that voice data including speech data of a plurality of speakers is reproduced, and a user transcribes the speech data of each speaker into text and sets a speaker name indicating the speaker for each piece of speech data. Furthermore, it is also known that voice data is classified on the basis of voice characteristics, and arbitrary speaker identification information is given to each piece of classified voice data (e.g., see Patent Document 1).
According to an aspect of the embodiments, a non-transitory computer-readable storage medium storing an editing support program that causes at least one computer to execute a process, the process includes: displaying, on a display unit, information that indicates a speaker identified with a sentence generated based on voice recognition in association with a section of the sentence, the section corresponding to the identified speaker; when a first editing process that edits an identification result of the speaker occurs and respective speakers of two or more sections that are adjacent are common due to the first editing process, displaying the two or more sections in a combined state on the display unit; and when a start point of a section to be subject to a second editing process that edits the identification result of the speaker is specified in a specific section within the combined two or more sections and a location that corresponds to a start point of one of the two or more sections before being combined is present between the specified start point and an end point of the combined two or more sections, applying the second editing process to a section from the specified start point to the location that corresponds to the start point of the one of the two or more sections.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
The speaker identification information obtained from the voice characteristics may change depending on the physical condition of the speaker and the like. As a result, there is a possibility that the speaker identification information represents a wrong speaker. In this case, there is a problem in that the user has to spend time and effort on the editing process of the speaker identification information.
In view of the above, an object in one aspect is to improve the convenience of an editing process for an identification result of a speaker.
It is possible to improve the convenience of an editing process for an identification result of a speaker.
Hereinafter, embodiments of the present case will be described with reference to the drawings.
The display 100G displays various screens. Although details will be described later, the display 100G displays an editing support screen 10, for example. The editing support screen 10 is a screen that supports editing of a speaker identified with regard to a sentence generated on the basis of voice recognition. Identification of the speaker may be performed using artificial intelligence (AI), or may be performed using a predetermined voice model defined in advance without using AI.
A user who uses the terminal device 100 confirms the candidates for the speaker displayed on the editing support screen 10, and operates the keyboard 100F to select one of the candidates. As a result, the terminal device 100 edits the speaker, which has been identified on the basis of AI or the like and has not yet been edited, to the selected candidate speaker. In this manner, the user can easily edit the speaker by using the editing support screen 10. Note that, although a preparer of minutes of a conference will be described as an example of the user in the present embodiment, the user is not particularly limited to such a preparer. For example, the user may be a producer of broadcast subtitles, a person in charge of audio recording in a call center, or the like.
Next, a hardware configuration of the terminal device 100 will be described with reference to
Moreover, the terminal device 100 may also include at least one of a hard disk drive (HDD) 100E, an input/output I/F 100H, a drive device 100I, and a short-range wireless communication circuit 100J as needed. The CPU 100A to the short-range wireless communication circuit 100J are connected to each other by an internal bus 100K. For example, the terminal device 100 may be constructed by a computer. Note that a micro processing unit (MPU) may be used as a hardware processor instead of the CPU 100A.
A semiconductor memory 730 is connected to the input/output I/F 100H. Examples of the semiconductor memory 730 include a universal serial bus (USB) memory, a flash memory, and the like. The input/output I/F 100H reads a program and data stored in the semiconductor memory 730. The input/output I/F 100H has a USB port, for example. A portable recording medium 740 is inserted into the drive device 100I. Examples of the portable recording medium 740 include a removable disk such as a compact disc (CD)-ROM and a digital versatile disc (DVD). The drive device 100I reads a program and data recorded in the portable recording medium 740. The short-range wireless communication circuit 100J is an electric circuit or an electronic circuit that implements short-range wireless communication, such as Wi-Fi (registered trademark) and Bluetooth (registered trademark). An antenna 100J′ is connected to the short-range wireless communication circuit 100J. A CPU that implements a communication function may be used instead of the short-range wireless communication circuit 100J. The network I/F 100D has a local area network (LAN) port, for example.
Programs stored in the ROM 100C and the HDD 100E are temporarily loaded into the RAM 100B described above by the CPU 100A. The program recorded in the portable recording medium 740 is also temporarily loaded into the RAM 100B by the CPU 100A. When the CPU 100A executes the loaded program, the CPU 100A implements various functions to be described later and executes various kinds of processing to be described later. Note that it is sufficient that the program conforms to the flowcharts to be described later.
Next, a functional configuration of the terminal device 100 will be described with reference to
Here, the storage unit 110 includes, as constituent elements, a voice storage unit 111, a dictionary storage unit 112, a sentence storage unit 113, a model storage unit 114, and a point storage unit 115. The processing unit 120 includes, as constituent elements, a first display control unit 121, a voice recognition unit 122, a sentence generation unit 123, and a speaker identification unit 124. Furthermore, the processing unit 120 includes, as constituent elements, a voice reproduction unit 125, a speaker editing unit 126, a point management unit 127, and a second display control unit 128.
Each of the constituent elements of the processing unit 120 accesses at least one of the respective constituent elements of the storage unit 110 to execute various kinds of processing. For example, when the voice reproduction unit 125 detects an instruction for reproducing voice data, it accesses the voice storage unit 111 to obtain voice data stored in the voice storage unit 111. When the voice reproduction unit 125 obtains the voice data, it reproduces the voice data. Note that other constituent elements will be described in detail at the time of describing operation of the terminal device 100.
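By way of a non-limiting illustration, the following sketch (in Python, with hypothetical class and attribute names that do not appear in the present disclosure) shows one possible way to organize the storage unit 110 and a processing element such as the voice reproduction unit 125 so that the processing element can access the storage elements it needs:

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    """Storage unit 110: holds the data shared by the processing elements."""
    voice_store: dict = field(default_factory=dict)   # voice storage unit 111
    dictionary: dict = field(default_factory=dict)    # dictionary storage unit 112
    sentences: list = field(default_factory=list)     # sentence storage unit 113
    models: dict = field(default_factory=dict)        # model storage unit 114
    points: list = field(default_factory=list)        # point storage unit 115


class VoiceReproductionUnit:
    """Voice reproduction unit 125: reproduces voice data on request."""

    def __init__(self, storage: StorageUnit):
        self.storage = storage                        # access to storage unit 110

    def on_reproduction_instruction(self, key: str) -> None:
        voice_data = self.storage.voice_store.get(key)  # obtain voice data from unit 111
        if voice_data is not None:
            self.play(voice_data)                       # reproduce the obtained voice data

    def play(self, voice_data) -> None:
        ...  # playback itself is platform-specific and omitted from this sketch
```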
Next, operation of the terminal device 100 will be described with reference to
First, as illustrated in
The first registration button 21 is a button for registering voice data of a conference. In the case of registering voice data of a conference, the user prepares voice data of a conference recorded in advance in the terminal device 100. When the user performs operation of pressing the first registration button 21 with a pointer Pt, the first display control unit 121 detects the pressing of the first registration button 21. When the first display control unit 121 detects the pressing of the first registration button 21, it saves the voice data of the conference prepared in the terminal device 100 in the voice storage unit 111.
The second registration button 22 is a button for registering material data related to material of a conference. In the case of registering material data, the user prepares material data of a conference in advance in the terminal device 100. When the user performs operation of pressing the second registration button 22 with the pointer Pt, the first display control unit 121 detects the pressing of the second registration button 22. When the first display control unit 121 detects the pressing of the second registration button 22, it displays the material data prepared in the terminal device 100 in a first display area 20A in the portal screen 20.
The third registration button 23 is a button for registering participants of a conference. In the case of registering participants of a conference, the user performs operation of pressing the third registration button 23 with the pointer Pt. When the user performs operation of pressing the third registration button 23, the first display control unit 121 detects the pressing of the third registration button 23. When the first display control unit 121 detects the pressing of the third registration button 23, it displays, on the display unit 140, a registration screen (not illustrated) for registering the participants of the conference as speakers. When the user inputs a speaker (specifically, information indicating a speaker name) in the conference on the registration screen, the first display control unit 121 displays participant data including the input speaker in a second display area 20B in the portal screen 20. At the same time, the first display control unit 121 generates speaker ID, and saves it in the model storage unit 114 in association with the input speaker. The speaker ID is information that identifies the speaker. As a result, the model storage unit 114 stores the speaker ID and the speaker in association with each other.
The fourth registration buttons 24 are each a button for registering voice data of a speaker. In the case of registering voice data of a speaker, the user prepares various voice data of the speaker recorded in advance in the terminal device 100. A microphone may be connected to the terminal device 100, and the voice data obtained from the microphone may be used. When the user performs operation of pressing the fourth registration button 24 related to the speaker to be registered with the pointer Pt, the first display control unit 121 detects the pressing of the fourth registration button 24. When the first display control unit 121 detects the pressing of the fourth registration button 24, it outputs the voice data prepared in the terminal device 100 to the speaker identification unit 124.
The speaker identification unit 124 generates a learned model in which characteristics of the voice of the speaker are machine-learned on the basis of the voice data of the speaker output from the first display control unit 121. The speaker identification unit 124 saves, in the model storage unit 114, the generated learned model in association with the speaker ID of the speaker corresponding to the voice data to be learned. As a result, as illustrated in
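The present disclosure does not specify how the learned model is generated; as one hedged sketch, a per-speaker feature vector (standing in for the machine-learned voice characteristics) may be computed from the registered voice data and stored in the model storage unit keyed by the speaker ID. The compute_embedding function below is a placeholder for illustration only, not a method of the disclosure:

```python
import numpy as np


def compute_embedding(voice_samples: np.ndarray) -> np.ndarray:
    """Placeholder for learning the characteristics of a speaker's voice.
    A normalized magnitude spectrum is used here as a simple stand-in feature vector."""
    spectrum = np.abs(np.fft.rfft(voice_samples))
    return spectrum / (np.linalg.norm(spectrum) + 1e-9)


def register_speaker_model(model_store: dict, speaker_id: str,
                           voice_samples: np.ndarray) -> None:
    """Save the 'learned model' in the model storage unit, keyed by the speaker ID."""
    model_store[speaker_id] = compute_embedding(voice_samples)
```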
Returning to
When the processing of step S102 is complete, the sentence generation unit 123 then generates sentence data (step S103). More specifically, when the sentence generation unit 123 receives the character string data output by the voice recognition unit 122, it refers to the dictionary storage unit 112 to perform morphological analysis on the character string data. The dictionary storage unit 112 stores a morpheme dictionary. Various words and phrases are stored in the morpheme dictionary. For example, the morpheme dictionary stores words and phrases such as “yes”, “indeed”, “material”, and “question”. Therefore, when the sentence generation unit 123 refers to the dictionary storage unit 112 and performs the morphological analysis on the character string data, it generates sentence data in which the character string data is divided into a plurality of word blocks. When the sentence generation unit 123 generates sentence data, it saves the generated sentence data in the sentence storage unit 113 in association with an identifier of each word block. As a result, the sentence storage unit 113 stores the sentence data.
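As a minimal sketch of the sentence generation in step S103 (using a simple whitespace split as a stand-in for morphological analysis against the morpheme dictionary in the dictionary storage unit 112), the character string data may be divided into word blocks, each associated with an identifier:

```python
def generate_sentence_data(character_string: str) -> list[dict]:
    """Divide recognized text into word blocks and attach an identifier to each.
    Real morphological analysis would consult the morpheme dictionary; a plain
    split is used here purely for illustration."""
    word_blocks = character_string.split()
    return [{"id": f"{i:02d}", "text": block}
            for i, block in enumerate(word_blocks, start=1)]


# Example: generate_sentence_data("indeed yes I have a question about the material")
```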
When the processing of step S103 is complete, the speaker identification unit 124 then identifies the speaker (step S104). More specifically, the speaker identification unit 124 refers to the model storage unit 114 to compare the learned model stored in the model storage unit 114 with the voice data of the conference stored in the voice storage unit 111. The speaker identification unit 124 compares the learned model with the voice data of the conference, and in the case of detecting a voice part corresponding to (e.g., common to, similar to, or the like) the learned model in the voice data of the conference, it identifies the time code and the speaker ID associated with the learned model. In this manner, the speaker identification unit 124 identifies each speaker of various voice parts included in the voice data of the conference. When the speaker identification unit 124 identifies the speaker ID and the time code, it associates the identified speaker ID with the sentence data stored in the sentence storage unit 113 on the basis of the time code. As a result, as illustrated in
As illustrated in
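As one illustrative sketch of the speaker identification in step S104, each voice part of the conference (assumed here to carry a time code and a feature vector computed in the same way as during registration) may be compared against the learned models, and the speaker ID of the best-matching model may be attached; cosine similarity is used below as one possible measure of correspondence:

```python
import numpy as np


def identify_speakers(model_store: dict[str, np.ndarray],
                      segments: list[dict]) -> list[dict]:
    """For each voice part of the conference (a dict assumed to carry 'time_code'
    and a precomputed 'embedding'), find the learned model it corresponds to and
    attach that model's speaker ID."""
    results = []
    for seg in segments:
        emb = seg["embedding"] / (np.linalg.norm(seg["embedding"]) + 1e-9)
        best_id, best_score = None, -1.0
        for speaker_id, model in model_store.items():
            score = float(np.dot(emb, model / (np.linalg.norm(model) + 1e-9)))
            if score > best_score:
                best_id, best_score = speaker_id, score
        results.append({"time_code": seg["time_code"], "speaker_id": best_id})
    return results
```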
When the processing of step S104 is complete, the first display control unit 121 then displays the speaker and an utterance section (step S105). More specifically, when the processing of step S104 is complete, the first display control unit 121 stops displaying the portal screen 20 on the display unit 140, and displays the editing support screen 10 on the display unit 140. Then, the first display control unit 121 displays the speaker and the utterance section corresponding to the speaker in association with each other in the editing support screen 10.
Therefore, as illustrated in
In the script area 11, the time code and the characters of the sentence data stored in the sentence storage unit 113 are displayed in a state of being associated with each other. In particular, in the script column in the script area 11, characters from the first time code in which the speaker ID is switched to the last time code in which the continuity of the speaker ID stops are displayed in a combined manner in a time series. In the setting area 12, setting items related to a reproduction format of the voice data, setting items related to an output format of the sentence data after the speaker is edited, and the like are displayed.
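The grouping described above for the script column may be sketched as follows, assuming each word block carries hypothetical speaker_id, time_code, id, and text fields (not named in the present disclosure); consecutive word blocks with a common speaker ID are combined into one utterance section:

```python
from itertools import groupby


def build_utterance_sections(word_blocks: list[dict]) -> list[dict]:
    """Combine consecutive word blocks that share a speaker ID into one utterance
    section, keeping the first time code of each run, as in the script column."""
    sections = []
    for speaker_id, run in groupby(word_blocks, key=lambda b: b["speaker_id"]):
        run = list(run)
        sections.append({
            "speaker_id": speaker_id,
            "start_time_code": run[0]["time_code"],
            "text": " ".join(b["text"] for b in run),
            "block_ids": [b["id"] for b in run],
        })
    return sections
```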
As described above, the speaker and the utterance section are displayed in association with each other in the editing area 13. For example, a speaker “Oda” and an utterance section “ . . . , isn't it?” are displayed in association with each other in the editing area 13. Similarly, a speaker “Kimura” and an utterance section “indeed, yes, I have a question about the material” are displayed in association with each other. A speaker “Yamada” and an utterance section “please ask a question” are displayed in association with each other.
Furthermore, in the editing area 13, a progress mark 16 and a switching point 17 are displayed in addition to the speakers and the utterance sections. The progress mark 16 is a mark indicating the current playback position of the voice data. The switching point 17 is a point indicating switching of a word block (see
The switching point 17 can be moved to the left or right in response to operation performed on the input unit 130. For example, when the user performs operation of pressing a cursor key indicating a right arrow, the first display control unit 121 moves the switching point 17 to the right. When the user performs operation of pressing a cursor key indicating a left arrow, the first display control unit 121 moves the switching point 17 to the left. Note that, in a case where the switching point 17 is moved only in the rightward direction, the key for moving the switching point 17 may be a space key. It is sufficient if the key for moving the switching point 17 is appropriately determined according to design, experiments, and the like.
When the processing of step S105 is complete, the voice reproduction unit 125 then waits until a reproduction instruction is detected (NO in step S106). When the voice reproduction unit 125 detects a reproduction instruction (YES in step S106), it reproduces the voice data (step S107). More specifically, when the play button 14 (see
When the processing of step S107 is complete, the first display control unit 121 waits until a start point is specified (NO in step S108). When the start point is specified (YES in step S108), the first display control unit 121 displays a first editing screen (step S109). More specifically, as illustrated in
When the processing of step S109 is complete, the speaker editing unit 126 waits until a selection instruction is detected (NO in step S110). When the speaker editing unit 126 detects a selection instruction (YES in step S110), as illustrated in
Here, the speakers included in the first editing screen 30 are arranged side by side in order of precedence according to at least one of the utterance order and the utterance volume. For example, it is assumed that a speaker who moderates the conference tends to utter earlier than other speakers, and tends to have a larger utterance volume. Accordingly, on the first editing screen 30, the speakers are arranged side by side in descending order of the possibility of being selected in the editing. This makes it possible to reduce the time and effort of the editing process of the speaker.
When the speaker editing unit 126 detects the selection instruction, it determines that the editing process has occurred, applies the editing process to the partial utterance section identified by the first display control unit 121, edits the speaker of the partial utterance section to be the selected speaker, and displays it. In the present embodiment, the speaker editing unit 126 applies the editing process to the partial utterance section corresponding to the word block “indeed”, edits the speaker “Kimura” of the partial utterance section to be the selected speaker “Kimura”, and displays it. Note that, since there is no substantial change in this example, detailed descriptions will be given later.
When the processing of step S111 is complete, the speaker editing unit 126 determines whether or not the speakers are common (step S112). More specifically, the speaker editing unit 126 determines whether or not the edited speaker and the speaker of the previous utterance section located immediately before the partial utterance section corresponding to the word block of the edited speaker are common. In the present embodiment, the speaker editing unit 126 determines whether or not the edited speaker “Kimura” and the speaker “Oda” of the previous utterance section “ . . . , isn't it?” located immediately before the partial utterance section corresponding to the word block “indeed” of the edited speaker “Kimura” are common. Here, the speaker “Kimura” and the speaker “Oda” are not common, and thus the speaker editing unit 126 determines that the speakers are not common (NO in step S112).
If the speakers are not common, the speaker editing unit 126 skips the processing of steps S113 and S114, and determines whether or not the part after the start point has been processed (step S115). If the speaker editing unit 126 determines that the part after the start point has not been processed (NO in step S115), the first display control unit 121 executes the processing of step S109 as illustrated in
When the second processing of step S109 is complete and the speaker editing unit 126 detects a selection instruction in the processing of step S110, the speaker editing unit 126 edits the speaker in the processing of step S111 (see
When the processing of step S111 is complete, the speaker editing unit 126 again determines whether or not the speakers are common in the processing of step S112. In the present embodiment, the speaker editing unit 126 determines whether or not the edited speaker “Yamada” and the speaker “Yamada” of the utterance section “please ask a question” located immediately after the remaining utterance section corresponding to the plurality of word blocks “yes, I have a question about the material” of the edited speaker “Yamada” are common. Here, the two speakers “Yamada” are common, and thus the speaker editing unit 126 determines that the speakers are common (YES in step S112).
If the speakers are common, the speaker editing unit 126 displays the utterance sections in a combined state (step S113). More specifically, the speaker editing unit 126 displays, in a combined state, the two utterance sections whose speakers have become common as a result of the editing. At the same time, the speaker editing unit 126 displays one of the two speakers associated with the respective two utterance sections before the combination in association with the combined utterance section. As a result, the speaker editing unit 126 combines the remaining utterance section corresponding to the plurality of word blocks “yes, I have a question about the material” and the subsequent utterance section “please ask a question”, and displays the two utterance sections in a state of being combined as a new utterance section “yes, I have a question about the material please ask a question” as illustrated in
When the processing of step S113 is complete, the point management unit 127 then saves the division start point location (step S114). More specifically, the point management unit 127 sets, as division start point location data, the location of the start point that specifies the division between the two utterance sections before the combination, and saves it in the point storage unit 115 together with the start point corresponding to the location and the end point of the combined utterance section. As a result, the point storage unit 115 stores the division start point location data.
In the present embodiment, as illustrated in
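Steps S112 to S114 may be sketched as follows (only the comparison with the immediately following utterance section is shown, and an end_time_code field is assumed for each section; neither is a limitation of the disclosure). When the speakers become common, the two sections are combined and the former boundary is saved as the division start point location together with the start point and end point of the combined section:

```python
def combine_if_common(sections: list[dict], edited_index: int,
                      point_store: list[dict]) -> None:
    """Sketch of steps S112-S114: if the edited utterance section and the section
    immediately after it now have a common speaker, show them as one combined
    section and remember the former boundary (division start point location)."""
    if edited_index + 1 >= len(sections):
        return
    cur, nxt = sections[edited_index], sections[edited_index + 1]
    if cur["speaker_id"] != nxt["speaker_id"]:
        return  # speakers are not common, so nothing is combined (NO in step S112)
    point_store.append({
        "division_start": nxt["start_time_code"],  # start point of the later section before combining
        "combined_start": cur["start_time_code"],
        "combined_end": nxt["end_time_code"],
    })
    cur["text"] = cur["text"] + " " + nxt["text"]  # display the two sections in a combined state
    cur["end_time_code"] = nxt["end_time_code"]
    del sections[edited_index + 1]
```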
When the processing of step S114 is complete, the speaker editing unit 126 again determines whether or not the part after the start point has been processed in the processing of step S115. If the speaker editing unit 126 determines that the part after the start point has been processed (YES in step S115), the second display control unit 128 then waits until another start point is specified (NO in step S116). When another start point is specified (YES in step S116), the second display control unit 128 displays a second editing screen (step S117). More specifically, as illustrated in
When the processing of step S117 is complete, the speaker editing unit 126 waits until a selection instruction is detected (NO in step S118). When the speaker editing unit 126 detects a selection instruction (YES in step S118), it edits the speaker (step S119). More specifically, as illustrated in
When the processing of step S119 is complete, the second display control unit 128 redisplays the second editing screen (step S120). More specifically, as illustrated in
When the processing of step S120 is complete, the speaker editing unit 126 waits until a selection instruction is detected (NO in step S121). When the speaker editing unit 126 detects a selection instruction (YES in step S121), the point management unit 127 determines whether or not there is a division start point location (step S122). More specifically, the point management unit 127 refers to the point storage unit 115 to determine whether or not division start point location data is stored in the point storage unit 115.
If the point management unit 127 determines that there is a division start point location (YES in step S122), the speaker editing unit 126 edits the speaker up to the division start point location (step S123), and terminates the process. More specifically, as illustrated in
Furthermore, when the speaker editing unit 126 detects the selection instruction, it determines that the editing process has occurred, applies the editing process to the specific utterance section, edits the speaker of the specific utterance section to be the selected speaker, and displays it. In the present embodiment, as illustrated in
On the other hand, if the point management unit 127 determines that there is no division start point location (NO in step S122), the speaker editing unit 126 skips the processing of step S123, and terminates the process. Note that, if there is no division start point location, the speaker editing unit 126 may terminate the process after executing error processing.
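The determination of steps S122 and S123 may be sketched as follows, treating time codes as comparable integers purely for illustration; the second editing process is applied from the specified start point only up to the saved division start point location when such a location exists before the end point of the combined utterance section, and otherwise up to that end point:

```python
def second_edit_range(point_store: list[dict], specified_start: int,
                      combined_end: int) -> tuple[int, int]:
    """Sketch of steps S122-S123: return the range of the second editing process.
    If a saved division start point location lies between the specified start point
    and the end point of the combined section, the edit stops at that location."""
    candidates = [p["division_start"] for p in point_store
                  if specified_start < p["division_start"] <= combined_end]
    edit_end = min(candidates) if candidates else combined_end
    return specified_start, edit_end


# The speaker of the range returned by second_edit_range() is then changed to the
# speaker selected on the second editing screen 40.
```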
As described above, according to the first embodiment, the terminal device 100 includes the processing unit 120, and the processing unit 120 includes the first display control unit 121, the speaker editing unit 126, and the second display control unit 128. The first display control unit 121 displays, on the display unit 140, the information indicating the speaker identified with respect to the sentence data generated on the basis of voice recognition and the utterance section corresponding to the identified speaker in the sentence data in association with each other. In a case where an editing process of editing the identification result of the speaker occurs and respective speakers of two or more adjacent utterance sections become common due to the editing process, the speaker editing unit 126 displays, on the display unit 140, the two or more adjacent utterance sections in a combined state. In a case where a start point of the utterance section for performing the editing process of editing the identification result of the speaker is specified for a specific utterance section within the combined two or more utterance sections, and in a case where there is a location corresponding to a start point of any of the two or more utterance sections before the combination between the specified start point and the end point of the combined two or more utterance sections, the second display control unit 128 applies the editing process to the utterance section from the specified start point to that location. This makes it possible to improve the convenience of the editing process for the identification result of the speaker.
In particular, in a case where a learned model or a predetermined voice model is used to identify a speaker and the speaker utters a short word block, characteristics of the voice of the speaker may not be sufficiently discriminated, and the speaker may not be identified accurately. Examples of the short word block include a word block of about several characters, such as “yes”. In a case where the speaker cannot be identified accurately, there is a possibility that the terminal device 100 displays an erroneous identification result. Even in such a case, according to the present embodiment, it becomes possible to improve the convenience of an editing process for an identification result of a speaker.
Next, a second embodiment of the present case will be described with reference to
For example, regarding the characters “ques-” and “-tion” (Chinese characters) having a common identifier “09” of the word block as illustrated in
As described above, according to the second embodiment, it becomes possible to improve the convenience of an editing process for an identification result of a speaker even in the case of editing the speaker in units of characters.
Next, a third embodiment of the present case will be described with reference to
The editing support system ST includes a terminal device 100 and a server device 200. The terminal device 100 and the server device 200 are connected via a communication network NW. Examples of the communication network NW include a local area network (LAN), the Internet, and the like.
As illustrated in
In this case, the input unit 130 of the terminal device 100 is operated, and the voice data of the conference described above is stored in the storage unit 110 (more specifically, voice storage unit 111) via the two communication units 150 and 160. Furthermore, the input unit 130 is operated, and the voice data of the speaker described above is input to the processing unit 120 (more specifically, speaker identification unit 124) via the two communication units 150 and 160.
The processing unit 120 accesses the storage unit 110, obtains voice data of a conference, and performs various kinds of processing described in the first embodiment on the voice data of the conference to generate sentence data. Furthermore, the processing unit 120 generates a learned model in which characteristics of voice of a speaker are machine-learned on the basis of input voice data of the speaker. Then, the processing unit 120 identifies the speaker on the basis of the voice data of the conference and the learned model. The processing unit 120 outputs, to the communication unit 160, screen information of an editing support screen 10 that displays the identified speaker and the utterance section corresponding to the speaker in association with each other as a processing result. The communication unit 160 transmits the processing result to the communication unit 150, and the communication unit 150 outputs screen information to the display unit 140 upon reception of the processing result. As a result, the display unit 140 displays the editing support screen 10.
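As a non-limiting sketch of this exchange (the endpoint name and JSON response format below are assumptions for illustration and are not specified in the present disclosure), the terminal device may transmit the voice data to the server device and receive the screen information of the editing support screen 10 as follows:

```python
import json
import urllib.request


def request_editing_screen(server_url: str, conference_audio: bytes) -> dict:
    """Send the conference voice data to the server device and receive the screen
    information for the editing support screen. The '/editing-support' endpoint
    and JSON response format are hypothetical."""
    req = urllib.request.Request(
        url=f"{server_url}/editing-support",
        data=conference_audio,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # screen information: identified speakers and their utterance sections
        return json.loads(resp.read().decode("utf-8"))
```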
As described above, the terminal device 100 may not include the storage unit 110 and the processing unit 120, and the server device 200 may include the storage unit 110 and the processing unit 120. Furthermore, the server device 200 may include the storage unit 110, and another server device (not illustrated) connected to the communication network NW may include the processing unit 120. Such a configuration may be used as an editing support system. Even in such an embodiment, it becomes possible to improve the convenience of an editing process for an identification result of a speaker.
Although the preferred embodiments of the present invention have been described in detail thus far, the present invention is not limited to specific embodiments according to the present invention, and various modifications and alterations may be made within the scope of the gist of the present invention described in the claims. For example, in the embodiment described above, it has been described that the first editing screen 30 is successively and dynamically displayed. Meanwhile, the switching point 17 may be moved with a cursor key, and the first editing screen 30 may be displayed each time the enter key is pressed. Such control may be applied to the second editing screen 40. Furthermore, in a case where participant data is not registered, an identification character or an identification symbol may be adopted as an identification result instead of a speaker.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
This application is a continuation application of International Application PCT/JP2019/010793 filed on Mar. 15, 2019 and designated the U.S., the entire contents of which are incorporated herein by reference.