The present invention relates to a speaker diarization method, a speaker diarization device, and a speaker diarization program.
In recent years, expectations have been high for speaker diarization techniques, which take an acoustic signal as an input and identify the speech sections of all speakers included in the acoustic signal. A speaker diarization technique enables various applications, such as automatic transcription that records who spoke and when in a conference, and automatic extraction of speech between an operator and a customer from calls in a contact center.
In the related art, a technique called end-to-end neural diarization (EEND) based on deep learning has been disclosed as a speaker diarization technique (refer to NPL 1). In the EEND, an acoustic signal is divided into frames, and a speaker label indicating whether each speaker is speaking in a frame is estimated for every frame from the acoustic features extracted from that frame. When the maximum number of speakers in the acoustic signal is S, the speaker label for each frame is an S-dimensional vector whose element for a speaker is 1 when that speaker is speaking in the frame and 0 when that speaker is not speaking. That is to say, the EEND implements speaker diarization by performing multi-label binary classification over the S speakers.
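As a brief illustration of this label format (the values below are hypothetical and only show the representation, not data from NPL 1), the per-frame labels for a T-frame signal with S speakers form a T×S binary matrix:

```python
import numpy as np

# Hypothetical example: T = 6 frames, S = 2 speakers.
# Row t is the S-dimensional speaker label of frame t:
# element s is 1 if speaker s is speaking in frame t, and 0 otherwise.
T, S = 6, 2
labels = np.zeros((T, S), dtype=np.int64)
labels[0:4, 0] = 1   # speaker 0 speaks in frames 0-3
labels[2:6, 1] = 1   # speaker 1 speaks in frames 2-5 (overlap in frames 2-3)

print(labels)
# [[1 0]
#  [1 0]
#  [1 1]
#  [1 1]
#  [0 1]
#  [0 1]]
```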
The EEND model used for estimating the speaker label sequence for each frame in the EEND is a deep learning-based model composed of layers through which errors can be backpropagated, and it can estimate the speaker label sequence for all frames from the acoustic feature sequence at once. The EEND model includes a recurrent neural network (RNN) layer which performs time-series modeling. As a result, in the EEND, the speaker label for each frame can be estimated using the acoustic features of not only the current frame but also the surrounding frames. A bidirectional long short-term memory (LSTM) RNN or a Transformer encoder is used in this RNN layer.
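The following minimal sketch (written in PyTorch with hypothetical layer sizes; it is not the exact architecture of NPL 1) shows the general shape of such a model: a bidirectional LSTM reads the entire acoustic feature sequence, and a per-frame linear layer with a sigmoid outputs S posterior probabilities per frame.

```python
import torch
import torch.nn as nn

class OfflineEENDSketch(nn.Module):
    """Sketch of an EEND-style model; the sizes are assumptions, not NPL 1's configuration."""
    def __init__(self, feat_dim=24, hidden=128, num_speakers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_speakers)

    def forward(self, feats):                 # feats: (batch, T, feat_dim)
        h, _ = self.blstm(feats)              # refers to the whole sequence (offline)
        return torch.sigmoid(self.out(h))     # (batch, T, S) per-frame posteriors

model = OfflineEENDSketch()
posteriors = model(torch.randn(1, 100, 24))   # 100 frames of 24-dimensional features
print(posteriors.shape)                       # torch.Size([1, 100, 2])
```

Because the bidirectional LSTM consumes the whole sequence before any output is produced, the sketch also makes visible the limitation discussed below.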
Note that NPL 2 describes the RNN Transducer. In addition, NPL 3 describes acoustic features.
However, on-line speaker diarization is difficult in the related art. Since the EEND model in the related art uses a bidirectional LSTM-RNN or a Transformer, which refers to the entire acoustic feature sequence, it cannot output speaker labels until the whole signal is available, and it is therefore difficult to achieve on-line speaker diarization.
The present invention was made in view of the above description, and an object of the present invention is to perform on-line speaker diarization.
In order to solve the above-described problems and achieve the object, a speaker diarization method according to the present invention includes: an extraction step of extracting a speaker vector representing speaker features of each frame using an acoustic feature sequence for each frame of the most recent acoustic signal; and a learning step of generating a model for estimating a speaker label of the speaker vector of each frame by performing learning using the speaker vector and a speaker label representing a speaker of the extracted speaker vector.
According to the present invention, on-line speaker diarization becomes possible.
An embodiment of the present invention will be described in detail below with reference to the drawings. Note that the present invention is not limited by this embodiment. Moreover, in the description provided with reference to the drawings, the same constituent elements will be denoted by the same reference numerals.
The online EEND model 14a has a speaker feature extraction block, a speaker feature update block, and a speaker label estimation block. Here, the speaker feature extraction block uses the acoustic features of each of the (t-N)th to tth frames to extract a speaker vector representing the features of the speaker in the tth frame. Note that, in the example shown in the drawings, the speaker feature extraction block includes a Linear (fully connected) layer and an RNN layer.
The speaker feature update block vector-connects and stores the speaker vector of the tth frame and the estimated speaker label output for this speaker vector by the speaker label estimation block, which will be described later. Furthermore, the speaker feature update block updates the parameters of a model which, in response to an input vector obtained by vector-connecting the stored speaker vector and the estimation value of the speaker label, outputs a speaker vector carrying information that identifies the speaker as the new stored speaker vector.
The speaker label estimation block uses the speaker vector of the tth frame and the stored speaker vector to output the speaker label estimation value for the tth frame.
In this way, the speaker diarization device estimates speaker labels frame by frame using the online EEND model 14a having an autoregressive structure. This allows the speaker diarization device to estimate speaker labels while updating the stored speaker vectors each time a frame is input. Therefore, it is possible to realize on-line speaker diarization.
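A minimal sketch of how these three blocks could be combined is shown below; the layer types and sizes are assumptions for illustration (the drawings define the actual configuration), and the stored speaker vector is treated as a recurrent state that is updated from its previous value and the latest label estimate.

```python
import torch
import torch.nn as nn

class OnlineEENDSketch(nn.Module):
    """Sketch of the three-block online model; layers and sizes are assumed, not the patented config."""
    def __init__(self, feat_dim=24, spk_dim=128, num_speakers=2):
        super().__init__()
        # Speaker feature extraction block: Linear + RNN over frames t-N .. t.
        self.extract_linear = nn.Linear(feat_dim, spk_dim)
        self.extract_rnn = nn.GRU(spk_dim, spk_dim, batch_first=True)
        # Speaker label estimation block: speaker vector + stored vector -> S label estimates.
        self.estimate = nn.Linear(2 * spk_dim, num_speakers)
        # Speaker feature update block: stored vector (+) label estimate -> new stored vector.
        self.update = nn.GRUCell(spk_dim + num_speakers, spk_dim)

    def step(self, recent_feats, stored_vec):
        # recent_feats: (1, N+1, feat_dim) acoustic features of the (t-N)th to tth frames.
        h, _ = self.extract_rnn(torch.tanh(self.extract_linear(recent_feats)))
        spk_vec = h[:, -1]                                        # speaker vector of frame t
        label = torch.sigmoid(self.estimate(torch.cat([spk_vec, stored_vec], dim=-1)))
        # Autoregressive update of the stored speaker vector from its previous value and the label estimate.
        stored_vec = self.update(torch.cat([stored_vec, label], dim=-1), stored_vec)
        return label, stored_vec

model = OnlineEENDSketch()
stored = torch.zeros(1, 128)                  # initial stored speaker vector
feats = torch.randn(1, 50, 24)                # 50 frames of 24-dimensional features
for t in range(feats.size(1)):
    recent = feats[:, max(0, t - 9): t + 1]   # most recent frames (N = 9 here, an assumption)
    label, stored = model.step(recent, stored)
```

Because each step uses only the most recent frames and the stored speaker vector, a label estimate becomes available as soon as each frame arrives.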
The input unit 11 is implemented using an input device such as a keyboard and a mouse and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to an input operation by an operator. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is implemented by a network interface card (NIC) or the like and controls communication between the control unit 15 and an external device such as a server or a device which acquires an acoustic signal over a network.
The storage unit 14 is implemented by a semiconductor memory device such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk or an optical disc. Note that the storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In the embodiment, the storage unit 14 stores, for example, the online EEND model 14a used for speaker diarization processing which will be described later.
The control unit 15 is implemented using a central processing unit (CPU), a network processor (NP), a field programmable gate array (FPGA), or the like and executes a processing program stored in a memory. As a result, the control unit 15 functions as an acoustic feature extraction unit 15a, a speaker vector extraction unit 15b, a speaker label generation unit 15c, a learning unit 15d, an estimation unit 15e, and a speech section estimation unit 15f.
The acoustic feature extraction unit 15a extracts an acoustic feature for each frame of the acoustic signal including the speech of the speakers. For example, the acoustic feature extraction unit 15a receives an input of an acoustic signal via the input unit 11 or, via the communication control unit 13, from a device which acquires an acoustic signal. Furthermore, the acoustic feature extraction unit 15a divides the acoustic signal into frames, extracts an acoustic feature vector from each frame by applying a discrete Fourier transform or filter bank multiplication to the signal of that frame, and outputs an acoustic feature sequence obtained by concatenating these vectors in the frame direction. In the embodiment, the frame length is 25 ms and the frame shift width is 10 ms.
Here, although the acoustic feature vector is, for example, a 24-dimensional Mel-frequency cepstral coefficient (MFCC) vector, the present invention is not limited to this. The acoustic feature vector may be another per-frame acoustic feature such as a Mel filter bank output.
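As one possible realization of this step (librosa is an assumed tool here; the embodiment only specifies the 25 ms frame length, the 10 ms shift, and 24-dimensional MFCCs), the acoustic feature sequence could be computed as follows:

```python
import librosa

# "speech.wav" is a placeholder file name.
signal, sr = librosa.load("speech.wav", sr=16000)
frame_length = int(0.025 * sr)    # 25 ms frame length -> 400 samples at 16 kHz
hop_length = int(0.010 * sr)      # 10 ms frame shift  -> 160 samples

mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=24,
                            n_fft=frame_length, hop_length=hop_length)
features = mfcc.T                 # acoustic feature sequence of shape (T, 24), one row per frame
print(features.shape)
```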
The speaker vector extraction unit 15b extracts a speaker vector representing the speaker feature of each frame using the acoustic feature sequence for each frame of the latest acoustic signal. Specifically, the speaker vector extraction unit 15b generates a speaker vector by inputting the acoustic feature sequence acquired from the acoustic feature extraction unit 15a to the speaker feature extraction block shown in the drawings.
Note that the speaker vector extraction unit 15b may be included in a learning unit 15d and an estimation unit 15e which will be described later.
The speaker label generation unit 15c uses the acoustic feature sequence to generate a speaker label for each frame, as described below.
Here, when the number of speakers is S (speaker 1, speaker 2, . . . , speaker S), the speaker label of the tth frame (t=0, 1, . . . , T) is an S-dimensional vector. For example, when the frame at time t × (frame shift width) is included in the speech section of a speaker, the value of the dimension corresponding to that speaker is 1 and the values of the other dimensions are 0. Therefore, the speaker labels for all frames form a T×S-dimensional binary {0, 1} multi-label.
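A short sketch of this label generation, assuming hypothetical (speaker index, start, end) annotations of the speech sections, is shown below:

```python
import numpy as np

def make_speaker_labels(segments, num_frames, num_speakers, frame_shift=0.010):
    """Convert annotated speech sections into a (T, S) binary label matrix.

    `segments` is a hypothetical list of (speaker_index, start_sec, end_sec) tuples;
    a frame is labeled 1 for a speaker if its time (t * frame_shift) falls inside
    that speaker's speech section.
    """
    labels = np.zeros((num_frames, num_speakers), dtype=np.int64)
    frame_times = np.arange(num_frames) * frame_shift
    for spk, start, end in segments:
        labels[(frame_times >= start) & (frame_times < end), spk] = 1
    return labels

# Two speakers, 3 seconds of audio at a 10 ms frame shift (300 frames).
segments = [(0, 0.0, 1.5), (1, 1.0, 3.0)]   # hypothetical annotations
labels = make_speaker_labels(segments, num_frames=300, num_speakers=2)
print(labels.shape, labels.sum(axis=0))      # (300, 2) [150 200]
```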
The learning unit 15d generates the online EEND model 14a for estimating the speaker label of the speaker vector of each frame by performing learning using the speaker vector and the speaker label representing the speaker of the extracted speaker vector.
Here, the online EEND model 14a is composed of a plurality of layers including the RNN layer, as shown in the drawings.
Furthermore, the online EEND model 14a outputs the posterior probability of the speaker label for each frame, in T×S dimensions. The learning unit 15d optimizes the parameters of each layer of the online EEND model 14a through error backpropagation, using as the loss function the multi-label binary cross entropy between the posterior probability of the speaker label for each frame and the speaker label for each frame. The learning unit 15d uses an online optimization algorithm based on stochastic gradient descent for parameter optimization.
That is to say, the learning unit 15d vector-connects and stores the speaker vector of the tth frame, which the speaker vector extraction unit 15b (the speaker feature extraction block) extracts using the acoustic features of each of the (t-N)th to tth frames of the training data, and the speaker label estimation value estimated for this speaker vector by the speaker label estimation block. Furthermore, the learning unit 15d inputs a vector obtained by vector-connecting the stored speaker vector and the estimation value of the speaker label into the speaker feature update block and updates the parameters of the model which outputs the stored speaker vector including the information identifying the speaker. In addition, the learning unit 15d inputs the speaker vector of the tth frame and the stored speaker vector to the speaker label estimation block and updates the parameters of the model that outputs the estimation value of the speaker label of the tth frame.
Thus, the learning unit 15d generates the online EEND model 14a using a plurality of stored combinations of speaker vectors and the speaker labels estimated for those speaker vectors. This makes it possible to estimate the speaker label while updating the stored speaker vector each time a frame is input.
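A hedged sketch of one such training step, reusing the OnlineEENDSketch class from the earlier sketch and assuming plain SGD as the concrete optimizer, is given below; the shapes and values are illustrative only.

```python
import torch
import torch.nn as nn

model = OnlineEENDSketch()                        # sketch model defined earlier (S = 2 speakers)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.BCELoss()                          # multi-label binary cross entropy

T = 50
feats = torch.randn(1, T, 24)                     # acoustic feature sequence (illustrative)
labels = torch.randint(0, 2, (1, T, 2)).float()   # reference T x S speaker labels (illustrative)

# Propagate frame by frame and collect the per-frame posterior probabilities.
stored = torch.zeros(1, 128)
posteriors = []
for t in range(T):
    recent = feats[:, max(0, t - 9): t + 1]
    label_t, stored = model.step(recent, stored)
    posteriors.append(label_t)
posteriors = torch.stack(posteriors, dim=1)       # (1, T, S)

loss = criterion(posteriors, labels)              # loss between posteriors and reference labels
optimizer.zero_grad()
loss.backward()                                   # backpropagate errors through every layer
optimizer.step()                                  # one stochastic-gradient-descent update
```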
The estimation unit 15e estimates the speaker label for each frame of the acoustic signal using the generated online EEND model 14a.
Since the online EEND model 14a has an autoregressive structure, the speaker label posterior probability (the estimation value of the speaker label) for each frame of the acoustic feature sequence is output by sequentially propagating the acoustic feature sequence frame by frame from the first frame.
The speech section estimation unit 15f uses the output speaker label posterior probabilities to estimate the speech sections of the speakers in the acoustic signal. Specifically, the speech section estimation unit 15f estimates the speaker label using a moving average over a plurality of frames. That is to say, the speech section estimation unit 15f first calculates, for each frame, a moving average of the speaker label posterior probability over a window of length 6 consisting of the current frame and the five frames immediately preceding it. This makes it possible to prevent erroneous detection of impractically short speech sections, such as speech lasting only one frame.
Subsequently, when the calculated moving average value for a dimension is greater than 0.5, the speech section estimation unit 15f estimates that the frame belongs to the speech section of the speaker corresponding to that dimension. Moreover, for each speaker, the speech section estimation unit 15f regards a group of continuous speech-section frames as one speech and calculates the start time and the end time of the speech section up to a predetermined time from those frames. Thus, the speech start time and the speech end time up to a predetermined time can be obtained for each speech of each speaker.
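The post-processing described above can be sketched as follows; the section format (speaker index with start and end times in seconds) is an assumption for illustration.

```python
import numpy as np

def estimate_speech_sections(posteriors, frame_shift=0.010, window=6, threshold=0.5):
    """Smooth per-frame posteriors with a trailing moving average of `window` frames,
    threshold at 0.5, and merge continuous frames into (speaker, start_sec, end_sec) sections."""
    T, S = posteriors.shape
    sections = []
    for s in range(S):
        # Moving average over the current frame and the five frames immediately preceding it.
        smoothed = np.array([posteriors[max(0, t - window + 1): t + 1, s].mean()
                             for t in range(T)])
        active = smoothed > threshold
        t = 0
        while t < T:
            if active[t]:
                start = t
                while t < T and active[t]:
                    t += 1
                sections.append((s, start * frame_shift, t * frame_shift))
            else:
                t += 1
    return sections

# Hypothetical posteriors for 10 frames and 2 speakers.
post = np.array([[0.9, 0.1]] * 6 + [[0.2, 0.8]] * 4)
print(estimate_speech_sections(post))
```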
The speaker diarization processing by the speaker diarization device 10 will be described below.
First, the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal including speech of a speaker and outputs an acoustic feature sequence (Step S1).
Subsequently, the speaker vector extraction unit 15b extracts a speaker vector representing the speaker feature of each frame using the acoustic feature sequence for each frame of the latest acoustic signal (Step S2).
Furthermore, the learning unit 15d generates, through learning using the speaker vector and the speaker label representing the speaker of the extracted speaker vector, the online EEND model 14a having an autoregressive structure for estimating the speaker label of the speaker vector of each frame (Step S3). This completes the series of learning processing.
Subsequently, the estimation processing by the speaker diarization device 10 will be described.
First, the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal including speech of a speaker and outputs an acoustic feature sequence (Step S1).
Also, the speaker vector extraction unit 15b extracts a speaker vector representing the speaker feature of each frame using the acoustic feature sequence for each frame of the latest acoustic signal (Step S2).
Subsequently, the estimation unit 15e uses the generated online EEND model 14a to estimate the speaker label for each frame of the acoustic signal (Step S4). Specifically, the estimation unit 15e outputs the speaker label posterior probability (estimation value of the speaker label) for each frame of the acoustic feature sequence.
Furthermore, the speech section estimation unit 15f uses the output speaker label posterior probability to estimate the speaker's speech section in the acoustic signal (Step S5). This completes the series of estimation processing.
As described above, in the speaker diarization device 10 of the embodiment, the speaker vector extraction unit 15b uses the acoustic feature sequence for each frame of the latest acoustic signal to extract a speaker vector representing the speaker feature of each frame. Also, the learning unit 15d generates the online EEND model 14a for estimating the speaker label of the speaker vector of each frame through learning using the speaker vector and the speaker label representing the speaker of the extracted speaker vector.
Thus, the speaker diarization device 10 can estimate a speaker label each time a frame is input by using the online EEND model 14a having an autoregressive structure. Therefore, it is possible to realize on-line speaker diarization.
In addition, the learning unit 15d generates the online EEND model 14a using a plurality of stored combinations of speaker vectors and the speaker labels estimated for those speaker vectors. This enables the speaker diarization device 10 to estimate speaker labels while updating the stored speaker vectors each time a frame is input. Therefore, on-line speaker diarization can be realized with higher accuracy.
Also, the estimation unit 15e estimates the speaker label for each frame of the acoustic signal using the generated online EEND model 14a. This enables on-line speaker diarization.
Also, the speech section estimation unit 15f estimates the speaker label using the moving average of a plurality of frames. This makes it possible to prevent erroneous detection of impractically short speech sections.
It is also possible to create a program in which the processing executed by the speaker diarization device 10 according to the above embodiment is described in a computer-executable language. As one embodiment, the speaker diarization device 10 can be implemented by installing, on a desired computer, a speaker diarization program for executing the above-described speaker diarization processing as package software or online software. For example, an information processing device can function as the speaker diarization device 10 by causing the information processing device to execute the speaker diarization program. Such information processing devices include mobile communication terminals such as smartphones, mobile phones, and personal handyphone system (PHS) terminals, as well as slate terminals such as personal digital assistants (PDAs). Also, the functions of the speaker diarization device 10 may be implemented in a cloud server.
The memory 1010 includes a read only memory (ROM) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a basic input output system (BIOS). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is, for example, inserted into the disk drive 1041. A mouse 1051 and a keyboard 1052 are, for example, connected to the serial port interface 1050. A display 1061 is, for example, connected to the video adapter 1060.
Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
Also, the speaker diarization program is stored on the hard disk drive 1031 as, for example, the program module 1093 in which instructions to be executed by a computer 1000 are described. Specifically, the hard disk drive 1031 stores a program module 1093 in which each processing executed by the speaker diarization device 10 described in the above embodiment is described.
Also, data used for information processing by the speaker diarization program is stored, for example, as the program data 1094 in the hard disk drive 1031. Furthermore, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary and performs each procedure described above.
Note that the program module 1093 and the program data 1094 relating to the speaker diarization program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 relating to the speaker diarization program may be stored in another computer connected over a network such as a local area network (LAN) or a wide area network (WAN) and read by the CPU 1020 via the network interface 1070.
Although the embodiments to which the invention made by the present inventor is applied have been described above, the present invention is not limited by the descriptions and drawings forming a part of the disclosure of the present invention according to the embodiments. That is to say, other embodiments, examples, operation techniques, and the like made by those skilled in the art on the basis of the embodiment are all included in the scope of the present invention.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2020/046117 | 12/10/2020 | WO |