The present invention relates to a speaker diarization method, a speaker diarization apparatus, and a speaker diarization program.
In recent years, a speaker diarization technique which accepts an acoustic signal as input and which identifies utterance sections of all speakers included in the acoustic signal has been anticipated. The speaker diarization technique can be applied in various ways, such as automatic transcription which records who spoke and when in a conference, and automatic segmentation of utterances between an operator and a customer in a call at a contact center. Conventionally, a technique called EEND (End-to-End Neural Diarization) based on deep learning has been disclosed as a speaker diarization technique (refer to NPL 1). In EEND, an acoustic signal is divided into frames, and a speaker label representing whether or not a specific speaker exists in a frame is estimated for each frame from an acoustic feature extracted from the frame. When the maximum number of speakers in the acoustic signal is denoted by S, the speaker label for each frame is an S-dimensional vector which takes a value of 1 when a certain speaker speaks in the frame and a value of 0 when the speaker does not speak in the frame. In other words, in EEND, speaker diarization is realized by performing multi-label binary classification as many times as the number of speakers.
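The following is a minimal illustrative sketch (not part of the disclosed method; all values are hypothetical) of the per-frame speaker label format described above, showing that diarization reduces to S parallel per-frame binary decisions.

```python
import numpy as np

# Hypothetical example: S = 3 speakers, 5 frames.
# Each row is the S-dimensional binary speaker label of one frame:
# a 1 in column s means speaker s is speaking in that frame.
labels = np.array([
    [1, 0, 0],  # only speaker 1 speaks
    [1, 0, 0],
    [1, 1, 0],  # speakers 1 and 2 overlap
    [0, 1, 0],
    [0, 0, 0],  # silence
])

# EEND treats each column as an independent binary classification task.
print(labels.shape)  # (5, 3) = (frames, speakers)
```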
An EEND model used for estimating a speaker label sequence for each frame in EEND is a deep learning-based model which is made up of layers capable of backpropagation and which enables the speaker label sequence for each frame to be estimated collectively from an acoustic feature sequence. The EEND model includes an RNN (Recurrent Neural Network) layer for performing time-series modeling. Accordingly, in EEND, the speaker label for each frame can be estimated by using the acoustic features of not only the frame itself but also its surrounding frames. A bidirectional LSTM (Long Short-Term Memory)-RNN (BLSTM-RNN) or a Transformer encoder is used for the RNN layer.
However, in the prior art, it has been difficult to perform speaker diarization with respect to a long acoustic signal with high accuracy. In other words, since it is difficult for the RNN layer in a conventional EEND model to handle a very long acoustic feature sequence, when a very long acoustic signal is input, there is a possibility that errors in speaker diarization may increase.
For example, when a BLSTM-RNN is used as the RNN layer, the BLSTM-RNN uses the internal states of an input frame and its adjacent frames to estimate the speaker label of the input frame. Therefore, the farther a frame is from the input frame, the more difficult it becomes to use the acoustic feature of that frame when estimating the speaker label.
In addition, when a Transformer encoder is used as the RNN layer, the EEND model is trained to estimate which frames contain information useful for estimating the speaker label of a given frame. Therefore, as the acoustic feature sequence becomes longer, the number of candidate frames increases, making it more difficult to estimate the speaker label.
The present invention has been devised in view of the foregoing circumstances and an object thereof is to perform speaker diarization with respect to a long acoustic signal with high accuracy.
In order to solve the problem and achieve the object described above, a speaker diarization method according to the present invention includes the steps of: dividing a sequence of acoustic features for each frame of an acoustic signal into segments of a predetermined length and generating an array in which a plurality of divided segments in a row direction are arranged in a column direction; and generating by learning, using the array, a model for estimating a speaker label of a speaker vector of each frame.
According to the present invention, speaker diarization with respect to a long acoustic signal can be performed with high accuracy.
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited by the present embodiment. Furthermore, in the description of the drawings, the same parts are denoted by the same reference signs.
[Overview of Speaker Diarization Apparatus]
Specifically, the speaker diarization apparatus divides a two-dimensional acoustic feature sequence of T-number of frames×D-number of dimensions into segments of L-number of frames with a shift width of N-number of frames. In addition, with each segment forming a row, the rows are stacked in the column direction so that their heads are aligned, generating a three-dimensional acoustic feature array of (T-L)/N-number of rows×L-number of columns×D-number of dimensions.
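A minimal sketch of this segmentation is shown below, assuming the sizes given in the description; how leftover frames at the end of the sequence are handled is an assumption here (they are simply dropped).

```python
import numpy as np

def to_feature_array(features, seg_len, shift):
    """Stack overlapping segments of a (T, D) feature sequence into a
    ((T - seg_len) // shift, seg_len, D) array, following the description
    above. Dropping trailing frames that do not fill a segment is an
    assumption of this sketch."""
    T, D = features.shape
    n_rows = (T - seg_len) // shift
    rows = [features[i * shift : i * shift + seg_len] for i in range(n_rows)]
    return np.stack(rows, axis=0)

# Hypothetical sizes: T = 1000 frames, D = 24 dimensions, L = 100, N = 50.
x = np.random.randn(1000, 24).astype(np.float32)
array = to_feature_array(x, seg_len=100, shift=50)
print(array.shape)  # (18, 100, 24) = (rows, columns, dimensions)
```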
A row-oriented RNN layer that performs RNN processing on each row is applied to the array generated in this manner, and a hidden layer output is obtained using the acoustic feature sequence in each segment. Subsequently, a column-oriented RNN layer that performs RNN processing on each column is applied to the array to obtain a hidden layer output sequence that straddles a plurality of segments, yielding an embedded sequence used to estimate the speaker label for each frame. In addition, the rows of the per-frame embedded sequence are overlap-added to obtain a speaker label embedded sequence for each of the T-number of frames. Thereafter, the speaker diarization apparatus obtains a speaker label sequence for each frame using a Linear layer and a sigmoid layer.
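The following is a minimal sketch of the row-oriented and column-oriented RNN layers described above, written with PyTorch; the hidden size, the single-layer configuration, and the number of speakers are assumptions, and the overlap addition over rows is omitted here (it is described later).

```python
import torch
import torch.nn as nn

class RowColumnRNN(nn.Module):
    """Sketch of row-/column-oriented BLSTM-RNN layers over the
    (rows, L, D) acoustic feature array; sizes are assumptions."""

    def __init__(self, feat_dim=24, hidden=128, n_speakers=2):
        super().__init__()
        self.row_rnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                               bidirectional=True)   # within each segment (row)
        self.col_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True,
                               bidirectional=True)   # across segments (columns)
        self.linear = nn.Linear(2 * hidden, n_speakers)

    def forward(self, x):
        # x: (rows, L, D) acoustic feature array from the array generating unit
        rows, L, D = x.shape
        h, _ = self.row_rnn(x)                # local context: (rows, L, 2*hidden)
        h = h.transpose(0, 1)                 # (L, rows, 2*hidden): each column is a sequence over segments
        h, _ = self.col_rnn(h)                # global context across segments
        h = h.transpose(0, 1)                 # back to (rows, L, 2*hidden)
        # Per-frame speaker label posterior within each row; overlap addition
        # over rows is omitted in this sketch.
        return torch.sigmoid(self.linear(h))  # (rows, L, n_speakers)
```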
In this manner, by applying the row-oriented RNN layer, the speaker diarization apparatus can perform speaker diarization using local contextual information. In this case, the same speaker label tends to be output in adjacent frames. Furthermore, by applying the column-oriented RNN layer, the speaker diarization apparatus can perform speaker diarization using global contextual information. Accordingly, utterances by the same speaker that are separated in time can be handled as objects of speaker diarization.
[Configuration of Speaker Diarization Apparatus]
The input unit 11 is implemented using an input device such as a keyboard or a mouse and receives various types of instruction information such as a processing start instruction for the control unit 15 in accordance with input operations performed by an operator. The output unit 12 is implemented by a display apparatus such as a liquid crystal display, a printing apparatus such as a printer, an information communication apparatus, or the like. The communication control unit 13 is implemented by an NIC (Network Interface Card) or the like and controls communication between an external apparatus such as a server or an apparatus that acquires an acoustic signal and the control unit 15 via a network.
The storage unit 14 is implemented by a semiconductor memory device such as a RAM (Random Access Memory) or a flash memory, or a storage apparatus such as a hard disk or an optical disc. Note that the storage unit 14 may also be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, a speaker diarization model 14a or the like used for speaker diarization processing to be described later.
The control unit 15 is implemented by using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like and executes a processing program stored in a memory. Accordingly, as illustrated in
The acoustic feature extracting unit 15a extracts an acoustic feature for each frame of an acoustic signal including an utterance by a speaker. For example, the acoustic feature extracting unit 15a receives input of an acoustic signal via the input unit 11 or via the communication control unit 13 from an apparatus or the like that acquires the acoustic signal. In addition, the acoustic feature extracting unit 15a divides the acoustic signal into frames, extracts an acoustic feature vector by performing a discrete Fourier transform or filter bank multiplication on the signal of each frame, and outputs an acoustic feature sequence obtained by concatenating the vectors in the frame direction. In the present embodiment, the frame length is assumed to be 25 ms and the frame shift width is assumed to be 10 ms.
While the acoustic feature vector in this case is, for example, a 24-dimensional MFCC (Mel Frequency Cepstral Coefficient) vector, the acoustic feature vector is not limited thereto and may be another per-frame acoustic feature such as a mel filter bank output.
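A minimal sketch of such feature extraction is shown below using librosa; the embodiment does not prescribe a specific library, and the sampling rate, FFT size, and file name are assumptions.

```python
import librosa
import numpy as np

# Hypothetical example: 16 kHz input, 25 ms frame length, 10 ms frame shift,
# 24-dimensional MFCCs, as in the present embodiment.
signal, sr = librosa.load("input.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=24,
                            n_fft=512,
                            win_length=int(0.025 * sr),   # 400 samples = 25 ms
                            hop_length=int(0.010 * sr))   # 160 samples = 10 ms

# librosa returns (n_mfcc, T); transpose to the (T, D) frame-direction layout.
features = mfcc.T.astype(np.float32)
print(features.shape)
```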
The array generating unit 15b divides a sequence of acoustic features for each frame of the acoustic signal into segments of a predetermined length and generates an array in which a plurality of divided segments in the row direction are arranged in the column direction. Specifically, the array generating unit 15b divides an input two-dimensional acoustic feature sequence into segments and converts the segments into a three-dimensional acoustic feature array as shown in
The array generating unit 15b may be included in the learning unit 15d and the estimating unit 15e to be described later. For example,
The speaker label generating unit 15c uses an acoustic feature sequence to generate a speaker label of each frame. Specifically, as shown in
When there are S-number of speakers (speaker 1, speaker 2, . . . , speaker S), the speaker label of the t-th frame (t=0, 1, . . . , T) is an S-dimensional vector. For example, when the frame at time point t×frame shift width is included in an utterance section of any speaker, the value of the dimension corresponding to the speaker is 1 and the values of the other dimensions are 0. Therefore, the speaker label for each frame is a T×S-dimensional binary {0, 1} multi-label.
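The following is a minimal sketch of constructing such a T×S binary multi-label matrix from per-speaker utterance sections; the section format (speaker index, start time, end time in seconds) is a hypothetical input representation.

```python
import numpy as np

def make_frame_labels(utterances, n_frames, n_speakers, frame_shift=0.010):
    """Build the (T, S) binary multi-label matrix described above from
    per-speaker utterance sections given in seconds (hypothetical format)."""
    labels = np.zeros((n_frames, n_speakers), dtype=np.int64)
    for speaker, start, end in utterances:
        t0 = int(start / frame_shift)
        t1 = int(end / frame_shift)
        labels[t0:t1, speaker] = 1   # the speaker is active in these frames
    return labels

# Hypothetical sections: (speaker index, start [s], end [s]); overlap allowed.
sections = [(0, 0.0, 2.5), (1, 2.0, 4.0)]
y = make_frame_labels(sections, n_frames=500, n_speakers=2)
print(y.shape)  # (500, 2); frames around 2.0-2.5 s carry two active labels
```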
Let us return to the description of
In addition, the speaker diarization model 14a has an overlap addition layer. As shown in
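A minimal sketch of the overlap addition over rows is shown below; averaging the overlapping contributions by the overlap count is an assumption of this sketch, as is the embedding size.

```python
import numpy as np

def overlap_add(row_embeddings, shift, n_frames):
    """Sketch of the overlap addition layer: each row covers L frames starting
    at row_index * shift, and overlapping contributions are added
    (normalization by the overlap count is an assumption)."""
    rows, L, E = row_embeddings.shape
    out = np.zeros((n_frames, E), dtype=row_embeddings.dtype)
    count = np.zeros((n_frames, 1), dtype=row_embeddings.dtype)
    for r in range(rows):
        start = r * shift
        out[start:start + L] += row_embeddings[r]
        count[start:start + L] += 1.0
    return out / np.maximum(count, 1.0)

# Hypothetical: 18 rows of L = 100 frames, shift N = 50, embedding size 256.
emb = np.random.randn(18, 100, 256).astype(np.float32)
frame_emb = overlap_add(emb, shift=50, n_frames=1000)
print(frame_emb.shape)  # (1000, 256): one embedding per frame of the T frames
```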
Furthermore, the speaker diarization model 14a has a Linear layer for performing linear transformation and a sigmoid layer for applying a sigmoid function. As shown in
Using, as a loss function, the multi-label binary cross entropy between the posterior probability of the speaker label for each frame and the speaker label for each frame, the learning unit 15d optimizes the parameters of the Linear layer, the row-oriented BLSTM-RNN layer, and the column-oriented BLSTM-RNN layer of the speaker diarization model 14a by backpropagation. The learning unit 15d uses an online optimization algorithm based on stochastic gradient descent to optimize the parameters.
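A minimal training-step sketch is shown below; it assumes the RowColumnRNN sketch above as the model producing per-frame posteriors, and the choice of Adam as the stochastic-gradient-based optimizer, the learning rate, and the tensor sizes are assumptions.

```python
import torch
import torch.nn as nn

model = RowColumnRNN(feat_dim=24, hidden=128, n_speakers=2)  # sketch defined above
criterion = nn.BCELoss()                    # multi-label binary cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # assumed optimizer

features = torch.randn(18, 100, 24)                   # (rows, L, D) feature array
targets = torch.randint(0, 2, (18, 100, 2)).float()   # (rows, L, S) binary labels

posterior = model(features)        # per-frame speaker label posteriors in (0, 1)
loss = criterion(posterior, targets)
optimizer.zero_grad()
loss.backward()                    # backpropagation through all layers
optimizer.step()
print(float(loss))
```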
In this way, the learning unit 15d generates the speaker diarization model 14a including an RNN for processing the array in the row direction and an RNN for processing the array in the column direction. Accordingly, speaker diarization using local contextual information and speaker diarization using global contextual information can be performed. Therefore, the learning unit 15d can learn to treat utterances of the same speaker that are separated in time as objects of speaker diarization.
Let us now return to the description of
Next, when the value of the calculated moving average is larger than 0.5, the utterance section estimating unit 15f estimates that the frame belongs to an utterance section of the speaker corresponding to that dimension. In addition, the utterance section estimating unit 15f regards a group of consecutive utterance-section frames as one utterance for each speaker, and back-calculates from the frames a start time and an end time of the utterance section up to a prescribed time point. Accordingly, an utterance start time point and an utterance end time point up to the prescribed time point can be obtained for each utterance of each speaker.
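A minimal sketch of this post-processing is shown below; the moving-average window length and the input posterior values are assumptions, and times are back-calculated simply as frame index times the frame shift.

```python
import numpy as np

def posteriors_to_sections(posterior, frame_shift=0.010, window=11, threshold=0.5):
    """Sketch: smooth each speaker's posterior with a moving average,
    threshold at 0.5, and convert runs of active frames into
    (speaker, start [s], end [s]) sections. Window length is an assumption."""
    T, S = posterior.shape
    kernel = np.ones(window) / window
    sections = []
    for s in range(S):
        smoothed = np.convolve(posterior[:, s], kernel, mode="same")
        active = smoothed > threshold
        t = 0
        while t < T:
            if active[t]:
                start = t
                while t < T and active[t]:
                    t += 1
                # Back-calculate times from frame indices and the frame shift.
                sections.append((s, start * frame_shift, t * frame_shift))
            else:
                t += 1
    return sections

# Hypothetical posteriors for T = 1000 frames and S = 2 speakers.
post = np.random.rand(1000, 2)
print(posteriors_to_sections(post)[:3])
```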
[Speaker Diarization Processing]
Next, speaker diarization processing by the speaker diarization apparatus 10 will be described.
First, the acoustic feature extracting unit 15a extracts an acoustic feature for each frame of an acoustic signal including an utterance of a speaker and outputs an acoustic feature sequence (step S1).
Next, the array generating unit 15b divides a two-dimensional acoustic feature sequence for each frame of the acoustic signal into segments of a predetermined length, and generates a three-dimensional acoustic feature array in which a plurality of divided segments in a row direction are arranged in a column direction (step S2).
In addition, using the generated acoustic feature array, the learning unit 15d generates, by learning, the speaker diarization model 14a for estimating a speaker label of a speaker vector of each frame (step S3). In doing so, the learning unit 15d generates the speaker diarization model 14a including an RNN for processing the array in the row direction and an RNN for processing the array in the column direction. Accordingly, the series of learning processing is ended.
Next,
First, the acoustic feature extracting unit 15a extracts an acoustic feature for each frame of an acoustic signal including an utterance of a speaker and outputs an acoustic feature sequence (step S1).
In addition, the array generating unit 15b divides a two-dimensional acoustic feature sequence for each frame of the acoustic signal into segments of a predetermined length, and generates a three-dimensional acoustic feature array in which a plurality of divided segments in a row direction are arranged in a column direction (step S2).
Next, using the generated speaker diarization model 14a, the estimating unit 15e estimates a speaker label for each frame of the acoustic signal (step S4). Specifically, the estimating unit 15e outputs a speaker label posterior probability (an estimated value of the speaker label) for each frame of the acoustic feature sequence.
In addition, using the output speaker label posterior probability, the utterance section estimating unit 15f estimates an utterance section of a speaker in the acoustic signal (step S5). Accordingly, the series of estimation processing is ended.
As described above, in the speaker diarization apparatus 10 according to the present embodiment, the array generating unit 15b divides a sequence of acoustic features for each frame of an acoustic signal into segments of a predetermined length and generates an array in which a plurality of divided segments in the row direction are arranged in the column direction. In addition, using the generated array, the learning unit 15d generates, by learning, the speaker diarization model 14a for estimating a speaker label of a speaker vector of each frame.
Specifically, the learning unit 15d generates the speaker diarization model 14a including an RNN for processing the array in the row direction and an RNN for processing the array in the column direction. Accordingly, speaker diarization using local contextual information and speaker diarization using global contextual information can be performed. Therefore, the learning unit 15d can learn to treat utterances of the same speaker that are separated in time as objects of speaker diarization. Accordingly, the speaker diarization apparatus 10 can perform speaker diarization with respect to a long acoustic signal with high accuracy.
In addition, using the generated speaker diarization model 14a, the estimating unit 15e estimates a speaker label for each frame of the acoustic signal. Accordingly, highly-accurate speaker diarization with respect to a long acoustic signal can be performed.
Furthermore, the utterance section estimating unit 15f estimates a speaker label using a moving average of a plurality of frames. Accordingly, an erroneous detection of an unrealistically-short utterance section can be prevented.
[Program]
It is also possible to create a program that describes, in a computer-executable language, the processing executed by the speaker diarization apparatus 10 according to the embodiment described above. In an embodiment, the speaker diarization apparatus 10 can be implemented by installing, in a desired computer, a speaker diarization program for executing the speaker diarization processing described above as packaged software or online software. For example, it is possible to cause an information processing apparatus to function as the speaker diarization apparatus 10 by causing the information processing apparatus to execute the speaker diarization program described above. Additionally, mobile communication terminals such as smartphones, mobile phones, and PHSs (Personal Handyphone Systems), slate terminals such as PDAs (Personal Digital Assistants), and the like are included in the scope of information processing apparatuses. Furthermore, the functions of the speaker diarization apparatus 10 may be implemented on a cloud server.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disc drive interface 1040 is connected to a disc drive 1041. A detachable storage medium such as a magnetic disk or an optical disc is inserted into the disc drive 1041. For example, a mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050. For example, a display 1061 is connected to the video adapter 1060.
In this case, for example, the hard disk drive 1031 stores an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each of the pieces of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
In addition, for example, the speaker diarization program is stored in the hard disk drive 1031 as the program module 1093 in which commands to be executed by the computer 1000 are written. Specifically, the program module 1093 describing each type of processing executed by the speaker diarization apparatus 10 described in the above embodiment is stored in the hard disk drive 1031.
Furthermore, for example, data to be used in information processing in accordance with the speaker diarization program is stored as the program data 1094 in the hard disk drive 1031. In addition, the CPU 1020 reads out and loads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 when necessary to execute each of the above-described procedures.
Note that the program module 1093 and the program data 1094 pertaining to the speaker diarization program are not limited to being stored in the hard disk drive 1031 and, for example, may be stored in a detachable storage medium and read out by the CPU 1020 via the disc drive 1041 or the like.
Alternatively, the program module 1093 and the program data 1094 pertaining to the speaker diarization program may be stored in another computer that is connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) to be read by the CPU 1020 via the network interface 1070.
Although an embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and the drawings that form a part of the disclosure of the present invention by way of the present embodiment. In other words, other embodiments, examples, operational techniques, and the like made by those skilled in the art or the like on the basis of the present embodiment are all included in the scope of the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/046585 | 12/14/2020 | WO |