The present invention relates to a pre-training method, a pre-training apparatus, and a pre-training program.
In recent speech recognition systems using a neural network, it is possible to directly output a word sequence from a speech feature amount. For example, a training method of an End-to-End speech recognition system that directly outputs a word sequence from an acoustic feature amount has been proposed (refer to NPL 1, for example).
A method for training a neural network for speech recognition using a training method according to the recurrent neural network transducer (RNN-T) is described in the section “Recurrent Neural Network Transducer” in NPL 1. By introducing a “blank” symbol (described as “null output” in NPL 1) representing redundancy into training of an RNN-T model, the correspondence between speech and output sequences can be trained dynamically from training data, provided that only the content of the speech and the corresponding phoneme/character/subword/word sequence (not frame-by-frame) are given. That is, in training of the RNN-T model, training can be performed using a feature amount and a label whose input length T and output length U do not correspond (generally T >> U).
However, it is difficult to train the RNN-T model, which dynamically allocates phonemes/characters/subwords/words and a blank symbol to each speech frame, compared to an acoustic model of a conventional speech recognition system.
In order to solve this problem, NPL 2 proposes a pre-training method capable of stably training RNN-T. This technology uses a label of a senone (a label in a unit finer than a phoneme) sequence used for training a DNN acoustic model of a conventional speech recognition system (DNN-HMM hybrid speech recognition system). If this senone sequence is used, the position and section of each phoneme/character/subword/word can be ascertained. Input frames are then allocated evenly to each phoneme/character/subword/word, each symbol receiving the number of frames divided by the number of phonemes/characters/subwords/words.
For example, when T = 10 and U = 5, each phoneme/character/subword/word is allocated T/U = 2 frames, resulting in a frame-by-frame label of length 10. Therefore, a label of a phoneme/character/subword/word is extended to a frame-by-frame label. That is, the sequence length U of the phoneme/character/subword/word sequence is extended to the same length as the input length T.
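As an illustration of this frame extension, the following is a minimal sketch in Python (the function name and the handling of remainder frames are assumptions for illustration, not the implementation of NPL 2):

```python
def extend_labels_evenly(symbols, num_frames):
    """Stretch a symbol sequence of length U into a frame-by-frame label of length T."""
    frames_per_symbol = num_frames // len(symbols)   # e.g. T = 10, U = 5 -> 2 frames per symbol
    frame_labels = []
    for sym in symbols:
        frame_labels.extend([sym] * frames_per_symbol)
    # Assumption: any remainder frames are given to the last symbol so the length is exactly T.
    frame_labels.extend([symbols[-1]] * (num_frames - len(frame_labels)))
    return frame_labels

print(extend_labels_evenly(["a", "b", "c", "d", "e"], 10))
# ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e']
```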
For each pair of an input feature amount and such an extended frame-by-frame label, processing of the above-described intermediate feature amount extraction, output probability calculation, and model update is repeated in this order, and a model obtained after a predetermined number (typically tens of millions to hundreds of millions) of repetitions is used as a trained model.
According to this method, a label in units of frames close to the final output (each phoneme/character/subword/word) can be used, and thus stable pre-training can be performed. In addition, it has been reported that a model with higher performance than a model initialized with random numbers can be constructed by fine-tuning the pre-trained parameters according to the RNN-T loss.
In the technology described in NPL 2, a label of a senone (a label in a unit finer than a phoneme) sequence used in training of a DNN acoustic model of a conventional speech recognition system (DNN-HMM hybrid speech recognition system) is used to create a frame-by-frame label. Creating this senone sequence label requires a very high degree of linguistic expertise, which is inconsistent with the concept of End-to-End speech recognition modeling, which does not require such expertise. Further, in the method described in NPL 2, the output of the model becomes a three-dimensional tensor, and thus it is difficult to perform calculation according to cross entropy (CE) loss, and costs such as memory consumption and training time during training increase.
An object of the present invention in view of the above-described circumstances is to provide a pre-training method, a pre-training apparatus, and a pre-training program capable of generating a frame-by-frame label without using a label of a senone sequence and easily calculating CE loss.
In order to solve the above problem and achieve the object, a pre-training method according to the present invention is a training method executed by a training apparatus, including: a first conversion process of converting an input acoustic feature amount sequence into a corresponding intermediate acoustic feature amount sequence having a first length using a first conversion model to which a conversion model parameter is provided; a second conversion process of converting a correct answer symbol sequence to generate a first frame unit symbol sequence having the first length and generating a second frame unit symbol sequence having the first length by delaying the first frame unit symbol sequence by one frame; a third conversion process of converting the second frame unit symbol sequence into an intermediate character feature amount sequence having the first length using a second conversion model to which a character feature amount estimation model parameter is provided; an estimation process of performing label estimation using an estimation model to which an estimation model parameter is provided based on the intermediate acoustic feature amount sequence and the intermediate character feature amount sequence and outputting an output probability distribution of a two-dimensional matrix; and a calculation process of calculating a cross entropy (CE) loss of the output probability distribution with respect to the first frame unit symbol sequence based on the first frame unit symbol sequence and the output probability distribution.
According to the present invention, it is possible to generate a frame-by-frame label without using a label of a senone sequence and easily calculate a CE loss.
[Embodiment] Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to the present embodiment. Further, in the description of the drawings, the same parts are denoted by the same reference signs.
In the embodiment, a training apparatus for training a speech recognition model will be described. Prior to the description of the training apparatus according to the embodiment, a training apparatus according to prior art will be described as background art. The training apparatus according to the present embodiment is a pre-training apparatus for performing pre-training for satisfactory initialization of model parameters, and a pre-trained model in the training apparatus according to the present embodiment is further trained (fine-tuned according to RNN-T loss).
[Background Art]
The speech distribution expression sequence conversion unit 101 includes an encoder function for converting an input acoustic feature amount sequence X into an intermediate acoustic feature amount sequence H by a multi-stage neural network and outputs the intermediate acoustic feature amount sequence H.
The symbol distribution expression sequence conversion unit 102 converts an input symbol sequence c (length U) or a symbol sequence c (length T) into an intermediate character feature amount sequence C (length U) or an intermediate character feature amount sequence C (length T) of a corresponding continuous value, and outputs the intermediate character feature amount sequence C. The symbol distribution expression sequence conversion unit 102 has an encoder function for converting the input symbol sequence c into a one-hot vector temporarily and converting the vector into an intermediate character feature amount sequence C (length U) or an intermediate character feature amount sequence C (length T) by a multi-stage neural network.
The label estimation unit 103 receives the intermediate acoustic feature amount sequence H, the intermediate character feature amount sequence C (length U), or the intermediate character feature amount sequence C (length T) and estimates a label from the intermediate acoustic feature amount sequence H, the intermediate character feature amount sequence C (length U) or the intermediate character feature amount sequence C (length T) by a neural network. The label estimation unit 103 outputs, as an estimation result, an output probability distribution Y (three-dimensional tensor) or an output probability distribution Y (two-dimensional matrix).
Here, in processing of the label estimation unit 103, a case in which the input is the intermediate character feature amount sequence C (length U) will be described. The output probability distribution Y is obtained on the basis of formula (1).
[Math. 1]
y_{t,u} = Softmax(W_3(tanh(W_1 h_t + W_2 c_u + b)))  (1)
When the dimensions of t and u are different, the output probability distribution Y becomes a three-dimensional tensor because, in addition to t and u, there is also a dimension for the number of elements of the neural network. Specifically, at the time of adding, W1H is extended by copying the same value in the dimensional direction of U, and W2C is extended by copying the same value in the dimensional direction of T in the same manner to align the dimensions, and then the three-dimensional tensors are added to each other. Therefore, the output of the label estimation unit 103 also becomes a three-dimensional tensor.
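The following sketch, assuming PyTorch tensors and hypothetical sizes, illustrates this extension and why the result of formula (1) is a three-dimensional tensor:

```python
import torch

T, U, D, K = 10, 5, 4, 7            # hypothetical sizes: frames, symbols, hidden units, output symbols
W1H = torch.randn(T, D)             # W1 * H: one D-dimensional vector per frame t
W2C = torch.randn(U, D)             # W2 * C: one D-dimensional vector per symbol u
b = torch.randn(D)
W3 = torch.nn.Linear(D, K)

# Extend W1H by copying along the U axis and W2C by copying along the T axis, then add.
joint = W1H.unsqueeze(1) + W2C.unsqueeze(0)            # shape (T, U, D)
Y = torch.softmax(W3(torch.tanh(joint + b)), dim=-1)   # shape (T, U, K): a three-dimensional tensor
print(Y.shape)                                          # torch.Size([10, 5, 7])
```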
In addition, a case in which the input of the label estimation unit 103 is the intermediate character feature amount sequence C (length T) will be described. The output probability distribution Y is obtained on the basis of formula (2).
[Math. 2]
y_t = Softmax(W_3(tanh(W_1 h_t + W_2 c_t + b)))  (2)
When the dimensions of t and u are identical, there is no extending operation as in the case of using formula (1), and thus the output of the label estimation unit 103 becomes a two-dimensional matrix of the dimension t in the time direction and the dimension of the number of elements of the neural network.
In general, at the time of RNN-T training, training is performed according to RNN-T loss on the assumption that output becomes a three-dimensional tensor. In addition, at the time of inference, there is no extending operation, and thus the output becomes a two-dimensional matrix.
The RNN-T loss calculation unit 104 receives the output probability distribution Y (three-dimensional tensor), the symbol sequence c (length U), or a correct answer symbol sequence (length T), calculates a loss LRNN-T on the basis of formula (3), and outputs the loss LRNN-T. The loss LRNN-T may be optimized through the procedure described in “2.5 Training” in NPL 1.
[Math. 3]
log-loss = −ln Pr(y* | x)  (3)
The sequence length conversion unit 201 receives a symbol sequence c (length U) and a frame unit label sequence (senone) s with word information (denoted as “frame unit label sequence” in the drawing), and outputs a frame unit symbol sequence c′ (length T).
The output matrix extraction unit 202 receives an output probability distribution Y (three-dimensional tensor) and the frame unit symbol sequence c′ (length T) and outputs an output probability distribution Y (two-dimensional matrix). The frame unit symbol sequence c′ (length T) generated by the sequence length conversion unit 201 has time information t and symbol information c(u). The output matrix extraction unit 202 selects a vector (length K) at the corresponding position from the U×T plane of the three-dimensional tensor using this information and extracts a two-dimensional matrix of T×K (refer to the drawing).
The CE loss calculation unit 203 receives the output probability distribution Y (two-dimensional matrix) and the frame unit symbol sequence c′ (length T) and outputs a cross entropy (CE) loss LCE. The CE loss calculation unit 203 calculates the CE loss by using formula (4) for the output probability distribution Y (two-dimensional matrix of T×K) extracted by the output matrix extraction unit 202 and the frame unit symbol sequence c′ (length T) created by the sequence length conversion unit 201.
In formula (4), c′ represents an element of a matrix C′, which is 1 at a correct answer point and 0 in other cases.
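Under this description, formula (4) is the standard cross entropy −Σ_t Σ_k c′_{t,k} ln y_{t,k}. The following sketch (PyTorch, hypothetical sizes; the index handling is an assumption for illustration) shows the extraction of the two-dimensional matrix followed by the CE loss:

```python
import torch

T, U, K = 10, 5, 7
Y3 = torch.rand(T, U, K)                   # output probability distribution Y (three-dimensional tensor)
c_prime_sym = torch.randint(0, K, (T,))    # symbol information c(u): the symbol id at each frame t
c_prime_pos = torch.randint(0, U, (T,))    # time/position information: which u each frame t belongs to

# Output matrix extraction: for each frame t, select the length-K vector at position (t, u(t)).
Y2 = Y3[torch.arange(T), c_prime_pos]      # shape (T, K): two-dimensional matrix

# CE loss: the one-hot c' is 1 at the correct symbol and 0 elsewhere, so only that term survives.
L_CE = -torch.log(Y2[torch.arange(T), c_prime_sym]).sum()   # summed over frames (a mean is also common)
print(Y2.shape, L_CE.item())
```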
The training apparatus 200 updates parameters of the speech distribution expression sequence conversion unit 101, the symbol distribution expression sequence conversion unit 102, and the label estimation unit 103 using the CE loss LCE.
[Training Apparatus according to Embodiment] Next, a training apparatus according to an embodiment will be described.
The training apparatus 300 is realized, for example, by a computer or the like including a read only memory (ROM), a random access memory (RAM), a central processing unit (CPU), and the like reading a predetermined program and the CPU executing the predetermined program. The training apparatus 300 also includes a communication interface for transmitting/receiving various types of information to/from other devices connected via a network or the like. For example, the training apparatus 300 includes a network interface card (NIC) or the like and performs communication with other devices via an electric communication line such as a local area network (LAN) or the Internet. Further, the training apparatus 300 includes an input device such as a touch panel, a speech input device, a keyboard, and a mouse, and a display device such as a liquid crystal display, and receives and outputs information.
As shown in the drawing, the training apparatus 300 includes a speech distribution expression sequence conversion unit 301, a symbol distribution expression sequence conversion unit 302, a label estimation unit 303, a sequence length conversion unit 304, a CE loss calculation unit 305, and a control unit 306.
When a conversion model parameter is provided, the speech distribution expression sequence conversion unit 301 converts the input acoustic feature amount sequence X into a corresponding intermediate acoustic feature amount sequence H (length T (first length)). The speech distribution expression sequence conversion unit 301 has an encoder function for converting the input acoustic feature amount sequence X into the intermediate acoustic feature amount sequence H (length T) by a multi-stage neural network and outputting the intermediate acoustic feature amount sequence to the label estimation unit 303. The speech distribution expression sequence conversion unit 301 outputs the sequence length T of the intermediate acoustic feature amount sequence H to the sequence length conversion unit 304.
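As one possible realization, not necessarily the exact network configuration of the apparatus, the encoder function of the speech distribution expression sequence conversion unit 301 can be sketched in PyTorch as follows (layer sizes are assumptions):

```python
import torch

class SpeechEncoder(torch.nn.Module):
    """Sketch of the speech distribution expression sequence conversion unit 301."""
    def __init__(self, feat_dim=80, hidden_dim=256, num_layers=4):
        super().__init__()
        self.rnn = torch.nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    def forward(self, X):          # X: (batch, T, feat_dim) acoustic feature amount sequence
        H, _ = self.rnn(X)         # H: (batch, T, hidden_dim) intermediate acoustic feature amount sequence
        return H

X = torch.randn(1, 120, 80)        # hypothetical utterance with T = 120 frames
H = SpeechEncoder()(X)
print(H.shape)                     # torch.Size([1, 120, 256]); the length T is passed to unit 304
```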
The sequence length conversion unit 304 receives the symbol sequence c (length U), the sequence length T, and a shift width n. The sequence length conversion unit 304 outputs a frame unit symbol sequence c′ (length T) (first frame unit symbol sequence) and a frame unit symbol sequence c″ (length T) (second frame unit symbol sequence) obtained by delaying the frame unit symbol sequence c′ by one frame.
The symbol distribution expression sequence conversion unit 302 receives the frame unit symbol sequence c″ (length T) output from the sequence length conversion unit 304. The symbol distribution expression sequence conversion unit 302 converts the frame unit symbol sequence c″ into an intermediate character feature amount sequence C″ (length T) using a second conversion model to which a character feature amount estimation model parameter is provided. The symbol distribution expression sequence conversion unit 302 converts the input frame unit symbol sequence c″ (length T) into a one-hot vector once and converts the one-hot vector into the intermediate character feature amount sequence C″ (length T) by a multi-stage neural network.
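A matching sketch of the symbol distribution expression sequence conversion unit 302, using an embedding lookup (equivalent to multiplying a one-hot vector by a weight matrix) followed by a recurrent layer; the vocabulary and layer sizes are assumptions:

```python
import torch

class SymbolEncoder(torch.nn.Module):
    """Sketch of the symbol distribution expression sequence conversion unit 302."""
    def __init__(self, vocab_size=100, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, embed_dim)   # one-hot conversion + projection
        self.rnn = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, c_dprime):                    # c'': (batch, T) frame unit symbol ids
        C_dprime, _ = self.rnn(self.embed(c_dprime))
        return C_dprime                             # C'': (batch, T, hidden_dim)

c_dprime = torch.randint(0, 100, (1, 120))          # frame unit symbol sequence c'' (length T = 120)
print(SymbolEncoder()(c_dprime).shape)              # torch.Size([1, 120, 256])
```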
The label estimation unit 303 receives the intermediate acoustic feature amount sequence H (length T) output from the speech distribution expression sequence conversion unit 301 and the intermediate character feature amount sequence C″ (length T) output from the symbol distribution expression sequence conversion unit 302. The label estimation unit 303 performs label estimation using an estimation model to which an estimation model parameter is provided on the basis of the intermediate acoustic feature amount sequence H (length T) and the intermediate character feature amount sequence C″ (length T) and outputs an output probability distribution Y of a two-dimensional matrix. The label estimation unit 303 performs label estimation by a neural network from the intermediate acoustic feature amount sequence H and the intermediate character feature amount sequence C″ (length T). The label estimation unit 303 outputs the output probability distribution Y (two-dimensional matrix) as an estimation result by using formula (2).
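Because H and C″ share the same length T, formula (2) can be applied frame by frame; a sketch with the same hypothetical dimensions follows (log probabilities are returned for convenience in the later CE loss):

```python
import torch

class LabelEstimator(torch.nn.Module):
    """Sketch of the label estimation unit 303 applying formula (2) frame by frame."""
    def __init__(self, hidden_dim=256, joint_dim=256, vocab_size=100):
        super().__init__()
        self.W1 = torch.nn.Linear(hidden_dim, joint_dim, bias=False)
        self.W2 = torch.nn.Linear(hidden_dim, joint_dim, bias=True)   # the bias plays the role of b
        self.W3 = torch.nn.Linear(joint_dim, vocab_size)

    def forward(self, H, C_dprime):    # both (batch, T, hidden_dim)
        logits = self.W3(torch.tanh(self.W1(H) + self.W2(C_dprime)))
        return torch.log_softmax(logits, dim=-1)   # (batch, T, vocab): a T x K matrix per utterance

log_Y = LabelEstimator()(torch.randn(1, 120, 256), torch.randn(1, 120, 256))
print(log_Y.shape)                     # torch.Size([1, 120, 100])
```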
The CE loss calculation unit 305 receives the output probability distribution Y (two-dimensional matrix) output from the label estimation unit 303 and the frame unit symbol sequence c′ (length T) output from the sequence length conversion unit 304. The CE loss calculation unit 305 calculates a CE loss LCE of the output probability distribution Y with respect to the frame unit symbol sequence c′ on the basis of the frame unit symbol sequence c′ and the output probability distribution Y by using formula (4).
The control unit 306 controls processing of each functional unit of the training apparatus 300. The control unit 306 updates a conversion model parameter of the speech distribution expression sequence conversion unit 301, a conversion model parameter of the symbol distribution expression sequence conversion unit 302, and a label estimation model parameter of the label estimation unit 303 using the CE loss LCE calculated by the CE loss calculation unit 305.
The control unit 306 repeats processing performed by the speech distribution expression sequence conversion unit 301, processing performed by the sequence length conversion unit 304, processing performed by the symbol distribution expression sequence conversion unit 302, processing performed by the label estimation unit 303, and processing performed by the CE loss calculation unit 305 until a predetermined termination condition is satisfied.
This termination condition is not limited, and may be, for example, a condition that the number of repetitions reaches a threshold value, a condition that the amount of change in the CE loss LCE before and after a repetition becomes equal to or less than a threshold value, or a condition that the amount of change in the conversion model parameter of the speech distribution expression sequence conversion unit 301 and the label estimation model parameter of the label estimation unit 303 before and after a repetition becomes equal to or less than a threshold value. In a case where the termination condition is satisfied, the speech distribution expression sequence conversion unit 301 outputs the conversion model parameter γ1, and the label estimation unit 303 outputs the label estimation model parameter γ2.
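For illustration only, such a termination check could look like the following sketch (the thresholds and names are hypothetical):

```python
def should_stop(step, loss_history, max_steps=100000, loss_tol=1e-4):
    """Return True when a termination condition such as those above is satisfied."""
    if step >= max_steps:                       # the number of repetitions reached a threshold
        return True
    if len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) <= loss_tol:
        return True                             # the change in the CE loss fell below a threshold
    return False
```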
Further, by inputting the frame unit symbol sequence c″ (length T), obtained by delaying the frame unit symbol sequence c′ by one frame, to the symbol distribution expression sequence conversion unit 302, the control unit 306 pre-trains the first conversion model, the second conversion model, and the estimation model of the RNN-T as an autoregressive model for predicting the next label.
[Sequence Length Conversion Unit] Next, processing of the sequence length conversion unit 304 will be described.
First, the sequence length conversion unit 304 adds a blank (“null”) symbol to the head and the tail of the symbol sequence c (length U). Next, the sequence length conversion unit 304 creates a vector c′ having a length T. Thereafter, the sequence length conversion unit 304 divides the number T of frames of the entire input sequence by the number (U+2) of symbols and recursively allocates symbols to c′.
In addition, in a streaming model operating left-to-right, there is a possibility that output timing is delayed. Therefore, the sequence length conversion unit 304 can change the offset position to which a symbol is allocated by a shift width n. By recursively allocating symbols in this manner, the final frame unit symbol sequence c′ (length T) is obtained.
In addition, the sequence length conversion unit 304 generates a frame unit symbol sequence c″ (length T−1) by delaying the frame unit symbol sequence c′ by one frame and deleting the tail symbol such that the output formed by the label estimation unit 303 becomes two-dimensional, and inputs the frame unit symbol sequence c″ to the symbol distribution expression sequence conversion unit 302. A length T is obtained by adding a blank (“null”) symbol to the head of the frame unit symbol sequence c″ delayed by one frame. Therefore, the training apparatus 300 pre-trains RNN-T as an autoregressive model for predicting the next label.
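The following is a minimal sketch of this conversion (the blank symbol id, the handling of remainder frames, and the function name are assumptions for illustration):

```python
BLANK = 0    # assumed id of the blank ("null") symbol

def sequence_length_conversion(c, T, shift=0):
    """Sketch of the sequence length conversion unit 304: symbol sequence c (length U) -> c', c''."""
    padded = [BLANK] + list(c) + [BLANK]             # add a blank to the head and the tail (length U + 2)
    per_symbol = T // len(padded)                    # T divided by the number (U + 2) of symbols
    c_prime = []
    for sym in padded:
        c_prime.extend([sym] * per_symbol)
    c_prime.extend([BLANK] * (T - len(c_prime)))     # assumption: fill remainder frames with blanks
    c_prime = ([BLANK] * shift + c_prime)[:T]        # the shift width n moves the allocation offset
    # c'' is c' delayed by one frame: delete the tail symbol and add a blank to the head (length T).
    c_dprime = [BLANK] + c_prime[:-1]
    return c_prime, c_dprime

c_prime, c_dprime = sequence_length_conversion([7, 8, 9], T=10, shift=1)
print(c_prime)     # [0, 0, 0, 7, 7, 8, 8, 9, 9, 0]
print(c_dprime)    # [0, 0, 0, 0, 7, 7, 8, 8, 9, 9]
```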
[Training Processing] Next, a processing procedure of training processing will be described.
The speech distribution expression sequence conversion unit 301 performs speech distribution expression sequence conversion processing (first conversion process) for converting the input acoustic feature amount sequence X into the intermediate acoustic feature amount sequence H (length T) (step S1). The sequence length conversion unit 304 performs sequence length conversion processing (second conversion process) for converting the symbol sequence c to generate a frame unit symbol sequence c′ having the length T and delaying the frame unit symbol sequence c′ by one frame to generate a frame unit symbol sequence c″ having the length T (step S2).
The symbol distribution expression sequence conversion unit 302 performs symbol distribution expression sequence conversion processing (third conversion process) for converting the frame unit symbol sequence c″ (length T) input from the sequence length conversion unit 304 into an intermediate character feature amount sequence C″ (length T) (step S3).
Subsequently, the label estimation unit 303 performs label estimation processing (estimation process) for performing label estimation by a neural network on the basis of the intermediate acoustic feature amount sequence H (length T) output from the speech distribution expression sequence conversion unit 301 and the intermediate character feature amount sequence C″ (length T) output from the symbol distribution expression sequence conversion unit 302, and outputting an output probability distribution Y of a two-dimensional matrix (step S4).
The CE loss calculation unit 305 performs CE loss calculation processing (calculation process) for calculating a CE loss LCE of the output probability distribution Y with respect to the frame unit symbol sequence c′ on the basis of the frame unit symbol sequence c′ and the output probability distribution Y (step S5).
The control unit 306 updates the model parameters of the speech distribution expression sequence conversion unit 301, the symbol distribution expression sequence conversion unit 302, and the label estimation unit 303 using the CE loss (step S6). The control unit 306 repeats the above-described processing until a predetermined termination condition is satisfied.
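Combining the sketches above, one repetition of steps S1 to S6 could look as follows (this reuses the hypothetical SpeechEncoder, SymbolEncoder, LabelEstimator, and sequence_length_conversion defined earlier; the optimizer choice is an assumption):

```python
import torch

encoder, symbol_encoder, estimator = SpeechEncoder(), SymbolEncoder(), LabelEstimator()
params = list(encoder.parameters()) + list(symbol_encoder.parameters()) + list(estimator.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def pretrain_step(X, c, shift=0):
    H = encoder(X)                                                  # step S1: X -> H (length T)
    T = H.size(1)
    c_prime, c_dprime = sequence_length_conversion(c, T, shift)     # step S2: c -> c' and c''
    c_prime = torch.tensor([c_prime])
    C_dprime = symbol_encoder(torch.tensor([c_dprime]))             # step S3: c'' -> C''
    log_Y = estimator(H, C_dprime)                                  # step S4: (H, C'') -> Y (two-dimensional)
    loss = torch.nn.functional.nll_loss(log_Y.squeeze(0), c_prime.squeeze(0))   # step S5: CE loss vs c'
    optimizer.zero_grad()                                           # step S6: update the model parameters
    loss.backward()
    optimizer.step()
    return loss.item()

print(pretrain_step(torch.randn(1, 120, 80), [7, 8, 9]))
```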
[Effects of Embodiment] In the training apparatus 300 according to the embodiment, a frame-by-frame label is dynamically created in the sequence length conversion unit 304, and a label of a senone sequence is not required. That is, the training apparatus 300 does not require a label of a senone sequence, which has conventionally been required when generating a frame-by-frame label. Therefore, since the training apparatus 300 does not use a conventional speech recognition system, it conforms to the End-to-End concept and does not require a high degree of linguistic expertise, and thus a model can be easily constructed.
In addition, in the training apparatus 300, a frame-by-frame label created in the sequence length conversion unit 304 is shifted by one frame and input to the symbol distribution expression sequence conversion unit 302, and thus the output of the label estimation unit 303 becomes a two-dimensional matrix.
Then, the sequence length conversion unit 304 creates the frame unit symbol sequence c′ (length T) and simultaneously creates the frame unit symbol sequence c″ (obtained by shifting the frame unit symbol sequence c′ by one frame), and inputs the frame unit symbol sequence c″ to the symbol distribution expression sequence conversion unit 302.
Accordingly, in the training apparatus 300, the sequence lengths of the outputs of the speech distribution expression sequence conversion unit 301 and the symbol distribution expression sequence conversion unit 302 match, and thus the output of the label estimation unit 303 becomes a two-dimensional matrix. In other words, the label estimation unit 303 can directly form an output probability distribution Y (two-dimensional matrix) in which cross entropy can be calculated in the CE loss calculation unit 305.
Therefore, the output sequence of the label estimation unit 303 becomes a two-dimensional matrix in the training apparatus 300, and thus the CE loss can be easily calculated, and costs of memory consumption and training time during training can be greatly reduced. In addition, in the training apparatus 300, it is expected that the initial value is better than a randomly initialized parameter and that the performance of a model is improved by performing fine tuning according to RNN-T loss. Further, in the training apparatus 300, the frame unit symbol sequence c″ obtained by shifting the frame unit symbol sequence c′ by one frame is used, and thus RNN-T is pre-trained as an autoregressive model for predicting the next label.
[Speech Recognition Apparatus] Next, a speech recognition apparatus constructed by providing the conversion model parameter γ1 and the label estimation model parameter γ2 that satisfy the termination condition in the training apparatus 300 will be described.
As illustrated in the drawing, the speech recognition apparatus 400 includes a speech distribution expression sequence conversion unit 401 and a label estimation unit 402.
An acoustic feature amount sequence X″ that is a speech recognition target is input to the speech distribution expression sequence conversion unit 401. The speech distribution expression sequence conversion unit 401 obtains and outputs an intermediate acoustic feature amount sequence H″ corresponding to the acoustic feature amount sequence X″ in a case where the conversion model parameter γ1 is provided (step S11).
The intermediate acoustic feature amount sequence H″ output from the speech distribution expression sequence conversion unit 401 is input to the label estimation unit 402. The label estimation unit 402 obtains, as a speech recognition result, a label sequence (output probability distribution) corresponding to the intermediate acoustic feature amount sequence H″ in a case where the label estimation model parameter γ2 is provided, and outputs the label sequence (step S12).
In this way, model parameters optimized by the training apparatus 300 using CE loss are set in the label estimation unit 402 and the speech distribution expression sequence conversion unit 401 in the speech recognition apparatus 400, and thus speech recognition processing can be performed with high accuracy.
[System Configuration of Embodiment] Each component of the training apparatus 300 and the speech recognition apparatus 400 is a functional concept, and does not necessarily have to be physically configured as illustrated in the drawings. That is, specific manners of distribution and integration of the functions of the training apparatus 300 and the speech recognition apparatus 400 are not limited to those illustrated, and all or some thereof may be functionally or physically distributed or integrated in suitable units according to various types of loads or conditions in which the training apparatus 300 and the speech recognition apparatus 400 are used.
In addition, all or some processing performed in the training apparatus 300 and the speech recognition apparatus 400 may be realized by a CPU, a graphics processing unit (GPU), and a program analyzed and executed by the CPU and the GPU. Further, each type of processing performed in the training apparatus 300 and the speech recognition apparatus 400 may be implemented as hardware according to wired logic.
Moreover, among types of processing described in the embodiments, all or some processing described as being automatically performed can also be manually performed. Alternatively, all or some processing described as being manually performed can also be automatically performed through a known method. In addition, the above-mentioned and illustrated processing procedures, control procedures, specific names, and information including various types of data and parameters can be appropriately changed unless otherwise specified.
[Program]
The training apparatus 300 and the speech recognition apparatus 400 are realized, for example, by a computer 1000 including a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disc drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a Basic Input Output System (BIOS). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disc drive interface 1040 is connected to a disc drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disc is inserted into the disc drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
The hard disk drive 1090 stores, for example, an operating system (OS) 1091, an application program 1092, a program module 1093, and program data 1094. That is, a program that defines each type of processing of the training apparatus 300 and the speech recognition apparatus 400 is implemented as the program module 1093 in which a code that can be executed by the computer 1000 is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing the same processing as the functional configuration in the training apparatus 300 and the speech recognition apparatus 400 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced with a solid state drive (SSD).
Furthermore, the setting data used in the processing of the above-described embodiment is stored, for example, in the memory 1010 or the hard disk drive 1090 as the program data 1094. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 and executes them as necessary.
The program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, and may also be stored in, for example, a removable storage medium and read out by the CPU 1020 via the disc drive 1100. Alternatively, the program module 1093 and program data 1094 may be stored in other computers connected via a network (for example, local area network (LAN) or wide area network (WAN)). Then, the program module 1093 and the program data 1094 may be read by the CPU 1020 from another computer via the network interface 1070.
Although the embodiments to which the invention made by the present inventor has been applied have been described above, the present invention is not limited by the description and the drawings that form a part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art or the like on the basis of the present embodiment are all included in the category of the present invention.
Filing Document: PCT/JP2021/003730
Filing Date: 2/2/2021
Country: WO