The present application claims priority based on Japanese Patent Application No. 2024-005639 filed Jan. 17, 2024, the content of which is incorporated herein by reference.
The present invention relates to an attention-based contextual biasing method that can be customized using an editable phrase list.
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer.
This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve the contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate of the target phrases in the bias list on the Librispeech-960 (English) dataset and the corresponding character error rate on our in-house (Japanese) dataset.
End-to-end (E2E) automatic speech recognition (ASR) [1, 2] methods directly convert acoustic feature sequences to token sequences without requiring the multiple components used in conventional ASR systems, such as acoustic models (AM) and language models (LM). Various E2E-ASR methods have been proposed, including connectionist temporal classification (CTC) [3], the recurrent neural network transducer (RNN-T) [4], attention mechanisms [5, 6], and their various hybrids [7-9]. Since the effectiveness of E2E-ASR methods is inherently related to the context in the training data, performance expectations may not be satisfied consistently for a given user context. For example, personal names and technical terms tend to be important keywords in different contexts, but such terms may not appear frequently in the available training data, which results in poor recognition accuracy. It is impractical to cover all contexts during training; thus, the user or developer should be able to contextualize the model easily without additional training.
A typical approach to this problem is shallow fusion using an external LM [10-14]. For example, [10-12] used a weighted finite state transducer (WFST) to construct an in-class LM to facilitate contextualization for the target named entities. Neural LM fusion methods have also been proposed [13, 14]. The LM fusion technique attempts to enhance accuracy by combining an E2E-ASR model with an external neural LM and then rescoring the hypotheses generated by the E2E-ASR model. However, whether employing WFSTs or neural LMs, training an external LM requires additional training steps.
Thus, several methods have been proposed that do not require retraining. These methods include knowledge graph modeling for recognizing out-of-vocabulary named entities, contextual spelling correction using an editable phrase list, and named-entity-aware ASR models that recognize specific named entities based on phoneme similarity. However, these methods have limitations, such as requiring a text-to-speech (TTS) model for training and being unable to handle words other than the predefined target named entities.
Deep biasing methods [17-20] provide an alternative approach to realize effective contextualization without requiring retraining processes or TTS models. In such methods, the E2E-ASR model can be contextualized using an editable phrase list, which is referred to as a bias list in this paper. Most deep biasing methods implement a cross-attention layer between the bias list and input sequences to recognize the bias phrases correctly. However, it has been observed that simply adding a cross-attention layer for the bias list is not effective [21]. Thus, [21, 22] introduced an additional branch designed to detect bias phrases, which indirectly helps to update the parameters of the cross-attention layer through an auxiliary loss. In contrast, [23, 24] introduced an auxiliary loss function directly on the cross-attention layer (referred to as the bias phrase index loss; described in Section 3.2), which detects the bias phrase index. Although this approach allows a direct parameter update of the cross-attention layer, it cannot distinguish whether the output tokens come from the bias list. In addition, it requires two-stage training using a pretrained ASR model, which is time-consuming.
This paper proposes a deep biasing method that employs both an auxiliary loss applied directly on the cross-attention layer, termed the bias phrase index loss, and special tokens for bias phrases to realize more effective bias phrase detection. Unlike conventional indirect methods [21, 22], our method facilitates the effective training of the cross-attention layer through the bias phrase index loss. Additionally, our technique departs from current methods by introducing special tokens for bias phrases. This allows the model to focus on the bias phrases more effectively, eliminating the need for a two-stage training process. Furthermore, we propose a bias phrase boosted (BPB) beam search algorithm that integrates the bias phrase index probability during inference, augmenting the performance of bias phrase recognition. The main contributions of this study are as follows:
This section describes an attention-based encoder-decoder system that consists of an audio encoder and an attention-based decoder, which are extended in the proposed method.
The audio encoder comprises two convolutional layers, a linear projection layer, and M_a Conformer blocks [25]. The Conformer encoder transforms an audio feature sequence X into a T-length hidden state vector sequence H = [h_1, . . . , h_T] ∈ R^{T×d}, where d represents the feature dimension, as follows:

H = AudioEnc(X). (1)
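As a shape-level illustration of Eq. (1), the following sketch (not the authors' implementation; the ConformerBlock stand-in and all hyperparameters are illustrative) shows how the two stride-2 convolutions, the linear projection, and the M_a Conformer blocks map an 80-dimensional feature sequence to H:

```python
import torch
import torch.nn as nn

# Stand-in for a true Conformer block (which would also contain a
# convolution module and macaron feed-forward layers); used here only
# to keep the sketch self-contained and runnable.
def ConformerBlock(d):
    return nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=1024,
                                      batch_first=True)

class AudioEncoder(nn.Module):
    """Shape-level sketch of Eq. (1): X -> H in R^{T' x d}."""
    def __init__(self, n_mels=80, d=256, m_a=12):
        super().__init__()
        # Two stride-2 convolutions downsample the frame axis by 4x.
        self.conv = nn.Sequential(
            nn.Conv2d(1, d, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d, d, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d * (n_mels // 4), d)
        self.blocks = nn.ModuleList(ConformerBlock(d) for _ in range(m_a))

    def forward(self, x):                      # x: (batch, T, n_mels)
        h = self.conv(x.unsqueeze(1))          # (batch, d, T/4, n_mels/4)
        h = h.permute(0, 2, 1, 3).flatten(2)   # (batch, T/4, d * n_mels/4)
        h = self.proj(h)                       # (batch, T/4, d)
        for block in self.blocks:
            h = block(h)
        return h                               # H = [h_1, ..., h_T'] per Eq. (1)
```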
The posterior probability is formulated as follows:

P(Y|X) = ∏_{s=1}^{S} P(y_s | y_{0:s-1}, X), (2)

where s and S represent the token index and the total number of tokens, respectively. Given H generated by the audio encoder in Eq. (1) and the previous token sequence y_{0:s-1}, the attention-based decoder recursively estimates the next token y_s as follows:

P(y_s | y_{0:s-1}, X) = Decoder(y_{0:s-1}, H). (3)
The attention-based decoder comprises an embedding layer with a positional encoding layer, M_d Transformer blocks, and a linear layer. Each Transformer block has a multi-head self-attention layer, a cross-attention layer (i.e., audio attention), and a linear layer with
layer normalization (LN) layers and residual connections. Here, the audio attention layer including the LN is formulated as follows:
where U and U′ represent the input and output of the audio attention layer, respectively. In addition, the hybrid CTC/attention model [7] includes a CTC decoder. The attention-based decoder will be extended to the proposed bias decoder in Section 3.2.
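A minimal sketch of the audio attention layer follows (the residual-then-LN composition is an assumption of this sketch, not necessarily the exact formulation):

```python
import torch
import torch.nn as nn

class AudioAttention(nn.Module):
    """Sketch of the decoder's audio cross-attention: decoder states U
    attend to the encoder output H. The residual-then-LN composition is
    an assumption of this sketch."""
    def __init__(self, d=256, n_heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d)

    def forward(self, u, h):
        # u: (batch, S, d) decoder states (queries)
        # h: (batch, T, d) encoder output (keys and values)
        attn_out, _ = self.mha(query=u, key=h, value=h)
        return self.ln(u + attn_out)   # U', the output of the audio attention
```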
The bias encoder comprises an embedding layer with a positional encoding layer, M_e Transformer blocks, a mean pooling layer, and a bias list B = {b_0, b_1, . . . , b_N}, where n and b_n represent the bias phrase index and the token sequence of the n-th bias phrase (e.g., “play a song”), respectively. Here, b_0 is a dummy phrase that means “no-bias”. After applying zero padding based on the maximum token length L_max in the bias list B, the embedding layer and the Transformer blocks extract a set of token-level feature sequences G ∈ R^{(N+1)×L_max×d} as follows:

G = BiasEnc(B). (5)
Then, mean pooling is performed to extract a phrase-level feature sequence V = [v_0, v_1, . . . , v_N] ∈ R^{(N+1)×d} as follows:

V = MeanPool(G). (6)
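A minimal sketch of the bias encoder under these definitions (positional encoding omitted; the padding-mask handling is an assumption of this sketch):

```python
import torch
import torch.nn as nn

class BiasEncoder(nn.Module):
    """Sketch of the bias encoder: embedding + M_e Transformer blocks
    yield token-level features G; mean pooling over non-padded positions
    yields phrase-level vectors V (positional encoding omitted)."""
    def __init__(self, vocab_size, d=256, m_e=3, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, m_e)

    def forward(self, bias_tokens):
        # bias_tokens: (N+1, L_max) zero-padded token ids; in practice
        # row 0 holds a learned dummy token for the "no-bias" phrase b_0.
        pad = bias_tokens.eq(0)                                # True at padding
        g = self.blocks(self.embed(bias_tokens),
                        src_key_padding_mask=pad)              # G: (N+1, L_max, d)
        lengths = (~pad).sum(1, keepdim=True).clamp(min=1)
        v = g.masked_fill(pad.unsqueeze(-1), 0.0).sum(1) / lengths  # V: (N+1, d)
        return g, v
```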
The bias decoder is an extension of the attention-based decoder described in Section 2.2, where an additional cross-attention layer (i.e., bias attention) is introduced to each Transformer block.
Given H and V in Eqs. (1) and (6) and the previous token sequence y_{0:s-1}, the bias decoder recursively estimates the next token y_s, unlike Eq. (3), as follows:

P(Y|X, B) = ∏_{s=1}^{S} P(y_s | y_{0:s-1}, X, B), (7)

P(y_s | y_{0:s-1}, X, B) = BiasDecoder(y_{0:s-1}, H, V). (8)
In the Transformer block of the bias decoder, the bias attention layer including the LN is formulated as follows:
In addition, the bias attention layer estimates the bias phrase index sequence n̂ = [n̂_1, n̂_2, . . . , n̂_S] as follows:
where u′_s denotes the s-th feature vector of U′ = [u′_0, u′_1, . . . , u′_S]. For example, if a bias phrase “play a song” with a bias index of 2 appears in the utterance, the bias attention layer should estimate n̂_s = 2 for the output steps corresponding to that phrase and n̂_s = 0 (i.e., no-bias) for all other steps.
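One plausible realization of this index estimation (a sketch; the scaled dot-product scoring against the phrase vectors is an assumption, not necessarily the authors' exact formulation):

```python
import math
import torch

def bias_index_probs(u_prime, v):
    """Per-step distribution over bias phrase indices.

    u_prime: (batch, S, d) feature vectors u'_s from the bias decoder
    v:       (N+1, d)      phrase-level vectors (index 0 = no-bias)
    Returns (batch, S, N+1); taking the argmax over the last axis gives
    the estimated index sequence n_hat.
    """
    d = u_prime.size(-1)
    scores = torch.einsum("bsd,nd->bsn", u_prime, v) / math.sqrt(d)
    return scores.softmax(dim=-1)
```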
where y_gt and n̂_gt represent the one-hot vector sequences of the reference transcription and the reference bias phrase index (including the no-bias option), respectively. Here, we refer to L_batt and L_bidx as the bias attention loss and the bias phrase index loss, respectively.
During the training process, a bias list B is created randomly from the corresponding reference transcriptions for each batch. Specifically, 0 to N_utt bias phrases of 2 to L_max tokens in length are extracted uniformly for each utterance, resulting in a total of N bias phrases (N ≤ N_utt × n_batch, where n_batch denotes the batch size). After the bias list B is extracted randomly, special tokens (<sob>/<eob>) are inserted before and after the extracted phrases in the reference transcription to distinguish whether the output tokens come from the bias list. The proposed method is optimized via multitask learning using the weighted sum of the losses in Eqs. (12) and (13) and the CTC loss (L_ctc):

L = λ_ctc L_ctc + λ_batt L_batt + λ_bidx L_bidx, (14)
where λ_ctc, λ_batt, and λ_bidx represent the training weights.
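A sketch of the per-batch bias list construction and <sob>/<eob> insertion described above (helper names are illustrative; overlapping spans are not handled):

```python
import random

SOB, EOB = "<sob>", "<eob>"

def make_bias_list(batch_refs, n_utt=2, l_max=10):
    """Extract 0 to n_utt random spans of 2 to l_max tokens from each
    reference and tag their occurrences with <sob>/<eob>."""
    bias_list, tagged_refs = [["<no-bias>"]], []   # index 0 = dummy phrase b_0
    for ref in batch_refs:                          # ref: list of tokens
        spans = []
        for _ in range(random.randint(0, n_utt)):
            if len(ref) < 2:
                break
            length = random.randint(2, min(l_max, len(ref)))
            start = random.randint(0, len(ref) - length)
            spans.append((start, start + length))
        spans.sort(reverse=True)                    # splice from the right so
        tagged = list(ref)                          # earlier indices stay valid
        for s, e in spans:
            bias_list.append(ref[s:e])
            tagged[s:e] = [SOB] + ref[s:e] + [EOB]
        tagged_refs.append(tagged)
    return bias_list, tagged_refs
```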
We also propose a bias phrase boosted (BPB) beam search algorithm that exploits the bias phrase index probability, as described in Algorithm 1. The bias decoder calculates the token probability p_new, including the special tokens <sob>/<eob>, using Eq. (8) (line 5). We then estimate the bias phrase index n̂_s using Eq. (11) and the argmax function (line 6). Here, the number of bias phrases N in the bias list B can increase significantly during inference, which would reduce the peak value after applying the softmax function in Eq. (9). Thus, Eq. (9) is approximated using top-k_score pruning, which keeps only the k_score largest scores before the softmax and masks out the rest.
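A minimal sketch of this pruning (function name illustrative):

```python
import torch

def topk_softmax(scores, k_score=50):
    """Keep the k_score largest logits and mask the rest to -inf before
    the softmax, so the peak probability is not diluted as N grows."""
    k = min(k_score, scores.size(-1))
    kth = scores.topk(k, dim=-1).values[..., -1:]            # k-th largest logit
    return scores.masked_fill(scores < kth, float("-inf")).softmax(dim=-1)
```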
Then, if n̂_s = 0 (i.e., “no-bias”), the token probabilities for the special tokens, p_new[<sob>] and p_new[<eob>], are penalized based on the weight α_pen (lines 8 and 9); otherwise, the corresponding token probabilities are increased according to the weight α_bonus (lines 11-13). For example, if the detected bias phrase is “play a song”, the token probabilities for “play”, “a”, and “song” are increased by α_bonus. Based on the boosted probabilities p_new, top-k_beam pruning is performed as in the conventional beam search [7].
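The per-step boosting and penalizing can be sketched as follows (a simplified sketch of lines 5 to 13 of Algorithm 1; the within-phrase position bookkeeping of the real algorithm is omitted):

```python
import torch

def boost_token_probs(log_probs, n_hat, bias_list_tokens,
                      sob_id, eob_id, alpha_bonus=1.0, alpha_pen=10.0):
    """Adjust one beam's next-token log-probabilities.

    log_probs:        (vocab,) log-probabilities from Eq. (8)
    n_hat:            detected bias phrase index (0 = no-bias)
    bias_list_tokens: list of token-id lists; entry 0 is the dummy phrase
    """
    boosted = log_probs.clone()
    if n_hat == 0:
        # No bias phrase detected: discourage entering/leaving a phrase.
        boosted[sob_id] -= alpha_pen
        boosted[eob_id] -= alpha_pen
    else:
        # Boost the special tokens and every token of the detected phrase,
        # e.g., "play", "a", and "song".
        boosted[sob_id] += alpha_bonus
        boosted[eob_id] += alpha_bonus
        for tok in bias_list_tokens[n_hat]:
            boosted[tok] += alpha_bonus
    return boosted
```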
The input features are 80-dimensional Mel filterbanks with a window size of 512 samples and a hop length of 160 samples, to which SpecAugment [26] is applied. The audio encoder has two convolutional layers with a stride of two for downsampling, a 256-dimensional linear projection layer, and 12 Conformer blocks with 1024 linear units. The bias encoder and the bias decoder have three Transformer blocks with 1024 linear units and six Transformer blocks with 2048 units, respectively. The attention layers in the audio encoder, the bias encoder, and the bias decoder are four-head multi-head attention layers with a dimension d of 256. During the training process, a bias list B is created randomly for each batch with N_utt = 2 and L_max = 10, as described in Section 3.3. In this experiment, the bias list B has a total of N = 50 to 200 bias phrases within a batch. The training weights λ_ctc, λ_batt, and λ_bidx (described in Eq. (14)) are set to 0.3, 0.7, and 1.0, respectively. The proposed model is trained for 150 epochs at a learning rate of 0.0015 with 15,000 warmup steps using the Adam optimizer. During the decoding process, the hyperparameters k_beam, k_score, α_bonus, and α_pen (Section 3.4) are set to 20, 50, 1.0, and 10.0, respectively.
The Librispeech corpus (960 h, 100 h) [27] is used to evaluate the proposed method using ESPnet as the E2E-ASR toolkit [28]. The proposed method is evaluated in terms of word error rate (WER), bias phrase WER (B-WER), and unbiased phrase WER (U-WER) [29]. Note that insertion errors are counted toward B-WER if the inserted phrases are present in the bias list; otherwise, insertion errors are counted toward the U-WER. The goal of the proposed method is to improve the B-WER with a slight degradation in the U-WER and overall WER.
First, we verify the effect of the proposed techniques on the Librispeech-100 dataset as a preliminary experiment. Table 1 shows the effect of the bias phrase index loss L_bidx described in Eq. (13), the special tokens for the bias phrases (<sob>/<eob>), and the BPB beam search on the Librispeech-100 test-clean evaluation set with a bias list size of N = 100. Compared with the baseline (the hybrid CTC/attention model [7]), simply introducing the bias attention layer does not improve the performance (A1 vs. B1), whereas the bias phrase index loss improves the B-WER significantly, which results in an improvement to the overall WER (B1 vs. B2).
Table 2 shows the results obtained by the proposed method on the Librispeech-960 data for different bias list sizes N. The baseline is the hybrid CTC/attention model [7]. When the bias list size is N = 100, the proposed method improves the B-WER, which in turn significantly improves the U-WER and the overall WER. In addition, the proposed BPB beam search technique further improves the B-WER without degrading the overall WER and U-WER. The B-WER and U-WER tend to deteriorate as the number of bias phrases N increases; however, the proposed BPB beam search technique is particularly effective in suppressing the deterioration of the B-WER. As a result, the proposed method outperforms the baseline in terms of both WER and B-WER. Although the proposed method underperforms the baseline when no bias phrases are used (N = 0), we do not consider this a critical issue because users typically register the keywords that are important to them.
We also validate the proposed method on our in-house dataset containing 93 hours of Japanese speech data, including meeting and morning assembly scenarios, the Corpus of Spontaneous Japanese (581 h) [30], and 181 hours of Japanese speech in the database developed by the Advanced Telecommunications Research Institute International, with the same experimental setup described in Section 4.1. Table 3 shows the evaluation results obtained on the in-house dataset when N = 203 phrases, such as personal names and technical terms, are registered in the bias list B. The proposed method improves the B-CER significantly with only a slight degradation in the overall CER. Thus, the proposed method is effective for both English and Japanese.
This study introduces a deep biasing model incorporating a bias phrase index loss and special tokens for bias phrases. Additionally, the BPB beam search technique is employed, leveraging the bias phrase index probabilities to enhance accuracy. Experimental results demonstrate that our model improves both the overall WER and the B-WER. Notably, the BPB beam search boosts the B-WER with minimal impact on the overall WER, as evident in both the English and Japanese datasets.