The present invention relates to a model adaptation technology of a language model.
A technology for adapting the model parameters of a language model to a specific topic or genre is an important issue for speech recognition and the like. A language model adapted to a specific topic or genre is expected to provide higher prediction accuracy in tasks such as language analysis and speech recognition than a general language model.
One of the known model adaptation technologies for a language model is disclosed in NPL 1. NPL 1 discloses a model adaptation technology for a language model trained with a recurrent neural network (RNN). This technology adds a correct genre label during model learning to adapt a genre-independent language model to a model that depends on a specific genre.
In addition, a model adaptation technology in an acoustic model, not in a language model, is disclosed in NPL 2.
[NPL 1] Salil Deena, Madina Hasan, Mortaza Doulaty, Oscar Saz, and Thomas Hain, “Combining feature and model-based adaptation of RNNLMs for multi-genre broadcast speech recognition,” in INTERSPEECH, 2016, pp. 2343-2347.
[NPL 2] Marc Delcroix, Keisuke Kinoshita, Chengzhu Yu, Atsunori Ogawa, Takuya Yoshioka, and Tomohiro Nakatani, “Context adaptive deep neural networks for fast acoustic model adaptation in noisy conditions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5270-5274.
Unfortunately, the model adaptation technology disclosed in NPL 1 uses only a genre as auxiliary information for adaptation and relies on a single type of neural network conditioned on that genre, thus providing poor adaptability. In addition, it is difficult to perform appropriate learning if the correct genre label is not given.
In addition, the model adaptation technology disclosed in NPL 2 shows high adaptability for an acoustic model. Unfortunately, it is unclear how to apply this model adaptation technology to a language model and whether doing so is effective.
In view of the above problems, a purpose of the present invention is to provide a model adaptation technology of a language model with higher adaptability.
To solve the above problems, an aspect of the present disclosure relates to an apparatus including a first neural network unit that transforms an input symbol and outputs an intermediate state, and a second neural network unit that transforms input auxiliary information and the intermediate state and predicts a symbol following the input symbol, wherein the second neural network unit includes a plurality of hidden layers each receiving, as input, the intermediate state and auxiliary information, and the pieces of auxiliary information input to the respective hidden layers are different from each other.
The present invention can provide a model adaptation technology of a language model with higher adaptability.
The following examples disclose an estimation apparatus for estimating, from an input symbol, the next symbol using a learned neural network (language model) and a language model learning apparatus for learning the language model. Here, a “symbol” is an object to be processed by the language model, including, for example, a word, a phoneme, a morpheme, and a character. On the basis of the input symbol, the language model outputs a prediction result of a symbol that is likely to appear next. A process will be described below for an example where the processing unit is a “word.”
A language model according to the examples described below receives, as input, a word vector w(t) that expresses a word as a vector having predetermined dimensions, and outputs a prediction result {circumflex over ( )}w(t+1) of the next word using auxiliary information. For example, in speech recognition, the model is used to predict a word {circumflex over ( )}w(t+1) that is likely to be uttered next after a word w(t) uttered at time t. A primary feature of the language model of the present invention is that the neural network that depends on auxiliary information (such as a topic, a context, and a speaker) includes a plurality of separated hidden layers, and each separated hidden layer performs a transformation based on different auxiliary information (γn).
First, with reference to the corresponding drawing, a configuration of an estimation apparatus 100 that estimates the next word using the learned language model will be described.
As shown in the drawing, the estimation apparatus 100 includes a first transformation unit 110, a second transformation unit 120, a third transformation unit 130, a fourth transformation unit 140, and a fifth transformation unit 150.
The first transformation unit 110 acquires a first output state x(t) by linearly transforming an input vector w(t) with a weight matrix U(w). For example, the first transformation unit 110 may be implemented by a linear layer of a neural network. Specifically, the first transformation unit 110 calculates x(t) for a vector w(t) obtained by encoding a word input at time point t, according to
x(t)=U(w)w(t) [Formula 1]
and outputs x(t) to the second transformation unit 120.
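As a non-authoritative illustration, the first transformation unit may be sketched as a single bias-free linear layer as follows; the vocabulary size, the embedding dimension, the word index, and all variable names are assumptions introduced only for this sketch.

```python
import torch
import torch.nn as nn

# Assumed sizes, chosen only for illustration.
VOCAB_SIZE = 10000   # dimension of the one-hot word vector w(t)
EMBED_SIZE = 256     # dimension of the first output state x(t)

# First transformation unit: x(t) = U(w) w(t), a single linear layer without a bias term.
first_transformation = nn.Linear(VOCAB_SIZE, EMBED_SIZE, bias=False)

w_t = torch.zeros(1, VOCAB_SIZE)  # one-hot encoding of the word input at time point t
w_t[0, 42] = 1.0                  # the word index 42 is a made-up example
x_t = first_transformation(w_t)   # first output state x(t), shape (1, EMBED_SIZE)
```

In practice, multiplying a one-hot vector by U(w) is equivalent to looking up one column of U(w), which is why such a layer is often realized as an embedding lookup.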
The second transformation unit 120 acquires a second output state h(t) by inputting x(t) output from the first transformation unit 110 to the neural network. The configuration of the neural network is not particularly limited and may include a recurrent neural network (RNN), a long short-term memory (LSTM), a deep neural network (DNN), a convolutional neural network (CNN), or the like. By way of example, when implementing the neural network with an LSTM, the second transformation unit 120 calculates the second output state h(t) according to
i(t) = σ(W(i,w)x(t) + W(i,h)h(t−1) + b(i))
f(t) = σ(W(f,w)x(t) + W(f,h)h(t−1) + b(f))
o(t) = σ(W(o,w)x(t) + W(o,h)h(t−1) + b(o))
g(t) = tanh(W(g,w)x(t) + W(g,h)h(t−1) + b(g))
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ g(t)
h(t) = o(t) ⊙ tanh(c(t)) [Formula 2]
and outputs h(t) to the fourth transformation unit 140. Here, i(t), f(t), and o(t) are the input gate, the forget gate, and the output gate in the LSTM, respectively, and W(j,w) and W(j,h) are the weight matrices of a gate j for the input word and the previous hidden layer, respectively. In addition, σ(·) denotes the element-wise sigmoid of a vector, tanh(·) denotes the element-wise hyperbolic tangent of a vector, and
⊙ [Formula 3]
denotes the element-wise multiplication of vectors. The element-wise sigmoid of a vector is a function that returns another vector composed of sigmoid function values for the respective elements of the vector. The element-wise hyperbolic tangent of a vector is a function that returns another vector composed of tanh function values for the respective elements of the vector. In addition, c(t) is a hidden state.
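As a hedged sketch, Formula 2 corresponds to one step of a standard LSTM cell, which may be reproduced, for example, with an off-the-shelf LSTM cell as follows; the dimensions and variable names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

EMBED_SIZE, HIDDEN_SIZE = 256, 512      # assumed dimensions

# Second transformation unit: Formula 2 is one step of an LSTM cell.
lstm_cell = nn.LSTMCell(EMBED_SIZE, HIDDEN_SIZE)

x_t = torch.randn(1, EMBED_SIZE)        # first output state x(t) from the first unit
h_prev = torch.zeros(1, HIDDEN_SIZE)    # previous output state h(t-1)
c_prev = torch.zeros(1, HIDDEN_SIZE)    # previous hidden state c(t-1)

# The cell internally computes the gates i(t), f(t), o(t), the candidate g(t),
# and the states c(t) and h(t) as in Formula 2.
h_t, c_t = lstm_cell(x_t, (h_prev, c_prev))
```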
The third transformation unit 130 linearly transforms auxiliary information a(t), which is an output of Latent Dirichlet Allocation (LDA), with the neural network. The third transformation unit 130 thus acquires transformed auxiliary information γ and outputs it to the fourth transformation unit 140. Specifically, the third transformation unit 130 may be implemented as a linear layer that performs a linear transformation with a weight matrix U(a) and a bias vector b(U,a), according to
γ=[γ1, γ2, . . . , γn, . . . , γN]=σ(U(a)a(t)+b(U,a)) [Formula 4]
The transformed auxiliary information γ is separated into its elements. Each separated element γn is input to the corresponding hidden layer 140_n in the fourth transformation unit 140.
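A minimal sketch of the third transformation unit is shown below; the dimension of the topic vector a(t) and the number N of separated hidden layers are not specified in the text and are assumed here only for illustration.

```python
import torch
import torch.nn as nn

NUM_TOPICS = 50   # assumed dimension of the LDA topic vector a(t)
N = 8             # assumed number of separated hidden layers

# Third transformation unit: gamma = sigmoid(U(a) a(t) + b(U,a))  (Formula 4)
third_transformation = nn.Sequential(nn.Linear(NUM_TOPICS, N), nn.Sigmoid())

a_t = torch.rand(1, NUM_TOPICS)
a_t = a_t / a_t.sum()              # a(t): a topic probability distribution from the LDA
gamma = third_transformation(a_t)  # gamma = [gamma_1, ..., gamma_N], shape (1, N)

# Each element gamma_n is routed to the corresponding separated hidden layer 140_n.
gamma_per_layer = [gamma[:, n:n + 1] for n in range(N)]
```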
Here, Latent Dirichlet Allocation (LDA) is a kind of topic model for estimating latent topics (such as a subject or a field) of a text document. The LDA receives a text document as input and outputs a vector of a probability distribution representing the probability that the text document belongs to each latent topic. This output is the a(t) used as input to the third transformation unit 130. In other words, the LDA is a generative model for estimating topics in a set of text documents: each text document in the set is modeled as a mixture of latent topics, and each latent topic is modeled as a probability distribution over words. The LDA ignores the order of the words in the text document and expresses the features of the text document using a bag-of-words model. Specifically, the LDA expresses the text document as a low-rank matrix composed of the sentences in the text document expressed as vectors using the bag-of-words model. The LDA generates each text document using the following generative process.
1. The length M of a text document is sampled from the Poisson distribution.
2. A multinomial distribution for topics in the text document is selected by sampling from the Dirichlet distribution identified with α.
3. For each of the M words, a topic qm is selected from the multinomial distribution selected in step 2, and a word wm is selected from the unigram distribution related to that topic,
p(wm|qm, β) [Formula 5]
where α and β are the main parameters of the LDA and are learned in advance from training data. α determines the shape of the Dirichlet distribution over the multinomial distributions from which the topics are drawn, and β is a parameter that directly affects the word probabilities.
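As a rough illustration of the generative process in steps 1 to 3 above, the following sketch samples one toy document; the numbers of topics and vocabulary words, the parameter values, and the use of NumPy are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters: K latent topics over a vocabulary of V words.
K, V = 3, 1000
alpha = np.full(K, 0.1)                          # Dirichlet parameter alpha
beta = rng.dirichlet(np.full(V, 0.01), size=K)   # per-topic word distributions p(w|q, beta)

# 1. Sample the length M of the text document from a Poisson distribution.
M = rng.poisson(lam=20)

# 2. Sample a multinomial distribution over topics from the Dirichlet distribution with alpha.
theta = rng.dirichlet(alpha)

# 3. For each of the M words, select a topic q_m and then a word w_m from p(w|q_m, beta).
topics = rng.choice(K, size=M, p=theta)
words = [rng.choice(V, p=beta[q]) for q in topics]
```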
The fourth transformation unit 140 includes N separated hidden layers 140_1, 140_2, . . . , and 140_N. The hidden layers are implemented as linear hidden networks (LHN). The weight matrix for the nth hidden layer 140_n is denoted by Ln(w) and its bias vector by bn(L,w). Among the elements of the transformed auxiliary information γ output from the third transformation unit 130, the element γn is input to the nth hidden layer 140_n. The nth hidden layer 140_n receives, as input, the second output state h(t) provided from the second transformation unit 120 and γn provided from the third transformation unit 130, and outputs, to the fifth transformation unit 150, an output state zn given by
zn = γn(Ln(w)h(t)+bn(L,w)) [Formula 6]
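A possible sketch of the fourth transformation unit under the same assumed dimensions follows; interpreting Formula 6 as scaling each separated linear layer's output by the scalar γn is the reading assumed here.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE, LAYER_SIZE, N = 512, 128, 8   # assumed dimensions

# Fourth transformation unit: N separated linear hidden layers L_n(w) with biases b_n(L,w).
separated_layers = nn.ModuleList([nn.Linear(HIDDEN_SIZE, LAYER_SIZE) for _ in range(N)])

h_t = torch.randn(1, HIDDEN_SIZE)  # second output state h(t) from the second unit
gamma = torch.rand(1, N)           # transformed auxiliary information from the third unit

# z_n = gamma_n (L_n(w) h(t) + b_n(L,w))  (Formula 6): each separated layer's output
# is weighted by its own piece of auxiliary information gamma_n.
z = [gamma[:, n:n + 1] * separated_layers[n](h_t) for n in range(N)]
```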
The fifth transformation unit 150 applies the softmax function to the output states zn (n=1, 2, . . . , N) output from the fourth transformation unit 140 and outputs the resulting value as the prediction result {circumflex over ( )}w(t+1).
Here, softmax(·) is the softmax function. It is a function that transforms an m-dimensional vector of real values into another m-dimensional vector; that is, it transforms an input e=(e1, e2, . . . , em)T (T represents transpose) into an output {circumflex over ( )}e=({circumflex over ( )}e1, {circumflex over ( )}e2, . . . , {circumflex over ( )}em)T whose ith element is {circumflex over ( )}ei=exp(ei)/(exp(e1)+exp(e2)+ . . . +exp(em)).
In other words, it is a function that normalizes the output vector so that the sum of its elements is equal to 1.
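The softmax normalization described above may be sketched, for instance, as follows; the subtraction of the maximum element is a numerical-stability detail added only for this sketch and does not change the result.

```python
import torch

def softmax(e: torch.Tensor) -> torch.Tensor:
    """Map an m-dimensional real vector to an m-dimensional vector whose elements sum to 1."""
    exp_e = torch.exp(e - e.max())   # subtracting the maximum does not change the result
    return exp_e / exp_e.sum()

e = torch.tensor([2.0, 1.0, 0.1])
e_hat = softmax(e)
print(e_hat, e_hat.sum())            # the elements of ^e sum to 1
```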
Although, in the above examples, the auxiliary information a(t) is a probability distribution (vector) that represents the latent topic probabilities of a text document obtained using Latent Dirichlet Allocation (LDA), the auxiliary information a(t) is not limited thereto. For example, the auxiliary information a(t) may be any appropriate auxiliary information that affects the selection of the prediction result {circumflex over ( )}w(t+1) of the next word, which is the output of the language model, other than the information generated only from the immediately preceding word w(t) (hereinafter referred to as "input information to the separated hidden layers"). For example, the auxiliary information a(t) may include the word classes of the past character strings preceding w(t), the context of the immediately preceding sentence (such as information relating to the utterance content of the immediately preceding speaker in a dialogue), the atmosphere of the conversation (formal or casual), the relationship with the conversation partner (such as a boss or subordinate, a friend, or someone met for the first time), and the speaker's gender, age, and area of origin (which affects the language and dialect). Information such as the atmosphere of the conversation, the conversation partner, and the speaker's gender may be information representing the acquisition environment of the word string w(t0), . . . , w(t) that is chronologically previous to the word {circumflex over ( )}w(t+1) to be predicted. Note that in this example, t0 and t are indices of the word string, a smaller index corresponds to a chronologically earlier word, and t0<t. For example, for an index t0 of the word at the beginning of a sentence, there is a word string w(t0), . . . , w(t) from the beginning of the sentence to the index t immediately preceding the index t+1 of the word to be predicted.
In addition, in the above examples, the auxiliary information a(t) is a single type of information, a topic probability, that is separated into as many pieces as the number (N) of separated hidden layers, and each piece of auxiliary information γn (n=1, . . . , N) is input to the corresponding separated hidden layer. However, the present invention is not limited to this configuration and may have a configuration in which a plurality of different types of auxiliary information is generated from the various types of auxiliary information described above and input to the respective separated hidden layers. In short, pieces of auxiliary information of the same type or of different types may be input to the separated hidden layers as long as the pieces of auxiliary information are different from each other.
As described above, each of the separated hidden layers in the fourth transformation unit in this example applies an adaptation process corresponding to its input auxiliary information and outputs the result. The fifth transformation unit then plays the role of collectively considering all results from the separated hidden layers and outputting the final prediction result {circumflex over ( )}w(t+1) of the next word.
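Putting the five units together, one hypothetical end-to-end sketch of the estimation apparatus 100 is shown below. All dimensions are assumptions, and the way the fifth transformation unit combines the separated outputs zn (here, a simple sum fed to a shared output layer) is likewise an assumption, since the corresponding formula is not reproduced above.

```python
import torch
import torch.nn as nn

class AuxiliaryAdaptedLM(nn.Module):
    """Hypothetical sketch of the estimation apparatus 100 (assumed dimensions and output combination)."""

    def __init__(self, vocab_size=10000, embed_size=256, hidden_size=512,
                 num_topics=50, n_layers=8, layer_size=128):
        super().__init__()
        self.first = nn.Linear(vocab_size, embed_size, bias=False)        # unit 110
        self.second = nn.LSTMCell(embed_size, hidden_size)                # unit 120
        self.third = nn.Sequential(nn.Linear(num_topics, n_layers),       # unit 130
                                   nn.Sigmoid())
        self.fourth = nn.ModuleList([nn.Linear(hidden_size, layer_size)   # unit 140
                                     for _ in range(n_layers)])
        self.fifth = nn.Linear(layer_size, vocab_size)                    # unit 150

    def forward(self, w_t, a_t, state):
        x_t = self.first(w_t)                         # Formula 1
        h_t, c_t = self.second(x_t, state)            # Formula 2
        gamma = self.third(a_t)                       # Formula 4
        z = [gamma[:, n:n + 1] * layer(h_t)           # Formula 6
             for n, layer in enumerate(self.fourth)]
        # Assumption: the separated outputs are summed before the final softmax layer.
        logits = self.fifth(torch.stack(z).sum(dim=0))
        return torch.softmax(logits, dim=-1), (h_t, c_t)
```

A call such as probs, state = model(w_t, a_t, (h_prev, c_prev)) would then yield a probability distribution over the vocabulary, i.e., the prediction result {circumflex over ( )}w(t+1).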
With reference to the corresponding drawing, a configuration of a language model learning apparatus 200 for learning the language model described above will be described next.
As shown in the drawing, the language model learning apparatus 200 includes a first transformation unit 210, a second transformation unit 220, a third transformation unit 230, a fourth transformation unit 240, a fifth transformation unit 250, and a parameter update unit 260.
An input w(t) to the first transformation unit 210 is a word vector provided for learning. It is assumed that the input w(t) is previously associated with an output vector w(t+1) of the correct answer provided to the parameter update unit 260. Like the first transformation unit 110, the first transformation unit 210 acquires a first output state x(t) by linearly transforming the input vector w(t) with a weight matrix U(w) and outputs x(t) to the second transformation unit 220.
Like the second transformation unit 120, the second transformation unit 220 acquires a second output state h(t) by inputting x(t) output from the first transformation unit 210 to the LSTM and outputs h(t) to the fourth transformation unit 240.
An input a(t) to the third transformation unit 230 is auxiliary information corresponding to the word vector w(t) provided for learning. The auxiliary information a(t) may be previously stored in association with w(t) as training data or may be calculated with the Latent Dirichlet Allocation (LDA) or the like from the training data w(t) that is input to the first transformation unit 210. Note that it is assumed that the type of auxiliary information used in learning and the type of auxiliary information used in operation of the estimation apparatus 100 are the same. Like the third transformation unit 130, the third transformation unit 230 acquires auxiliary information γ by linearly transforming the auxiliary information a(t) with the neural network and outputs γ to the fourth transformation unit 240.
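If the auxiliary information a(t) is to be computed from the training text with LDA, one possible (non-authoritative) route is an off-the-shelf LDA implementation such as gensim; the toy documents, the number of topics, and all names below are assumptions made only for this sketch.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy training documents, already tokenized (assumed for illustration).
docs = [["speech", "recognition", "language", "model"],
        ["soccer", "match", "goal", "score"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

NUM_TOPICS = 5  # assumed number of latent topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=NUM_TOPICS, random_state=0)

# a(t): the topic probability vector for the text around the current word.
bow = dictionary.doc2bow(["language", "model", "speech"])
a_t = [0.0] * NUM_TOPICS
for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
    a_t[topic_id] = prob   # probability that the text belongs to each latent topic
```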
The fourth transformation unit 240 includes N separated hidden layers 240_1, 240_2, . . . , 240_N. Like the nth hidden layer 140_n in the fourth transformation unit 140, the nth hidden layer 240_n acquires an output state zn from a second output state h(t) provided from the second transformation unit 220 and γn provided from the third transformation unit 230 and outputs zn to the fifth transformation unit 250.
Like the fifth transformation unit 150, the fifth transformation unit 250 transforms the output states zn (n=1, 2, . . . , N) output from the fourth transformation unit 240 to generate a prediction output vector {circumflex over ( )}w(t+1) and outputs it to the parameter update unit 260. The prediction output vector {circumflex over ( )}w(t+1) may be, for example, a softmax function value related to the output states zn. Note that the prediction output vector is represented by the symbol {circumflex over ( )}w(t+1) to distinguish it from the correct answer data w(t+1) corresponding to w(t) included in the training data.
The parameter update unit 260 compares the prediction output vector {circumflex over ( )}w(t+1) output from the fifth transformation unit 250 with the correct answer output vector w(t+1) previously provided as training data and updates the parameters of the language model depending on the comparison result. The parameters may be updated according to, for example, any appropriate parameter update technique for neural networks, such as error backpropagation. The parameters of the neural network are thus repeatedly updated so that the prediction output vector {circumflex over ( )}w(t+1) approaches the correct answer output vector w(t+1).
Then, if the comparison result between the prediction output vector {circumflex over ( )}w(t+1) and the correct answer output vector w(t+1) reaches a predetermined criterion, the parameter update unit 260 outputs the parameter values of the neural network and ends the parameter update process. The predetermined criterion is, for example, that the proximity (such as the distance) between {circumflex over ( )}w(t+1) and w(t+1) becomes less than a predetermined threshold, or that the number of repetitions (the number of update processes performed by the parameter update unit 260) reaches a predetermined number.
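A minimal sketch of how the parameter update unit 260 might be realized with standard error backpropagation is given below; it reuses the hypothetical AuxiliaryAdaptedLM class sketched earlier, and the optimizer, loss function, stopping rule, and dummy training data are all assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn

model = AuxiliaryAdaptedLM()                               # hypothetical sketch from above
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # assumed update rule
loss_fn = nn.NLLLoss()

# Dummy training data (assumed): one triple of (w(t), a(t), index of the correct w(t+1)).
w_example = torch.zeros(1, 10000); w_example[0, 42] = 1.0
a_example = torch.full((1, 50), 1.0 / 50)
training_data = [(w_example, a_example, torch.tensor([7]))]

MAX_EPOCHS = 10  # assumed cap on the number of update repetitions

for epoch in range(MAX_EPOCHS):
    state = (torch.zeros(1, 512), torch.zeros(1, 512))
    for w_t, a_t, target_index in training_data:
        probs, state = model(w_t, a_t, state)
        loss = loss_fn(torch.log(probs), target_index)     # compare ^w(t+1) with w(t+1)
        optimizer.zero_grad()
        loss.backward()                                    # error backpropagation
        optimizer.step()
        state = (state[0].detach(), state[1].detach())     # cut the gradient history
```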
Note that if the learned language model is used in speech recognition, the input vector w(t) may be a word vector that corresponds to a word string obtained by recognizing an input voice signal s(t) in a time interval t. The word vector may be input to the language model to obtain a word candidate {circumflex over ( )}w(t+1) that is likely to be uttered next. As a result, when obtaining a word string that is the speech recognition result of a voice signal s(t+1) in the next time interval, the estimation apparatus 100 may obtain the recognition result with reference to {circumflex over ( )}w(t+1) predicted by the language model.
Here, the estimation apparatus 100 and the language model learning apparatus 200 may typically be provided by a computing apparatus such as a server. They may include, for example, a hardware configuration as shown in the corresponding drawing.
The estimation apparatus 100 and the language model learning apparatus 200 are not limited to the above hardware configuration and may be implemented by any other appropriate hardware configuration.
Thus, while the examples of the present invention have been described in detail, the present invention is not limited to the above specific embodiments and various modifications and variations may be made within the gist and scope of the present invention described in the claims.
Number | Date | Country | Kind
---|---|---|---
2018-041781 | Mar 2018 | JP | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2019/005871 | 2/18/2019 | WO | 00