The present invention relates to a speech recognition apparatus, an acoustic model learning apparatus, and a speech recognition method for performing speech recognition, and further relates to a computer-readable recording medium in which a program for realizing these apparatuses and methods is recorded.
Conventionally, in speech recognition, a speech signal input from a microphone is first converted into a feature vector, and the feature vector is then converted into a phoneme sequence by an acoustic model. After that, the phoneme sequence is converted into a word string by a language model, and the obtained word string is output as the recognition result. That is, in conventional speech recognition, the signal processing by the acoustic model and the recognition processing by the language model are performed separately.
On the other hand, in speech recognition in recent years, with the progress of deep learning, a method (E2E: End-to-End) that integrally learns a model representing the relationship between the speech signal and the word string has been proposed. According to the E2E method, learning with a large amount of training data can be performed efficiently, so an improvement in speech recognition accuracy can be expected.
Further, in order to further improve the speech recognition accuracy, a technique using an embedded vector as a parameter of a model in E2E has been proposed (see, for example, Non-Patent Document 1). Specifically, in the technique disclosed in Non-Patent Document 1, an audio signal and a word vector related thereto are learned together to construct a model. Further, in the technique disclosed in Non-Patent Document 1, the word vector is generated by first recognizing the speech before and after the speech to be learned, dividing the text obtained by the speech recognition into words, and calculating an embedded vector from each word. The embedded vector may be calculated from the one-hot expression of each word.
According to the technique disclosed in Non-Patent Document 1 described above, it is considered possible to improve the speech recognition accuracy as compared with the case where the embedded vector is not used. However, in the technique disclosed in Non-Patent Document 1, since the embedded vector itself is generated by speech recognition, there is a limit to the improvement in speech recognition accuracy.
An example object of the invention is to solve the above problem and to provide a speech recognition apparatus, a speech recognition method, and a computer-readable recording medium that perform speech recognition using an embedded vector generated without using speech recognition. Another example object is to provide an acoustic model learning apparatus that learns an acoustic model for performing such speech recognition.
In order to achieve the above-described object, a speech recognition apparatus according to an example aspect of the invention includes:
a data acquisition unit that acquires speech data and sensor data to be recognized,
a speech recognition unit that converts the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
In order to achieve the above-described object, an acoustic model learning apparatus according to an example aspect of the invention includes:
a data acquisition unit that acquires speech data to be training data, teacher data to be the training data, and sensor data related to the training data,
an acoustic model construction unit that constructs an acoustic model by machine learning using an embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
In addition, in order to achieve the above-described object, a speech recognition method according to an example aspect of the invention includes:
a data acquisition step of acquiring speech data and sensor data to be recognized,
a speech recognition step of converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
Further, in order to achieve the above-described object, a computer-readable recording medium according to an example aspect of the invention includes a program recorded thereon, the program including instructions that cause a computer to carry out:
a data acquisition step of acquiring speech data and sensor data to be recognized,
a speech recognition step of converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
As described above, according to the present invention, it is possible to perform the speech recognition using the embedded vector generated without using speech recognition.
Hereinafter, in the first example embodiment, an acoustic model learning apparatus, an acoustic model learning method, and a program for realizing these will be described with reference to
[Apparatus Configuration]
First, a configuration of the acoustic model learning apparatus according to the first example embodiment will be described using
The acoustic model learning apparatus 10 according to the first example embodiment shown in
In this configuration, the data acquisition unit 11 acquires speech data to be training data, teacher data to be the training data, and sensor data related to the training data. The acoustic model construction unit 12 constructs an acoustic model by machine learning using an embedded vector in addition to the speech data to be the training data and the teacher data to be the training data. The embedded vector is generated from the sensor data related to the training data acquired by the data acquisition unit 11.
As described above, in the first example embodiment, the acoustic model learning apparatus 10 can construct the acoustic model using the embedded vector generated without using speech recognition.
Subsequently, the configuration and function of the acoustic model learning apparatus 10 according to the first example embodiment will be described more specifically.
First, in the first example embodiment, the data acquisition unit 11 acquires speech data and teacher data to be training data from an external terminal device or the like connected by a network or the like. The teacher data is text data obtained by transcribing the utterance of the speech data.
In the first example embodiment, the acoustic model construction unit 12 first generates an embedded vector using the sensor data related to the training data. Specifically, when the sensor data is input, the acoustic model construction unit 12 inputs the sensor data related to the training data into the model that outputs the data related to the sensor data, and generates the embedded vector from the data output from the model. Examples of the sensor data include image data, temperature data, location data, time data, illuminance data, and the like. In the first example embodiment, any one of these is used.
An example of the embedded vector will be described below with reference to
In the example of
For example, in
In the example of
In the example of
In the example of
Next, the acoustic model construction unit 12 applies the acquired words to the dimensions (leftmost column) of the preset vector and generates the embedded vector by setting each dimension that matches a word to "1" and each dimension that does not match any word to "0".
In the example of
Further, in the example of
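The "1"/"0" assignment described above amounts to a multi-hot encoding over the preset dimensions. A minimal sketch, in which the vocabulary assigned to the dimensions is hypothetical:

```python
# Preset vector: each dimension (the leftmost column in the figure)
# corresponds to one word. The words here are invented for illustration.
PRESET_DIMENSIONS = ["kitchen", "station", "night", "rain", "meeting"]

def embed_words(acquired_words):
    """Set a dimension to 1 if its word matches one of the acquired
    words, and to 0 otherwise."""
    acquired = set(acquired_words)
    return [1 if word in acquired else 0 for word in PRESET_DIMENSIONS]

print(embed_words(["kitchen", "night"]))  # [1, 0, 1, 0, 0]
```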
[Apparatus Operation]
Next, the operation of the acoustic model learning apparatus 10 according to the first example embodiment will be described with reference to
As shown in
Next, the acoustic model construction unit 12 generates an embedded vector using the sensor data acquired in step A1 (step A2). Specifically, for example, when the sensor data is image data, the acoustic model construction unit 12 generates the embedded vector by the method shown in
Next, the acoustic model construction unit 12 constructs the acoustic model by adding the embedded vector generated in step A2 to the training data acquired in step A1 and executing machine learning (step A3). Specifically, the acoustic model construction unit 12 updates the parameters of the acoustic model by inputting the training data and the embedded vector into the existing acoustic model, for example.
Steps A1 to A3 are executed each time training data is acquired. Further, by repeatedly executing steps A1 to A3, the accuracy of the acoustic model is also improved.
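Step A3 only states that the embedded vector is added to the training data before the parameters are updated. One way to realize that is sketched below with a toy linear model and a squared-error gradient step; the model form, feature shapes, and learning rate are all assumptions, since the document does not specify the acoustic model's architecture.

```python
def training_step(weights, speech_features, embedded_vector,
                  teacher_label, lr=0.01):
    """One execution of step A3: the embedded vector is concatenated with
    the speech features, a toy linear acoustic model is applied, and its
    parameters are updated from the error against the teacher data."""
    x = speech_features + embedded_vector            # add embedding to input
    error = sum(w * xi for w, xi in zip(weights, x)) - teacher_label
    return [w - lr * error * xi for w, xi in zip(weights, x)]  # SGD step

# Hypothetical sample: three speech features plus a two-dimensional
# embedded vector, with teacher label 1.0.
w = [0.0] * 5
w = training_step(w, [0.2, 0.4, 0.1], [1.0, 0.0], teacher_label=1.0)
```

Repeating this step over many samples corresponds to the repeated execution of steps A1 to A3 described above.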
As described above, according to the first example embodiment, it is possible to construct the acoustic model using the embedded vector generated without using speech recognition. Therefore, according to this acoustic model, it is possible to perform speech recognition using an embedded vector generated without using speech recognition.
[Modified example]
In the first example embodiment described above, the sensor data is only one of the image data, the temperature data, the location data, the time data, and the illuminance data, but the first example embodiment is not limited to this aspect. In the first example embodiment, the sensor data may be a combination of two or more among image data, temperature data, location data, time data, and illuminance data. Further, in this case, the acoustic model construction unit 12 generates the embedded vector for each of the combined sensor data and executes machine learning using the generated embedded vector for each combined sensor data.
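One plausible realization of this modified example is sketched below, with concatenation as the combination rule; the per-sensor encoders are invented for illustration, and the document leaves the actual combination operation open.

```python
def combined_embedded_vector(sensor_inputs, encoders):
    """Generate an embedded vector for each kind of combined sensor data
    (e.g., temperature and time) and join them into one vector; plain
    concatenation is assumed here."""
    combined = []
    for kind, data in sensor_inputs.items():
        combined.extend(encoders[kind](data))
    return combined

# Hypothetical encoders for two of the listed sensor kinds.
encoders = {
    "temperature": lambda t: [1, 0] if t >= 20.0 else [0, 1],   # warm / cold
    "time":        lambda h: [1, 0] if 6 <= h < 18 else [0, 1],  # day / night
}
print(combined_embedded_vector({"temperature": 23.5, "time": 21}, encoders))
# [1, 0, 0, 1]
```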
[Program]
It is sufficient that the program according to the first example embodiment be a program that causes a computer to execute steps A1 to A3 illustrated in
Also, the program according to the first example embodiment may be executed by a computer system constructed by a plurality of computers. In this case, for example, each computer may function as one of the data acquisition unit 11 and the acoustic model construction unit 12.
Next, in the second example embodiment, a speech recognition apparatus, a speech recognition method, and a program for realizing these will be described with reference to
[Apparatus Configuration]
First, a configuration of the speech recognition apparatus according to the second example embodiment will be described with reference to
The speech recognition apparatus 20 according to the second example embodiment shown in
In this configuration, the data acquisition unit 21 acquires speech data and sensor data to be recognized. The speech recognition unit 22 converts the acquired speech data into text data by applying the acquired speech data and sensor data to the acoustic model.
Further, in the second example embodiment, the acoustic model is constructed by machine learning using an embedded vector generated from sensor data related to the training data in addition to the speech data to be the training data and teacher data to be the training data.
Therefore, according to the speech recognition apparatus 20 in the second example embodiment, the speech recognition can be executed by using the embedded vector generated without using the speech recognition.
Subsequently, the configuration and function of the speech recognition apparatus 20 according to the second example embodiment will be described more specifically.
First, in the second example embodiment, the data acquisition unit 21 acquires speech data and sensor data to be recognized from an external terminal device or the like connected by a network or the like. Examples of the sensor data include image data, temperature data, location data, time data, illuminance data, and the like, as in the first example embodiment.
Further, the acoustic model used in the second example embodiment is constructed by the acoustic model learning apparatus 10 according to the first example embodiment using the embedded vector. Therefore, in the second example embodiment, the speech recognition unit 22 first generates the embedded vector from the sensor data acquired by the data acquisition unit 21. Specifically, the speech recognition unit 22 generates the embedded vector by the same method as the acoustic model construction unit 12 according to the first example embodiment.
For example, when the sensor data is image data, the speech recognition unit 22 generates the embedded vector by the method shown in
Then, the speech recognition unit 22 converts the speech data into text data by applying the speech data and the generated embedded vector to the acoustic model.
[Apparatus Operation]
Next, the operation of the speech recognition apparatus 20 according to the second example embodiment will be described with reference to
As shown in
Next, the speech recognition unit 22 generates the embedded vector using the sensor data acquired in step B1 (step B2). Specifically, for example, when the sensor data is image data, the speech recognition unit 22 generates the embedded vector by the method shown in
Next, the speech recognition unit 22 converts the speech data into text data by applying the speech data acquired in step B1 and the embedded vector generated in step B2 to the acoustic model (step B3). Further, the acoustic model used in step B3 is constructed by executing steps A1 to A3 shown in
Steps B1 to B3 are executed each time the speech data to be recognized and the sensor data are acquired. Further, the speech data is accurately recognized by steps B1 to B3.
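Steps B1 to B3 can be sketched end to end as follows. The embedding function and the acoustic-model interface are stand-ins: the document specifies only that the speech data and the generated embedded vector are applied to the acoustic model.

```python
def recognize(speech_data, sensor_data, embed, acoustic_model):
    """Step B2: generate the embedded vector from the acquired sensor
    data. Step B3: apply the speech data and the embedded vector to the
    acoustic model to obtain text data."""
    embedded_vector = embed(sensor_data)
    return acoustic_model(speech_data, embedded_vector)

# Hypothetical stubs standing in for a trained encoder and model.
embed = lambda sensors: [1 if "kitchen" in sensors else 0]
acoustic_model = (lambda speech, ev:
                  "turn on the light" if ev == [1] else "(unrecognized)")
print(recognize(b"<speech frames>", ["kitchen"], embed, acoustic_model))
# turn on the light
```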
As described above, according to the second example embodiment, it is possible to execute speech recognition by using the embedded vector generated without using speech recognition.
As described in the modified example of the first example embodiment described above, also in the second example embodiment, the sensor data may be a combination of two or more among the image data, the temperature data, the location data, the time data, and the illuminance data. In this case, the data acquisition unit 21 acquires all of the combined sensor data. Further, the speech recognition unit 22 generates the embedded vector for each of the combined sensor data. Then, the speech recognition unit 22 applies the embedded vector generated for each piece of sensor data to the acoustic model to convert the speech data into text data.
[Program]
It is sufficient that the program according to the second example embodiment be a program that causes a computer to execute steps B1 to B3 illustrated in
Also, the program according to the second example embodiment may be executed by a computer system constructed by a plurality of computers. In this case, for example, each computer may function as one of the data acquisition unit 21 and the speech recognition unit 22.
[Modified example]
Subsequently, a modified example of the speech recognition apparatus according to the second example embodiment will be described with reference to FIG. 10.
As shown in
With such a configuration, in this modified example, the speech recognition apparatus 20 can also function as the acoustic model learning apparatus, so it is possible to construct the acoustic model and perform speech recognition with one apparatus.
(Physical Configuration)
Here, a computer that realizes the acoustic model learning apparatus 10 by executing the program according to the first example embodiment, and a computer that realizes the speech recognition apparatus 20 by executing the program according to the second example embodiment will be described with reference to
As illustrated in
Note that the computer 110 may include a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) in addition to the CPU 111 or in place of the CPU 111.
The CPU 111 carries out various types of computation by deploying the program (codes) according to the example embodiment stored in the storage device 113 to the main memory 112, and executing the codes in a predetermined order. The main memory 112 is typically a volatile storage device, such as a DRAM (Dynamic Random-Access Memory). Also, the program according to the first and second example embodiments is provided in a state where it is stored in a computer readable recording medium 120. Note that the program according to the present example embodiment may also be distributed over the Internet connected via the communication interface 117.
Furthermore, specific examples of the storage device 113 include a hard disk drive, and also a semiconductor storage device, such as a flash memory. The input interface 114 mediates data transmission between the CPU 111 and an input device 118, such as a keyboard and a mouse. The display controller 115 is connected to a display device 119, and controls displays on the display device 119.
The data reader/writer 116 mediates data transmission between the CPU 111 and the recording medium 120, and executes readout of the program from the recording medium 120, as well as writing of the result of processing in the computer 110 to the recording medium 120. The communication interface 117 mediates data transmission between the CPU 111 and another computer.
Also, specific examples of the recording medium 120 include: a general-purpose semiconductor storage device, such as CF (Compact Flash®) and SD (Secure Digital); a magnetic recording medium, such as Flexible Disk; and an optical recording medium, such as CD-ROM (Compact Disk Read Only Memory).
Note that the acoustic model learning apparatus 10 and the speech recognition apparatus 20 according to the example embodiments can also be realized by using items of hardware corresponding to respective components, rather than by using the computer with the program installed therein. Furthermore, a part of the acoustic model learning apparatus 10 and the speech recognition apparatus 20 may be realized by the program, and the remaining part of these apparatuses may be realized by hardware.
A part or all of the aforementioned example embodiment can be described as, but is not limited to, the following (Supplementary note 1) to (Supplementary note 24).
(Supplementary Note 1)
A speech recognition apparatus comprising:
a data acquisition unit that acquires speech data and sensor data to be recognized,
a speech recognition unit that converts the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
(Supplementary Note 2)
The speech recognition apparatus according to Supplementary note 1, wherein the speech recognition unit generates the embedded vector from the acquired sensor data and converts the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
(Supplementary Note 3)
The speech recognition apparatus according to Supplementary note 1 or 2, further comprising:
an acoustic model construction unit that constructs the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
(Supplementary Note 4)
The speech recognition apparatus according to Supplementary note 3, wherein the acoustic model construction unit inputs the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
the acoustic model construction unit generates the embedded vector using the data output from the model and constructs the acoustic model using the generated embedded vector.
(Supplementary Note 5)
The speech recognition apparatus according to any one of Supplementary notes 1 to 4,
the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
(Supplementary Note 6)
An acoustic model learning apparatus comprising:
a data acquisition unit that acquires speech data to be training data, teacher data to be the training data, and sensor data related to the training data,
an acoustic model construction unit that constructs an acoustic model by machine learning using an embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
(Supplementary Note 7)
The acoustic model learning apparatus according to Supplementary note 6,
wherein the acoustic model construction unit inputs the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
the acoustic model construction unit generates the embedded vector using the data output from the model and constructs the acoustic model using the generated embedded vector.
(Supplementary Note 8)
The acoustic model learning apparatus according to Supplementary note 6 or 7, the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
(Supplementary Note 9)
A speech recognition method comprising:
a data acquisition step of acquiring speech data and sensor data to be recognized,
a speech recognition step of converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
(Supplementary Note 10)
The speech recognition method according to Supplementary note 9,
wherein, in the speech recognition step, generating the embedded vector from the acquired sensor data and converting the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
(Supplementary Note 11)
The speech recognition method according to Supplementary note 9 or 10, further comprising:
an acoustic model construction step of constructing the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
(Supplementary Note 12)
The speech recognition method according to Supplementary note 11,
wherein, in the acoustic model construction step,
inputting the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
generating the embedded vector using the data output from the model, and
constructing the acoustic model using the generated embedded vector.
(Supplementary Note 13)
The speech recognition method according to any one of Supplementary notes 9 to 12,
the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
(Supplementary Note 14)
An acoustic model construction method comprising:
a data acquisition step of acquiring speech data to be training data, teacher data to be the training data, and sensor data related to the training data,
an acoustic model construction step of constructing an acoustic model by machine learning using an embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
(Supplementary Note 15)
The acoustic model construction method according to Supplementary note 14,
wherein, in the acoustic model construction step, inputting the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input, and
generating the embedded vector using the data output from the model and constructing the acoustic model using the generated embedded vector.
(Supplementary Note 16)
The acoustic model construction method according to Supplementary note 14 or 15,
the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
(Supplementary Note 17)
A computer-readable recording medium that includes a program, the program including instructions that cause a computer to carry out:
a data acquisition step of acquiring speech data and sensor data to be recognized,
a speech recognition step of converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
(Supplementary Note 18)
The computer-readable recording medium according to Supplementary note 17,
wherein, in the speech recognition step, generating the embedded vector from the acquired sensor data and converting the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
(Supplementary Note 19)
The computer-readable recording medium according to Supplementary note 17 or 18, the program further including instructions that cause the computer to carry out:
an acoustic model construction step of constructing the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
(Supplementary Note 20)
The computer-readable recording medium according to Supplementary note 19,
wherein, in the acoustic model construction step,
inputting the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
generating the embedded vector using the data output from the model, and
constructing the acoustic model using the generated embedded vector.
(Supplementary Note 21)
The computer-readable recording medium according to any one of Supplementary notes 17 to 20,
the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
(Supplementary Note 22)
A computer-readable recording medium that includes a program, the program including instructions that cause a computer to carry out:
a data acquisition step of acquiring speech data to be training data, teacher data to be the training data, and sensor data related to the training data,
an acoustic model construction step of constructing an acoustic model by machine learning using an embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
(Supplementary Note 23)
The computer-readable recording medium according to Supplementary note 22,
wherein, in the acoustic model construction step, inputting the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input, and
generating the embedded vector using the data output from the model and constructing the acoustic model using the generated embedded vector.
(Supplementary Note 24)
The computer-readable recording medium according to Supplementary note 22 or 23,
the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
The invention has been described with reference to an example embodiment above, but the invention is not limited to the above-described example embodiment. Within the scope of the invention, various changes that could be understood by a person skilled in the art could be applied to the configurations and details of the invention.
As described above, according to the present invention, it is possible to perform speech recognition using an embedded vector generated without using speech recognition. The present invention is effective for various systems in which speech recognition is performed.
Filing Document: PCT/JP2020/006080
Filing Date: 2/17/2020
Country: WO