The present application is based on and claims priority to Chinese Application No. 202210426886.0 filed on Apr. 21, 2022, the disclosure of which is incorporated by reference herein in its entirety.
The present disclosure relates to the field of information technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
In speech recognition, high recognition accuracy is usually obtained with the aid of language models. Common language models include the Long Short Term Memory (LSTM) model.
In the process of performing speech recognition based on the LSTM model, the most accurate one will usually be determined, through a series of operations, from a plurality of Chinese characters or syllables corresponding to an audio frame at a certain time.
Embodiments of the present disclosure provide a speech recognition method and apparatus, an electronic device, and a storage medium.
An embodiment of the present disclosure provides a speech recognition method, comprising:
An embodiment of the present disclosure further provides a speech recognition apparatus, comprising:
An embodiment of the present disclosure provides an electronic device, which comprises:
An embodiment of the present disclosure further provides a computer readable storage medium storing thereon a computer program which, when executed by a processor, implements the speech recognition method described above.
An embodiment of the present disclosure further provides a computer program product comprising a computer program or instructions which, when executed by a processor, cause the processor to implement the speech recognition method described above.
An embodiment of the present disclosure further provides a computer program, comprising computer readable instructions which, when executed by a processor, cause the processor to implement the speech recognition method described above.
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed implementations when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals indicate the same or similar elements. It should be understood that the drawings are schematic, and the components and elements are not necessarily drawn to scale.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here, but rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are only used for illustrative purposes, and are not used to limit the protection scope of the present disclosure.
It should be understood that the steps described in the method implementations of the present disclosure can be executed in a different order and/or in parallel. Furthermore, method implementations can include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term “including” and its variants are open-ended, that is, “including but not limited to”. The term “based on” means “at least partially based on”. The term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one other embodiment”; the term “some embodiments” means “at least some embodiments”. Related definitions of other terms will be given in the following description.
It is to be noted that the concepts of “first” and “second” mentioned in the present disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order or interdependence of the functions executed by these apparatuses, modules or units.
It is to be noted that the modifiers “one” and “a plurality” mentioned in the present disclosure are schematic rather than limiting, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as “one or more”.
Names of messages or information exchanged among a plurality of apparatuses in the embodiments of the present disclosure are only used for illustrative purposes, and are not used to limit the scope of these messages or information.
At present, in the process of performing speech recognition based on the LSTM model, there are problems of a large operation volume and a slow operation speed. The embodiments of the present disclosure provide a speech recognition method and apparatus, an electronic device, and a storage medium, so as to reduce the operation volume, and improve the operation speed and the speech recognition efficiency.
Generally, the LSTM model includes at least one cascaded processing layer. If the LSTM model includes a plurality of cascaded processing layers, then an output of a previous processing layer serves as an input of a next processing layer. Taking the LSTM model including three cascaded processing layers as an example, the schematic structural diagram of the LSTM model is shown in
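The cascading of processing layers described above can be sketched as follows. This is a hypothetical illustration, not the implementation of the present disclosure: each processing layer is modeled simply as a function mapping an input sequence to an output sequence, with placeholder arithmetic layers standing in for actual LSTM layers.

```python
def run_cascaded_layers(layers, inputs):
    """Feed an input sequence through cascaded processing layers:
    the output of each previous layer serves as the input of the
    next layer, and the last layer's output is the model output."""
    sequence = inputs
    for layer in layers:
        sequence = layer(sequence)
    return sequence

# Three placeholder layers standing in for three cascaded LSTM layers.
layers = [lambda seq: [v + 1 for v in seq],
          lambda seq: [v * 2 for v in seq],
          lambda seq: [v - 3 for v in seq]]
```

The same driver would apply unchanged to real LSTM layers, since only the layer-to-layer data flow is modeled here.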
In the process of processing the to-be-recognized speech segments through the LSTM model, a plurality of recognition units (Chinese characters or syllables) with a high probability of corresponding to an audio frame at a certain time will usually be selected as an initial input of the LSTM model, namely an input of a first processing layer of the LSTM model. For example, three candidate Chinese characters with a high probability of corresponding to the audio frame at time t can serve as the input of the first processing layer of the LSTM model, and then the most accurate one of the candidate Chinese characters is determined as the speech recognition result corresponding to the audio frame at time t by performing a series of operations in each processing layer of the LSTM model.
Specifically, in the operation process of each processing layer, a historical state corresponding to the audio frame at a certain time will be added. Referring to the schematic structural diagram of a processing layer including a plurality of processing units as shown in
As learned from
The processing unit determines the output quantity ht of the unit based on the input data set Xt, the correlation quantity Ct−1 and the historical state data set ht−1 corresponding to the input data set Xt prior to the time t of the unit.
As shown in
Correspondingly, an input of the target processing unit includes an output of a last processing unit adjacent to the target processing unit, wherein the output of the last processing unit includes a historical state data set Ht−1 which includes a plurality of historical state subsets ht−1, and a cell state Ct−1. The input of the target processing unit also includes an input data set Xt which includes a plurality of input subsets xt. An output (Ct and ht) of the target processing unit serves as an input of a next processing unit adjacent to the target processing unit.
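As context for the per-unit computation described above, the following is a minimal scalar sketch of a standard LSTM processing unit, which combines the input x_t, the previous output h_prev and the cell state c_prev into a new output and cell state. The weight names W, U and b are illustrative assumptions and not the notation of the present disclosure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM processing unit (scalar sketch).

    W, U, b each hold one weight/bias per gate, keyed by
    "i" (input), "f" (forget), "g" (candidate) and "o" (output).
    """
    i = sigmoid(W["i"] * x_t + U["i"] * h_prev + b["i"])    # input gate
    f = sigmoid(W["f"] * x_t + U["f"] * h_prev + b["f"])    # forget gate
    g = math.tanh(W["g"] * x_t + U["g"] * h_prev + b["g"])  # candidate state
    o = sigmoid(W["o"] * x_t + U["o"] * h_prev + b["o"])    # output gate
    c_t = f * c_prev + i * g      # new cell state C_t
    h_t = o * math.tanh(c_t)      # new output quantity h_t
    return h_t, c_t
```

Note that every gate evaluates an expression of the form (weight · x_t + weight · h_prev), which is exactly the y = xt·A + ht−1·B pattern discussed below in matrix form.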
As learned from
For the above-mentioned operation problem of y, it is usual to first keep xt unchanged and change ht−1 until each ht−1 in Ht−1 is traversed, and then change xt and repeat the above process. This processing mode has the following problem: every time ht−1 is changed, it is necessary to execute the matrix multiplication operation xt·A, which causes the matrix multiplication xt·A to be repeatedly executed. Similarly, after xt is changed, it is also necessary to repeatedly execute the matrix multiplication operation ht−1·B. Assuming that the number of xt is 10 and the number of ht−1 is 10, for the above-mentioned operation problem of y, it is necessary to execute the operation xt·A+ht−1·B totally 100 times according to the existing processing mode, which has the problems of a large operation volume and a slow operation speed.
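The double-loop processing mode described above can be sketched as follows. This is a hypothetical illustration using small pure-Python matrix-vector helpers (A and B are lists of rows); the counter records how many times the full operation y = xt·A + ht−1·B is evaluated.

```python
def matvec(M, v):
    """Product of matrix M (a list of rows) and vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def vec_add(a, b):
    return [p + q for p, q in zip(a, b)]

def double_loop(X, H, A, B):
    """Existing processing mode: for every pair (x_t, h_prev) the full
    operation y = x_t*A + h_prev*B is evaluated from scratch, so with
    10 input subsets and 10 historical state subsets it runs 100 times."""
    results = {}
    evaluations = 0
    for i, x_t in enumerate(X):           # keep x_t unchanged ...
        for j, h_prev in enumerate(H):    # ... while traversing every h_prev
            # x_t*A is recomputed on every inner iteration, even though
            # x_t has not changed -- this is the wasted work.
            results[(i, j)] = vec_add(matvec(A, x_t), matvec(B, h_prev))
            evaluations += 1
    return results, evaluations
```

With len(X) = len(H) = 10 the counter reaches 100, matching the count given in the text.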
For the above problems, the embodiment of the present disclosure provides a speech recognition method, which aims to reduce the operation volume inside the LSTM model and improve the operation speed, so as to improve the speech recognition efficiency.
As shown in
Step 410: inputting a to-be-recognized speech segment to a Long Short Term Memory (LSTM) model.
Step 420: processing the to-be-recognized speech segment through the LSTM model so as to obtain a speech recognition result.
Illustratively, reference can be made to the schematic structural diagram of an LSTM model including a plurality of processing layers as shown in
A plurality of optional Chinese characters corresponding to an audio frame at a given time are usually the Chinese characters with a high probability of corresponding to the speech recognition result of the audio frame. For example, three Chinese characters with a high probability of corresponding to an audio frame at time t can serve as an input data set Xt of a processing unit, where the input data set Xt includes a plurality of input subsets xt, each of which represents a vector of one Chinese character.
In summary, the input data set of the target processing unit includes vectors of a plurality of recognition units corresponding to the i-th audio frame of the to-be-recognized speech segment at time t and a first matching degree corresponding to each recognition unit; the historical state subset of the target processing unit includes vectors of a plurality of recognition units corresponding to the (i−1)-th audio frame of the to-be-recognized speech segment at time (t−1) and a second matching degree corresponding to each recognition unit; the output quantity of the target processing unit includes vectors of a plurality of recognition units corresponding to the i-th audio frame of the to-be-recognized speech segment at time t and a third matching degree corresponding to each recognition unit; wherein the third matching degree is different from the second matching degree, and the third matching degree is used for determining a speech recognition result of the i-th audio frame. That is, the matching degree between each Chinese character and the i-th audio frame can be changed through the processing of the target processing unit, and the speech recognition result of the i-th audio frame is finally obtained through the processing of a plurality of processing units.
Specifically, the output quantity of each of the processing units is determined based on a sum of the product of the input data set of the corresponding unit and the first matrix and the product of the historical state data set of the corresponding unit and the second matrix. The input data set includes a plurality of input subsets, and the historical state data set includes a plurality of historical state subsets.
As learned from
For the above-mentioned operation problem of y, it is usual to first keep xt unchanged and change ht−1 until each ht−1 in Ht−1 is traversed, then change xt, and repeat the above process. This processing mode has the following problem: every time ht−1 is changed, it is necessary to execute the matrix multiplication operation xt·A, which causes the matrix multiplication xt·A to be repeatedly executed. Similarly, after xt is changed, it is also necessary to repeatedly execute the matrix multiplication operation ht−1·B. Assuming that the number of xt is 10 and the number of ht−1 is 10, for the above-mentioned operation problem of y, it is necessary to perform the operation xt·A+ht−1·B totally 100 times according to the existing processing mode, which has the problems of a large operation volume and a slow operation speed.
However, in the embodiment of the present disclosure, the above matrix multiplication operation (y=xt·A+ht−1·B, xt∈Xt, ht−1∈Ht−1) is performed through two single loops, rather than through one double loop as described above.
Specifically, for each input subset xt in the input data set Xt, the product of each input subset xt and the first matrix A is respectively determined so as to obtain a third matrix. For example, there are totally 10 input subsets, which are denoted as x1, x2, x3, x4, x5, x6, x7, x8, x9 and x10 respectively. Then, the product of x1 and matrix A is calculated to obtain y11; the product of x2 and matrix A is calculated to obtain y12; the product of x3 and matrix A is calculated to obtain y13; the product of x4 and matrix A is calculated to obtain y14; the product of x5 and matrix A is calculated to obtain y15; the product of x6 and matrix A is calculated to obtain y16; the product of x7 and matrix A is calculated to obtain y17; the product of x8 and matrix A is calculated to obtain y18; the product of x9 and matrix A is calculated to obtain y19; and the product of x10 and matrix A is calculated to obtain y110. Then the third matrix is [y11, y12, y13, y14, y15, y16, y17, y18, y19, y110]. To sum up, it is necessary to perform the multiplication operation with the first matrix A totally 10 times so as to obtain the third matrix. The above-mentioned process belongs to one single loop.
Similarly, for each historical state subset ht−1 in the historical state data set Ht−1 prior to the target time, the product of each historical state subset ht−1 and the second matrix B is determined respectively so as to obtain a fourth matrix. For example, there are totally 10 historical state subsets, which are respectively denoted as h1, h2, h3, h4, h5, h6, h7, h8, h9 and h10. Then, the product of h1 and matrix B is calculated to obtain y21; the product of h2 and matrix B is calculated to obtain y22; the product of h3 and matrix B is calculated to obtain y23; the product of h4 and matrix B is calculated to obtain y24; the product of h5 and matrix B is calculated to obtain y25; the product of h6 and matrix B is calculated to obtain y26; the product of h7 and matrix B is calculated to obtain y27; the product of h8 and matrix B is calculated to obtain y28; the product of h9 and matrix B is calculated to obtain y29; and the product of h10 and matrix B is calculated to obtain y210. Then the fourth matrix is [y21, y22, y23, y24, y25, y26, y27, y28, y29, y210]. To sum up, it is necessary to perform the multiplication operation with the second matrix B totally 10 times so as to obtain the fourth matrix, and the above process also belongs to one single loop. Therefore, in the embodiment of the present disclosure, the above-mentioned matrix multiplication operation (y=xt·A+ht−1·B, xt∈Xt, ht−1∈Ht−1) is performed through two single loops, which requires performing the matrix multiplication operation totally 20 times. Compared with performing the matrix multiplication operation 100 times, the operation volume is reduced and the operation efficiency is improved, so that the speech recognition efficiency is improved.
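The two single loops described above can be sketched as follows, continuing the same hypothetical pure-Python helpers. The first loop builds the third matrix (all products with A), the second loop builds the fourth matrix (all products with B), and every pair then needs only a cheap element-wise addition rather than fresh matrix multiplications.

```python
def matvec(M, v):
    """Product of matrix M (a list of rows) and vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def two_single_loops(X, H, A, B):
    """Processing mode of the embodiment: the third matrix caches every
    x_t*A and the fourth matrix caches every h_prev*B, so only
    len(X) + len(H) matrix-vector multiplications run (20 instead of
    100 for 10 input subsets and 10 historical state subsets)."""
    third = [matvec(A, x_t) for x_t in X]          # single loop 1
    fourth = [matvec(B, h_prev) for h_prev in H]   # single loop 2
    # Combining any pair (i, j) is now a matrix addition, not a
    # multiplication: y = third[i] + fourth[j].
    results = {(i, j): [a + b for a, b in zip(ya, yb)]
               for i, ya in enumerate(third)
               for j, yb in enumerate(fourth)}
    return results, len(X) + len(H)
```

The results are identical to those of the double-loop mode; only the number of matrix multiplications changes, from len(X)·len(H) evaluations of the full expression down to len(X)+len(H) cached products.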
Further, an intermediate quantity is determined based on the third matrix and the fourth matrix, and the intermediate quantity is used for determining an output quantity for the target time of the corresponding unit. Specifically, the output quantity for the target time can be determined according to the operation formula of each logic gate in
In summary, the determining, for each input subset in the input data set of the corresponding unit, a product of each input subset and the first matrix respectively so as to obtain a third matrix comprises:
The determining, for each historical state subset in the historical state data set prior to the target time of the corresponding unit, a product of each historical state subset and the second matrix respectively so as to obtain a fourth matrix comprises:
The determining an intermediate quantity based on the third matrix and the fourth matrix includes: performing a matrix addition operation on the third matrix and the fourth matrix so as to obtain the intermediate quantity.
According to the speech recognition method provided by the embodiment of the present disclosure, when the matrix multiplication operation is performed inside the LSTM model, it is performed through two single loops, so that the operation volume can be reduced, and the operation speed and the speech recognition efficiency can be improved.
Optionally, the output quantity of each of the processing units is determined based on a sum of a product of the input data set of the corresponding unit and a first matrix and a product of the historical state data set of the corresponding unit and a second matrix. The input data set comprises a plurality of input subsets, and the historical state data set comprises a plurality of historical state subsets.
Optionally, the processing module 520 includes a first determination unit for determining, for each input subset in the input data set of the corresponding unit, a product of each input subset and the first matrix respectively so as to obtain a third matrix; a second determination unit for determining, for each historical state subset in the historical state data set prior to the target time of the corresponding unit, a product of each historical state subset and the second matrix respectively so as to obtain a fourth matrix; a third determination unit for determining an intermediate quantity based on the third matrix and the fourth matrix, which is used for determining an output quantity for the target time of the corresponding unit.
Optionally, the first determination unit is specifically used for determining, for a current input subset, a product of the current input subset and the first matrix as a first result corresponding to the current input subset, the current input subset being one of the plurality of input subsets; ranking the first result corresponding to each input subset according to a ranking relation of each input subset in the input data set so as to obtain a third matrix.
The second determination unit is specifically used for determining, for a current historical state subset, a product of the current historical state subset and the second matrix as a second result corresponding to the current historical state subset, the current historical state subset being one of the plurality of historical state subsets; ranking the second result corresponding to each historical state subset according to a ranking relation of each historical state subset in the historical state data set so as to obtain a fourth matrix.
Optionally, the third determination unit is specifically used for performing a matrix addition operation on the third matrix and the fourth matrix so as to obtain the intermediate quantity.
Optionally, the input data set of the target processing unit comprises vectors of a plurality of recognition units corresponding to the i-th audio frame of the to-be-recognized speech segment at time t and a first matching degree corresponding to each recognition unit; the historical state subset of the target processing unit comprises vectors of a plurality of recognition units corresponding to the (i−1)-th audio frame of the to-be-recognized speech segment at time (t−1) and a second matching degree corresponding to each recognition unit; the output quantity of the target processing unit comprises vectors of a plurality of recognition units corresponding to the i-th audio frame of the to-be-recognized speech segment at time t and a third matching degree corresponding to each recognition unit; wherein the third matching degree is different from the second matching degree, and the third matching degree is used for determining a speech recognition result of the i-th audio frame. The speech recognition apparatus provided by the embodiment of the present disclosure can execute the steps executed by the client or the server in the speech recognition method provided by the embodiment of the present disclosure, and has the corresponding execution steps and beneficial effects, which will not be repeated here.
As shown in
Generally, the following means can be connected to the I/O interface 505: an input means 506 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; an output means 507 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; a storage device 508 including, for example, a magnetic tape, a hard disk, and the like; and a communication means 509. The communication means 509 can allow the electronic device 500 to communicate wirelessly or via wire with other devices to exchange data. Although
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flow diagram can be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product including a computer program carried on a non-transient computer readable medium. The computer program contains program codes for executing the method shown in the flow diagram, thereby realizing the speech recognition method as described above. In such an embodiment, the computer program can be downloaded and installed from the network through the communication means 509, or installed from the storage means 508 or from the ROM 502. When the computer program is executed by the processing means 501, the above functions defined in the method of the embodiment of the present disclosure are executed.
It is to be noted that the computer readable medium mentioned above in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer readable storage medium can include, but are not limited to, an electrical connection with one or more wires, a portable computer disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer readable storage medium can be any tangible medium containing or storing a program, which can be used by or in combination with an instruction execution system, apparatus or device. In the present disclosure, the computer readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, in which computer readable program codes are carried. This propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal or any appropriate combination of the above. The computer readable signal medium can also be any computer readable medium other than the computer readable storage medium, which can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. The program codes contained in the computer readable medium can be transmitted by any appropriate medium, including but not limited to: wires, optical cables, RF (radio frequency) and the like, or any appropriate combination of the above.
In some embodiments, clients and servers can communicate by using any currently known or future developed network protocol such as the HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a Local Area Network (“LAN”), a Wide Area Network (“WAN”), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The above-mentioned computer readable medium can be contained in the above-mentioned electronic device, or can exist alone without being assembled into the electronic device.
The above-mentioned computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: input a to-be-recognized speech segment to a Long Short Term Memory (LSTM) model; process the to-be-recognized speech segment through the LSTM model so as to obtain a speech recognition result; wherein, the LSTM model comprises at least one processing layer, each of the processing layers comprises a plurality of processing units respectively, each of the processing units determines an output quantity for the target time of the corresponding unit through two single loops based on an input data set of the corresponding unit and a historical state data set prior to a target time, the target time is a time corresponding to the input data set of the corresponding unit, and the output quantity for each time prior to the target time comprises the historical state data set prior to the target time; an output of a former processing layer of two adjacent processing layers serves as an input of a latter processing layer, and an output of a former processing unit of two adjacent processing units serves as an input of a latter processing unit; an input data set of a first processing layer of the LSTM model comprises vectors of a plurality of recognition units respectively corresponding to each audio frame in the to-be-recognized speech segment, and an output of a last processing layer of the LSTM model is used for determining the speech recognition result. Optionally, when the above one or more programs are executed by the electronic device, the electronic device can also execute other steps described in the above embodiment.
Computer program codes for executing the operations of the present disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages, such as Java, Smalltalk, C++, as well as conventional procedural programming languages, such as “C” language or similar programming languages. The program codes can be completely executed on the user's computer, partially executed on the user's computer, executed as an independent software package, partially executed on the user's computer and partially executed on a remote computer, or completely executed on the remote computer or server. In the case involving a remote computer, the remote computer can be connected to a user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
The flow diagrams and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of the system, method and computer program product according to various embodiments of the present disclosure. In this regard, each block in the flow diagrams or block diagrams can represent a module, a program segment, or a part of code that contains one or more executable instructions for implementing specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks can occur in a different order than those noted in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It is also to be noted that each block in the block diagrams and/or flow diagrams, and a combination of blocks in the block diagrams and/or flow diagrams, can be implemented by a dedicated hardware-based system that executes specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The involved units described in the embodiments of the present disclosure can be implemented by software or hardware. The name of the unit does not constitute the limitation of the unit itself in some cases.
The functions described above herein can be at least partially executed by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Product (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD) and so on.
In the context of the present disclosure, a machine readable medium can be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine readable medium can be a machine readable signal medium or a machine readable storage medium. The machine readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any appropriate combination of the above. More specific examples of the machine readable storage medium can include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above.
According to one or more embodiments of the present disclosure, the present disclosure provides a speech recognition method, including: inputting a to-be-recognized speech segment to a Long Short Term Memory (LSTM) model; processing the to-be-recognized speech segment through the LSTM model so as to obtain a speech recognition result; wherein, the LSTM model comprises at least one processing layer, each of the processing layers comprises a plurality of processing units respectively, each of the processing units determines an output quantity for the target time of the corresponding unit through two single loops based on an input data set of the corresponding unit and a historical state data set prior to a target time, the target time is a time corresponding to the input data set of the corresponding unit, and the output quantity for each time prior to the target time comprises the historical state data set prior to the target time; an output of a former processing layer of two adjacent processing layers serves as an input of a latter processing layer, and an output of a former processing unit of two adjacent processing units serves as an input of a latter processing unit; an input data set of a first processing layer of the LSTM model comprises vectors of a plurality of recognition units respectively corresponding to each audio frame in the to-be-recognized speech segment, and an output of a last processing layer of the LSTM model is used for determining the speech recognition result. 
According to one or more embodiments of the present disclosure, in the speech recognition method provided by the present disclosure, optionally, the output quantity of each of the processing units is determined based on a sum of a product of the input data set of the corresponding unit and a first matrix and a product of the historical state data set of the corresponding unit and a second matrix; the input data set comprises a plurality of input subsets, and the historical state data set comprises a plurality of historical state subsets.
According to one or more embodiments of the present disclosure, in the speech recognition method provided by the present disclosure, optionally, each of the processing units determining an output quantity for the target time of the corresponding unit through two single loops based on an input data set of the corresponding unit and a historical state data set prior to a target time comprises: determining, for each input subset in the input data set of the corresponding unit, a product of each input subset and the first matrix respectively so as to obtain a third matrix; determining, for each historical state subset in the historical state data set prior to the target time of the corresponding unit, a product of each historical state subset and the second matrix respectively so as to obtain a fourth matrix; determining an intermediate quantity based on the third matrix and the fourth matrix, which is used for determining an output quantity for the target time of the corresponding unit. According to one or more embodiments of the present disclosure, in the speech recognition method provided by the present disclosure, optionally, the determining, for each input subset in the input data set of the corresponding unit, a product of each input subset and the first matrix respectively so as to obtain a third matrix comprises: determining, for a current input subset, a product of the current input subset and the first matrix as a first result corresponding to the current input subset, the current input subset being one of the plurality of input subsets; ranking the first result corresponding to each input subset according to a ranking relation of each input subset in the input data set so as to obtain a third matrix.
According to one or more embodiments of the present disclosure, in the speech recognition method provided by the present disclosure, optionally, the determining, for each historical state subset in the historical state data set prior to the target time of the corresponding unit, a product of each historical state subset and the second matrix respectively so as to obtain a fourth matrix comprises: determining, for a current historical state subset, a product of the current historical state subset and the second matrix as a second result corresponding to the current historical state subset, the current historical state subset being one of the plurality of historical state subsets; ranking the second result corresponding to each historical state subset according to a ranking relation of each historical state subset in the historical state data set so as to obtain a fourth matrix.
According to one or more embodiments of the present disclosure, in the speech recognition method provided by the present disclosure, optionally, the determining an intermediate quantity based on the third matrix and the fourth matrix includes: performing a matrix addition operation on the third matrix and the fourth matrix so as to obtain the intermediate quantity.
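The two single loops and the matrix addition described above can be sketched as follows. The subset counts, dimensions, and variable names are illustrative assumptions; "ranking" is rendered here as stacking the per-subset results in subset order:

```python
import numpy as np

rng = np.random.default_rng(1)
input_subsets = [rng.normal(size=8) for _ in range(4)]    # plurality of input subsets
state_subsets = [rng.normal(size=16) for _ in range(4)]   # plurality of historical state subsets
W1 = rng.normal(size=(8, 16))   # "first matrix"
W2 = rng.normal(size=(16, 16))  # "second matrix"

# First single loop: product of each input subset and the first matrix,
# with the results ranked in the order of the subsets -> "third matrix".
third = np.stack([x @ W1 for x in input_subsets])

# Second single loop: product of each historical state subset and the
# second matrix, ranked in subset order -> "fourth matrix".
fourth = np.stack([h @ W2 for h in state_subsets])

# Intermediate quantity: matrix addition of the third and fourth matrices.
intermediate = third + fourth
print(intermediate.shape)  # (4, 16)
```

Stacking the per-subset products is equivalent to multiplying the whole (stacked) data set by the corresponding matrix, which is why the two single loops reproduce the relation given earlier.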
According to one or more embodiments of the present disclosure, in the speech recognition method provided by the present disclosure, optionally, the input data set of the target processing unit comprises vectors of a plurality of recognition units corresponding to the i-th audio frame of the to-be-recognized speech segment at time t and a first matching degree corresponding to each recognition unit; the historical state subset of the target processing unit comprises vectors of a plurality of recognition units corresponding to the (i−1)-th audio frame of the to-be-recognized speech segment at time (t−1) and a second matching degree corresponding to each recognition unit; the output quantity of the target processing unit comprises vectors of a plurality of recognition units corresponding to the i-th audio frame of the to-be-recognized speech segment at time t and a third matching degree corresponding to each recognition unit; wherein the third matching degree is different from the second matching degree, and the third matching degree is used for determining a speech recognition result of the i-th audio frame.
According to one or more embodiments of the present disclosure, the present disclosure provides a speech recognition apparatus, comprising: an input module for inputting a to-be-recognized speech segment to a Long Short Term Memory (LSTM) model; a processing module for processing the to-be-recognized speech segment through the LSTM model so as to obtain a speech recognition result; wherein the LSTM model comprises at least one processing layer, each of the processing layers comprises a plurality of processing units respectively, each of the processing units determines an output quantity for the target time of the corresponding unit through two single loops based on an input data set of the corresponding unit and a historical state data set prior to a target time, the target time is a time corresponding to the input data set of the corresponding unit, and the output quantity at each time prior to the target time comprises the historical state data set prior to the target time; an output of a former processing layer of two adjacent processing layers serves as an input of a latter processing layer, and an output of a former processing unit of two adjacent processing units serves as an input of a latter processing unit; an input data set of a first processing layer of the LSTM model comprises vectors of a plurality of recognition units respectively corresponding to each audio frame in the to-be-recognized speech segment, and an output of a last processing layer of the LSTM model is used for determining the speech recognition result.
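The layered structure described above (output of a former layer feeding the latter layer, and output of a former unit feeding the latter unit) can be sketched in much-simplified form. This sketch assumes a plain tanh recurrence in place of full LSTM gating, and all names and dimensions are hypothetical:

```python
import numpy as np

def process_unit(x_t, h_prev, W1, W2):
    """One processing unit: output for the target time, from the input data
    at that time and the historical state prior to it (simplified)."""
    return np.tanh(x_t @ W1 + h_prev @ W2)

def process_layer(frames, W1, W2, hidden):
    """Chain of processing units: each unit's output is the next unit's input."""
    h = np.zeros(hidden)
    outputs = []
    for x_t in frames:  # one unit per audio-frame time step
        h = process_unit(x_t, h, W1, W2)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(2)
frames = rng.normal(size=(5, 8))  # vectors for 5 audio frames (first layer's input)
W1a, W2a = rng.normal(size=(8, 16)), rng.normal(size=(16, 16))
W1b, W2b = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))

layer1 = process_layer(frames, W1a, W2a, 16)  # former layer's output...
layer2 = process_layer(layer1, W1b, W2b, 16)  # ...serves as the latter layer's input
print(layer2.shape)  # (5, 16)
```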
According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device, comprising:
According to one or more embodiments of the present disclosure, the present disclosure provides a computer readable storage medium storing thereon a computer program which, when executed by a processor, implements any of the speech recognition methods as provided by the present disclosure.
According to one or more embodiments of the present disclosure, the present disclosure provides a computer program, comprising computer readable instructions which, when executed by a processor, cause the processor to implement any of the speech recognition methods as provided by the present disclosure.
Compared with the related art, the technical solution provided by the embodiments of the present disclosure has at least the following advantages: the speech recognition method provided by the embodiments of the present disclosure determines an output quantity for the target time of the corresponding unit through two single loops, which reduces the amount of computation and improves the operation speed and the efficiency of speech recognition.
The above description is merely an explanation of the preferred embodiments of the present disclosure and of the technical principles applied. It should be understood by those skilled in the art that the scope of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosure concept, for example, a technical solution formed by replacing the above technical features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that these operations be executed in the particular order shown or in a sequential order. Under certain circumstances, multitasking and parallel processing may be beneficial. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Some features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments individually or in any appropriate sub-combination.
Although the present subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely exemplary forms of implementing the claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202210426886.0 | Apr 2022 | CN | national |
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2023/085410 | Mar 31, 2023 | WO | |