This application claims priority to Taiwan Application Serial No. 107141382, filed on Nov. 21, 2018. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
This disclosure relates to speech recognition techniques, and, more particularly, to a speech recognition system, a speech recognition method, and a computer program product applicable to a specific application scenario.
In general, a speech recognition system is used to convert a user's speech message into text data. A currently popular type of speech recognition system is the general-purpose speech recognition system, such as the Google speech recognition system. A user's speech information can be converted by the general-purpose speech recognition system into a text, which can then be shown by communication software as a chat message, or be posted on social media and viewed by the public. A user therefore does not need to key in the text word by word. In addition, with the development of smart phones, a user can also control a smart phone by voice with the help of the speech recognition system. It can thus be seen that speech recognition is applicable to a variety of applications and is becoming more and more important in daily life.
The common general-purpose speech recognition system can provide speech recognition results of above-average quality. However, the words and sentences used in general scenarios and in specific application scenarios are quite different. Therefore, the words and sentences used in specific application scenarios, such as professional terminology, literary works, specific groups, and specific environments, cannot be well recognized by the general-purpose speech recognition system. For instance, the speech input of a Chinese medical term may be converted into a text output that is far from the original meaning, and may even be meaningless. Moreover, the general-purpose speech recognition system provides the text recognition result without providing any other operation options or detailed information that would allow a developer or a user to perform subsequent processing. The general-purpose speech recognition system outputs a written text, and the written text usually lacks detailed information such as word segmentation and word confidence values. Further, the general-purpose speech recognition system is a cloud service, so a user receives only limited extra information. Therefore, with the general-purpose speech recognition system, a user can hardly improve an imprecise speech recognition result, especially in a specific application scenario.
It is known from the above that, in the use of existing speech recognition systems, how to improve speech recognition results that are not good enough for specific application scenarios has become a research topic in the art.
The present disclosure provides a speech recognition mechanism to increase speech recognition accuracy.
In an exemplary embodiment, a speech recognition system according to the present disclosure is connectible to an external general-purpose speech recognition system, and comprises a processing unit configured for operating a plurality of modules, the plurality of modules comprising: a specific application speech recognition module configured for converting an inputted speech signal into a first phonetic text, the general-purpose speech recognition system converting the speech signal into a written text; a comparison module configured for receiving the first phonetic text from the specific application speech recognition module and the written text from the general-purpose speech recognition system, converting the written text into a second phonetic text, and aligning the second phonetic text with the first phonetic text based on similarity of pronunciation to output a phonetic text alignment result; and an enhancement module configured for receiving the phonetic text alignment result from the comparison module and constituting the phonetic text alignment result, after path weighting, with the written text and the first phonetic text to form an output recognized text.
In another exemplary embodiment, a speech recognition method according to the present disclosure comprises: converting, by a specific application speech recognition module, an inputted speech signal into a first phonetic text, and converting, by a general-purpose speech recognition system, the speech signal into a written text; converting, by a comparison module, the written text into a second phonetic text, and aligning the second phonetic text with the first phonetic text based on similarity of pronunciation, to output a phonetic text alignment result; and receiving, by an enhancement module, the phonetic text alignment result from the comparison module, and constituting the phonetic text alignment result, after path weighting, with the written text and the first phonetic text, to form an output recognized text.
In yet another exemplary embodiment, a computer program product for speech recognition according to the present disclosure completes the above-described speech recognition method when a computer is loaded with and executes the computer program.
The disclosure can be more fully understood by reading the following detailed description of the embodiments, with reference made to the accompanying drawings, wherein:
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be understood, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawing.
In an embodiment, the processing unit 201 is a general-purpose processor, a special-purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in combination with a DSP core, a controller, a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), any other type of integrated circuit, a state machine, an advanced RISC machine (ARM), or the like.
In an embodiment, the input unit 202 is a device or a component that receives speech signals and provides the received speech signals to the storage unit 203. In another embodiment, the input unit 202 is a microphone that collects speech signals, or a device that receives speech signals from other sources (e.g., other devices or storage media).
In an embodiment, the storage unit 203 is any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard drive, or other similar device, or a combination thereof.
The specific application speech recognition module 21 receives the speech signals from the input unit 202, converts the speech signals into a first phonetic text, and outputs the first phonetic text to the comparison module 22. In an embodiment, the written text is in Chinese or in words of any other language, and the phonetic text represents the pronunciation corresponding to the words. For instance, the written text in Chinese "這是文字" corresponds to the phonetic text "Zhe Shi Wen Zi."
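By way of illustration only, the following Python sketch shows one way a written text could be mapped to a phonetic text through a pronunciation dictionary; the dictionary contents and the greedy longest-match strategy are assumptions for the example, not the disclosed implementation.

```python
# Illustrative pronunciation dictionary; entries are assumptions for this example.
PRONUNCIATION_DICT = {
    "這是": ["Zhe", "Shi"],
    "文字": ["Wen", "Zi"],
}

def written_to_phonetic(written: str) -> list[str]:
    """Convert a written text into a phonetic text by greedy longest-match
    lookup of written-text segments in the pronunciation dictionary."""
    phones, i = [], 0
    while i < len(written):
        for j in range(len(written), i, -1):   # try the longest segment first
            if written[i:j] in PRONUNCIATION_DICT:
                phones.extend(PRONUNCIATION_DICT[written[i:j]])
                i = j
                break
        else:
            phones.append(written[i])          # unknown character: pass through
            i += 1
    return phones

print(written_to_phonetic("這是文字"))  # ['Zhe', 'Shi', 'Wen', 'Zi']
```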
The comparison module 22 receives the first phonetic text from the specific application speech recognition module 21 and the written text from the general-purpose speech recognition system 1, and converts the written text into a second phonetic text. The comparison module 22 further aligns the second phonetic text with the first phonetic text based on similarity of pronunciation of each of the phonetic texts and outputs a phonetic text alignment result.
The enhancement module 23 receives the phonetic text alignment result from the comparison module 22, and constitutes the phonetic text alignment result, after path weighting, with the written text and the first phonetic text. The result of this constitution is the output recognized text.
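To make the data flow among the modules concrete, the following minimal sketch wires toy stand-ins for the specific application speech recognition module 21, the general-purpose speech recognition system 1, the comparison module 22, and the enhancement module 23; every function body is an illustrative placeholder rather than the disclosed implementation.

```python
# A minimal sketch of the module interaction described above; all bodies are
# toy stand-ins chosen for illustration only.

def specific_app_recognize(speech: bytes) -> list[str]:
    # Stand-in for the specific application speech recognition module 21:
    # converts the speech signal into a first phonetic text.
    return ["Zhe", "Shi", "Wen", "Zi"]

def general_purpose_recognize(speech: bytes) -> str:
    # Stand-in for the external general-purpose speech recognition system 1:
    # converts the speech signal into a written text.
    return "這是文字"

def to_phonetic(written: str) -> list[str]:
    # Stand-in for the comparison module 22 converting the written text into
    # a second phonetic text (toy per-character table).
    table = {"這": "Zhe", "是": "Shi", "文": "Wen", "字": "Zi"}
    return [table.get(ch, ch) for ch in written]

def align(second: list[str], first: list[str]) -> list[tuple[str, str]]:
    # Stand-in alignment: pairs symbols position by position; the disclosed
    # module aligns by similarity of pronunciation instead.
    return list(zip(second, first))

def enhance(alignment: list[tuple[str, str]], written: str, first: list[str]) -> str:
    # Stand-in for the enhancement module 23: the disclosed module applies
    # path weighting and word constitution rather than this trivial choice.
    return written

speech = b"\x00\x01"                         # placeholder audio payload
first = specific_app_recognize(speech)       # first phonetic text
written = general_purpose_recognize(speech)  # written text
second = to_phonetic(written)                # second phonetic text
print(enhance(align(second, first), written, first))
```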
The distribution module 24, after receiving the speech signals from the input unit 202, distributes the speech signals to the general-purpose speech recognition system 1 and the specific application speech recognition module 21 at the same time.
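A minimal sketch of such simultaneous distribution might look as follows, assuming a thread pool as the concurrency mechanism (the disclosure does not specify one); the recognizer callables are illustrative placeholders.

```python
# Fan the same speech signal out to both recognizers at the same time.
from concurrent.futures import ThreadPoolExecutor

def distribute(speech, specific_recognizer, general_recognizer):
    """Submit the speech signal to both recognizers concurrently and collect
    (first phonetic text, written text)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(specific_recognizer, speech)
        written_future = pool.submit(general_recognizer, speech)
        return first_future.result(), written_future.result()

first, written = distribute(
    b"audio",
    lambda s: ["Zhe", "Shi"],  # stand-in specific application module
    lambda s: "這是",           # stand-in general-purpose system
)
print(first, written)
```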
The confusion phone path extending unit 232 receives the phonetic text alignment result whose path weights have been determined by the path weighting unit 231, reads the phonetic confusion table 27, and extends similar phones in a parallel manner for the pronunciations that received lower confidence values during the recognition process, with the weights of the similar phones referring to the result of the path weighting. Confusion phones can be obtained from prior knowledge or by a data-driven method: prior knowledge is derived from acoustic theory, while the data-driven method learns from experiments which phones are likely to be confused with one another. Each of the second phonetic text and the first phonetic text has a confidence value, and the confusion phone path extending unit 232 expands similar phones in a parallel manner for each phonetic text having a confidence value lower than a threshold value; the weight of each similar phone refers to the distributed weight of the path weighting.
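The following sketch illustrates the parallel extension of confusion phone paths under stated assumptions: the confusion table contents, the confidence threshold, and the multiplicative weighting are all placeholders chosen for the example.

```python
# Parallel confusion phone path extension (illustrative assumptions only).
CONFUSION_TABLE = {
    "Zhi": {"Zi": 0.8, "Ji": 0.4},  # phone -> {similar phone: confusion degree}
    "Shi": {"Si": 0.7},
}

def extend_confusion_paths(weighted_phones, confidences, threshold=0.6):
    """weighted_phones: list of (phone, path weight); confidences: per-phone
    confidence values. Returns a lattice: parallel alternatives per position."""
    lattice = []
    for (phone, weight), conf in zip(weighted_phones, confidences):
        alternatives = [(phone, weight)]
        if conf < threshold:  # low confidence: extend similar phones in parallel
            for similar, degree in CONFUSION_TABLE.get(phone, {}).items():
                # each similar phone's weight refers back to the path weighting
                alternatives.append((similar, weight * degree))
        lattice.append(alternatives)
    return lattice

print(extend_confusion_paths([("Zhi", 1.0), ("Shi", 1.0)], [0.4, 0.9]))
# [[('Zhi', 1.0), ('Zi', 0.8), ('Ji', 0.4)], [('Shi', 1.0)]]
```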
The word constitution unit 233 reads the specific application phonetic-vocabulary mapping table, converts phonetic text paragraphs that may constitute specific application terms into those terms, and constitutes the phonetic text alignment result, the written text, and the first phonetic text with respect to the specific application phonetic vocabularies. During constitution, terms dedicated to specific applications have a high priority, and general terms have a low priority. The word constitution unit 233 receives the phonetic text alignment result, the written text, and the first phonetic text, and outputs a recognized text. The paths and weights of the phonetic text alignment result can also be distributed by the path weighting unit 231 and expanded by the confusion phone path extending unit 232.
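As an illustration of the constitution step, the sketch below greedily matches the longest phonetic paragraph against a toy specific application phonetic-vocabulary mapping table, giving specific terms priority over a general fallback; the table entries and the longest-match policy are assumptions of the example.

```python
# Word constitution with a toy phonetic-vocabulary mapping table.
PHONETIC_VOCAB = {
    ("Xin", "Dian", "Tu"): "心電圖",  # assumed medical (specific application) term
}

def constitute_words(phones: list[str]) -> list[str]:
    words, i = [], 0
    while i < len(phones):
        for j in range(len(phones), i, -1):  # longest phonetic paragraph first
            if tuple(phones[i:j]) in PHONETIC_VOCAB:  # specific term: high priority
                words.append(PHONETIC_VOCAB[tuple(phones[i:j])])
                i = j
                break
        else:
            words.append(phones[i])  # general fallback: low priority
            i += 1
    return words

print(constitute_words(["Qing", "Zuo", "Xin", "Dian", "Tu"]))
# ['Qing', 'Zuo', '心電圖']
```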
The specific application speech recognition module 21 performs recognition using a search network established from the phonetic text corpus.
In an embodiment, the path weighting unit 231 uses an S-shaped (sigmoid) function, for example S(x) = b + r / (1 + e^(−s(x−d))), to calculate the weight value, wherein x is the value read from the phonetic confusion table as the input of the S function, the b parameter controls the minimum value of the S function, the r parameter controls the range of the S function, the s parameter controls the variation rate of the S function, and the d parameter controls the position of the turning point of the S function. The path weight of the phonetic text can be obtained by the above method.
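Assuming the generalized logistic form given above, the weight calculation can be sketched as follows; the parameter values are illustrative only.

```python
# Path weight from a confusion value via an S-shaped function.
import math

def s_weight(x: float, b: float = 0.1, r: float = 0.9,
             s: float = 5.0, d: float = 0.5) -> float:
    """b: minimum value, r: range, s: variation rate, d: turning point."""
    return b + r / (1.0 + math.exp(-s * (x - d)))

for confusion in (0.0, 0.5, 1.0):  # values read from the phonetic confusion table
    print(confusion, round(s_weight(confusion), 3))
```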
The present disclosure also provides a speech recognition method according to an embodiment, which comprises the following steps.
In step S181, the speech recognition system 2 is connected to the external general-purpose speech recognition system 1, and receives a speech recognition result of the general-purpose speech recognition system 1. The speech recognition system 2 and the general-purpose speech recognition system 1 are as described above, and further description thereof is hereby omitted.
In step S182, speech is received. When a user inputs speech signals, the speech message is received immediately. The input unit 202 receives the speech signals and provides the speech signals to, or stores them in, the storage unit 203. The specific application speech recognition module 21 receives the speech signals from the storage unit 203 and converts them into a first phonetic text. The general-purpose speech recognition system 1 also receives the same speech signals, and converts the speech signals into a written text. The distribution module 24 stored in the storage unit 203 can also receive the speech signals received by the input unit 202, and distribute the speech signals to the general-purpose speech recognition system 1 and the specific application speech recognition module 21.
In step S183, the phonetic texts are aligned. The comparison module 22 of the speech recognition system 2 converts the written text from the general-purpose speech recognition system 1 into a second phonetic text. The comparison module 22 aligns the second phonetic text with the first phonetic text based on similarity of pronunciation, to form a phonetic text alignment result.
In step S184, the output recognized text is formed. The enhancement module 23 of the speech recognition system 2 receives the phonetic text alignment result from the comparison module 22, distributes path weights so that the phonetic text alignment result comprises path weights, and constitutes the phonetic text alignment result having the path weights with the written text and the first phonetic text, to form an enhanced recognized text.
In an embodiment, the phonetic text alignment of step S183 comprises the following steps.
In step S191, the phonetic text converting unit 221 segments the written text with the segmentation algorithm 2211. When segmenting the written text, the segmentation algorithm 2211 first reads the pronunciation dictionary 2212, and segments the written text by referring to the pronunciation dictionary. The phonetic text converting unit 221 can also refer to an external pronunciation dictionary when segmenting the written text and looking up the pronunciations thereof.
In step S192, the pronunciation dictionary is read, the segmented written text is converted into the corresponding phonetic text, and a second phonetic text is thus formed based on the segmented written text and the corresponding pronunciation dictionary.
In step S193, the phonetic text aligning unit 222 converts the phonetic text representations that do not contain segmentation information. After the second phonetic text and the first phonetic text are received, the phonetic texts that do not contain segmentation information are converted to form the segmented second phonetic text and the segmented first phonetic text. The phonetic text aligning unit 222 can apply dynamic programming to the second phonetic text and the first phonetic text to obtain the corresponding phonetic text paragraphs.
In step S194, a distance matrix is initialized, so as to convert the segmented second phonetic text and first phonetic text into a distance matrix.

In step S195, the cost of an alignment path is calculated based on similarity of pronunciation. The alignment path can be calculated with respect to the distance matrix formed from the second phonetic text and the first phonetic text, for example by the shortest-path method.

In step S196, the alignment path is searched. After the calculation of the alignment path costs, the alignment path is searched to form the alignment result. The alignment result can be represented by a graph (e.g., a lattice graph or a sausage graph).
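Steps S194 to S196 can be illustrated by the following sketch, which initializes a distance matrix, fills it with a simple 0/1 substitution cost standing in for the pronunciation-similarity cost, and searches the alignment path by backtracking; a real implementation would derive the costs from the phonetic confusion table.

```python
# Dynamic-programming alignment of two phonetic texts (stand-in costs).
def align_phonetic(second: list[str], first: list[str]):
    n, m = len(second), len(first)
    INF = float("inf")
    dist = [[INF] * (m + 1) for _ in range(n + 1)]  # S194: distance matrix
    dist[0][0] = 0.0
    for i in range(n + 1):                          # S195: path cost calculation
        for j in range(m + 1):
            if i < n and j < m:                     # substitution (0 if same phone)
                cost = 0.0 if second[i] == first[j] else 1.0
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], dist[i][j] + cost)
            if i < n:                               # deletion
                dist[i + 1][j] = min(dist[i + 1][j], dist[i][j] + 1.0)
            if j < m:                               # insertion
                dist[i][j + 1] = min(dist[i][j + 1], dist[i][j] + 1.0)
    path, i, j = [], n, m                           # S196: search the aligned path
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1]
                + (0.0 if second[i - 1] == first[j - 1] else 1.0)):
            path.append((second[i - 1], first[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1.0:
            path.append((second[i - 1], None))
            i -= 1
        else:
            path.append((None, first[j - 1]))
            j -= 1
    return dist[n][m], list(reversed(path))

print(align_phonetic(["Zhe", "Shi", "Wen", "Zi"], ["Zhe", "Si", "Wen", "Zi"]))
```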
In an embodiment, the formation of the output recognized text in step S184 comprises the following steps.
In step S201, path weights are distributed based on a confusion degree. The path weighting unit 231 receives the phonetic text alignment result, reads the phonetic confusion table, and distributes the path weights of the phonetic text alignment result based on the confusion degree.
In step S202, the confusion phone paths of the phonetic text are expanded in a parallel manner based on the confidence values of the phonetic text. The confusion phone path extending unit 232 receives the phonetic text alignment result after the path weight distribution and reads the phonetic confusion table. Each of the second phonetic text and the first phonetic text has a confidence value. When the confidence value is lower than a threshold value, the confusion phone path extending unit expands the similar phones for the phonetic text in a parallel manner, and the weight of each similar phone refers to the distributed weight of the path weight distribution.
In step S203, the phonetic text is converted into specific application terms. The word constitution unit 233 reads the specific application phonetic-vocabulary mapping table, and converts the phonetic text alignment result and the first phonetic text into the specific application phonetic vocabularies.
In step S204, words are merged. The specific application phonetic vocabularies converted from the phonetic text alignment result and the first phonetic text are merged with the written text output by the general-purpose speech recognition system 1, to form the enhanced recognition result.
The present disclosure further provides a computer program product for speech recognition. When a computer is loaded with and executes the computer program, the above-described speech recognition method is completed.
In sum, the speech recognition system and the speech recognition method according to the present disclosure can assist a general-purpose speech recognition system and further improve the recognition performance in specific application scenarios.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments. It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims and their equivalents.
| Number | Date | Country | Kind |
|---|---|---|---|
| 107141382 | Nov 2018 | TW | national |
| Number | Date | Country |
|---|---|---|
| 20200160850 A1 | May 2020 | US |