This application claims priority to Taiwan Application Serial Number 106138180, filed Nov. 3, 2017, the entirety of which is herein incorporated by reference.
The present application relates to a voice controlling method and a system thereof. More particularly, the present application relates to a voice controlling method and system thereof for recognizing a specific term.
Recently, speech recognition technology has matured (e.g., Siri or Google speech recognition), and users increasingly operate electronic devices such as mobile devices or personal computers through voice input or voice control functions. However, Chinese contains many homophones and special terms, such as personal names, place names, company names or abbreviations, so that a speech recognition system may fail to recognize such words accurately, or may recognize the words but not their intended meaning.
In current speech recognition methods, the voice recognition system establishes the user's voiceprint information and lexical database in advance, but this limits the system to a particular user. Moreover, if many contacts have similar pronunciations, the speech recognition system may recognize the wrong one. The user therefore still needs to correct the recognized words, which affects not only the accuracy of the speech recognition system but also the user's operational convenience. How to solve the problem of inaccurate recognition of specific vocabularies by speech recognition systems is thus one of the problems to be improved in the art.
An aspect of the disclosure is to provide a voice controlling method which is suitable for an electronic apparatus. The voice controlling method includes: inputting a voice and recognizing the voice to generate a sentence sample; generating at least one command keyword and at least one object keyword based on the sentence sample to perform a common sentence training; performing encoding conversion according to an initial, a vowel, and a tone of the at least one object keyword, generating a vocabulary coding set; utilizing the vocabulary coding set and an encoding database to perform a phonetic score calculation to generate a phonetic score and comparing the phonetic score and a threshold to generate at least one target vocabulary sample; comparing the at least one target vocabulary sample and a target vocabulary relation model to generate at least one audience information; and executing an operation corresponding to the at least one command keyword for the at least one audience information.
Another aspect of the disclosure is to provide a voice controlling system. In accordance with one embodiment of the present disclosure, the voice controlling system includes: a sentence training module, an encoding module, a score module, a vocabulary sample comparison module and an operation execution module. The sentence training module is configured for performing a common sentence training according to a sentence sample, generating at least one command keyword and at least one object keyword. The encoding module is coupled with the sentence training module and configured for performing encoding conversion according to an initial, a vowel, and a tone of the at least one object keyword, generating a vocabulary coding set. The score module is coupled with the encoding module and configured for utilizing the vocabulary coding set and an encoding database to perform a phonetic score calculation to generate a phonetic score and comparing the phonetic score and a threshold to generate at least one target vocabulary sample. The vocabulary sample comparison module is coupled with the score module and configured for comparing the at least one target vocabulary sample and a target vocabulary relation model to generate at least one audience information. The operation execution module is coupled with the vocabulary sample comparison module and configured for executing an operation corresponding to the at least one command keyword for the at least one audience information.
Based on the aforesaid embodiments, the voice controlling method and system thereof are capable of improving the inaccurate recognition of specific vocabularies by a speech recognition system. The method mainly utilizes a deep neural network algorithm to find the keywords of the input sentence, and then analyzes the relationship between the initials, vowels and tones of the keywords. It is capable of recognizing specific vocabularies without pre-establishing the user's voiceprint information and lexical database. The disclosure overcomes the limitation that a speech recognition system cannot identify words properly due to different accents.
Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
As used herein, the term “initial” (also referred to as an “onset” or a “medial”) refers to an initial part of a syllable in the Chinese phonology. Generally, an initial may be a consonant.
As used herein, the term “vowel” refers to the remaining part of a syllable in the Chinese phonology by removing the initial of the syllable.
As used herein, the term “term” may be formed by one or more characters, and the term “character” may be formed by one or more symbols.
As used herein, the term “symbol” refers to a numeral symbol (e.g., “0”, “1”, “2”, “3”, “4” . . . ), an alphabetical symbol (e.g., “a”, “b”, “c”, . . . ) or any other symbol that is used in a phonetic system.
Reference is made to
In the embodiment, the processing unit 110 can be implemented by a micro controller, a microprocessor, a digital signal processor, an application specific integrated circuit (ASIC), a logical circuitry or any equivalent circuits of the processing unit 110. The voice inputting unit 120 can be implemented by a microphone. The voice outputting unit 130 can be implemented by a speaker. The display unit 140 can be implemented by an LED display. The voice inputting unit 120, the voice outputting unit 130 and the display unit 140 can also be implemented by any equivalent circuits. The memory unit 150 can be implemented by a memory, a hard disk, a flash drive, a memory card, etc. The transmitting unit 160 can be implemented by global system for mobile communication (GSM), personal handy-phone system (PHS), long term evolution (LTE), worldwide interoperability for microwave access (WiMAX), wireless fidelity (Wi-Fi), or Bluetooth, etc. The power supply unit 170 can be implemented by a battery or any equivalent circuits of the power supply unit 170.
Reference is also made to
Reference is also made to
To be convenient for explanation, reference is made to
Afterward, the voice controlling method 300 executes step S320 to generate at least one command keyword and at least one object keyword based on the sentence sample to perform the common sentence training. The common sentence training performs word segmentation on the input voice, generates a common sentence training set according to the intention words and keywords, and utilizes a deep neural network (DNN) to generate a DNN sentence model. The DNN sentence model is able to interpret the input voice into the command keywords and the object keywords. The voice controlling method in this disclosure analyzes and processes the object keywords.
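The role of the DNN sentence model described above, mapping a segmented input sentence to command keywords and object keywords, can be sketched as follows. A toy lexicon lookup stands in for the trained network here, and the small lexicons are illustrative assumptions only, not part of the disclosure:

```python
# Toy stand-in for the DNN sentence model: after word segmentation, each
# token is labelled as a command keyword, an object keyword, or neither.
# A real implementation would use a trained deep neural network; the small
# lexicons below are illustrative assumptions only.
COMMAND_LEXICON = {"call", "email", "find"}
OBJECT_LEXICON = {"wang xiao-ming", "management department"}

def interpret(tokens):
    """Return (command_keywords, object_keywords) for segmented tokens."""
    commands, objects = [], []
    for tok in tokens:
        if tok in COMMAND_LEXICON:
            commands.append(tok)
        elif tok.lower() in OBJECT_LEXICON:
            objects.append(tok)
    return commands, objects

cmds, objs = interpret(["please", "call", "Wang Xiao-ming"])
print(cmds, objs)  # ['call'] ['Wang Xiao-ming']
```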
Afterward, the voice controlling method 300 executes step S330 to perform encoding conversion according to an initial, a vowel, and a tone of the at least one object keyword, and to generate a vocabulary coding set according to the encoding-converted term. The encoding conversion is able to use different phonetic encodings, such as the Tongyong pinyin phonetic translation system, the Chinese pinyin phonetic translation system and the Romanization phonetic translation system, etc. The phonetic score calculation in the embodiment mainly uses the Chinese pinyin phonetic translation system as an exemplary demonstration, but the disclosure is not limited to this.
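As an illustration of this encoding conversion, the sketch below splits a tone-numbered Chinese-pinyin syllable such as "chen2" into its initial, vowel (final) and tone. This is a simplification under the assumption of numeric tone marks; the full encoding tables of the disclosure are not reproduced here:

```python
# Split a pinyin syllable (e.g. "chen2") into initial, vowel (final) and
# tone.  The initial list covers the standard Chinese pinyin initials;
# this is a simplified sketch, not the disclosure's full encoding table.
INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)  # match two-letter initials (zh/ch/sh) first

def decompose(syllable):
    """Return (initial, vowel, tone) for a tone-numbered pinyin syllable."""
    tone = int(syllable[-1]) if syllable[-1].isdigit() else 0
    body = syllable.rstrip("012345")
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-initial syllable, e.g. "an1"

def encode_term(term):
    """Encode a space-separated term into a vocabulary coding set entry."""
    return [decompose(s) for s in term.split()]

print(encode_term("chen2 de2 cheng2"))
# [('ch', 'en', 2), ('d', 'e', 2), ('ch', 'eng', 2)]
```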
Before executing step S340, it is necessary to generate the encoding database. Reference is also made to
Afterward, the voice controlling method 300 executes step S420 for utilizing a classifier to perform classification of relationship strength for data in the encoding database, generating the target vocabulary relation model. The disclosure utilizes support vector machines (SVM) to classify the data in the encoding database. Firstly, the data in the encoding database is transformed into eigenvectors to build the SVM. The SVM is configured to map the eigenvectors into high-dimensional feature planes to create an optimal hyperplane. An SVM is mainly applicable to two-class tasks, but multiple SVMs are able to be combined to solve multi-class tasks. Reference is also made to
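The multi-class combination of SVMs described above can be obtained with an off-the-shelf implementation. The sketch below uses scikit-learn's `SVC` in one-vs-rest mode; the toy two-dimensional points stand in for the eigenvectors derived from the encoding database, and the three class labels are illustrative assumptions:

```python
# Multi-class SVM sketch: in the disclosure the rows of X would be
# eigenvectors built from the encoding database; here three well-separated
# toy clusters stand in for real data.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]]
y = [0, 0, 1, 1, 2, 2]  # three illustrative relationship-strength classes

# decision_function_shape="ovr" combines binary SVMs one-vs-rest, the
# usual way two-class SVMs are extended to multi-class tasks.
clf = SVC(kernel="rbf", decision_function_shape="ovr")
clf.fit(X, y)
print(clf.predict([[0, 0.5], [10, 10.5]]))  # [0 1]
```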
Reference is also made to
Afterward, the voice controlling method 300 executes step S3411 for determining whether a symbol quantity of the initial and the vowel of the first term and a symbol quantity of the initial and the vowel of the second term are identical. If step S3411 determines that the symbol quantity of the initial and the vowel of the first term does not match the symbol quantity of the initial and the vowel of the second term, the voice controlling method 300 executes step S3412 for calculating a symbol quantity difference value between the symbol quantity of the initial and the vowel of the first term and the symbol quantity of the initial and the vowel of the second term.
If step S3411 determines that the symbol quantity of the initial and the vowel of the first term matches the symbol quantity of the initial and the vowel of the second term, the voice controlling method 300 executes step S3413 for determining whether a symbol of the initial and the vowel of the first term and a symbol of the initial and the vowel of the second term are identical or not. If step S3413 determines that the symbol of the initial and the vowel of the first term does not match the symbol of the initial and the vowel of the second term, the voice controlling method 300 executes step S3414 for calculating the difference score.
If step S3413 determines that the symbol of the initial and the vowel of the first term matches the symbol of the initial and the vowel of the second term, the voice controlling method 300 executes step S3415 for summing the symbol quantity difference value and the difference score to obtain an initial and vowel score.
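Steps S3411 to S3415 can be sketched as follows. The excerpt does not specify the concrete penalty values, so the per-symbol mismatch weight used here is an illustrative assumption:

```python
# Initial-and-vowel score following steps S3411-S3415: first compare the
# symbol quantities of the two terms' initials and vowels, then compare
# the aligned symbols themselves, and sum both contributions.  The
# mismatch penalty weight is an illustrative assumption.
def initial_vowel_score(first, second, mismatch_penalty=1):
    """first/second: strings of initial-plus-vowel symbols, e.g. "chen"."""
    # S3411/S3412: symbol quantity difference value
    quantity_diff = abs(len(first) - len(second))
    # S3413/S3414: difference score over the aligned symbols
    diff_score = sum(mismatch_penalty
                     for a, b in zip(first, second) if a != b)
    # S3415: sum the symbol quantity difference value and the difference score
    return quantity_diff + diff_score

# "chen" vs "cheng": one extra symbol, all aligned symbols match.
print(initial_vowel_score("chen", "cheng"))  # 1
```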
Reference is also made to
As shown in
As shown in
According to the tone score rule in Table 1, the rule can be applied to this embodiment as shown in
As shown in
Afterward, the voice controlling method 300 further executes the step S340 of comparing the aforesaid phonetic score and a threshold to generate at least one target vocabulary sample. The threshold can be set for different situations. For example, if the threshold is set as the maximum value of the phonetic score, the most suitable database term will be selected. In the aforesaid embodiments, the comparison result between the input term (chen2 de2 chen2) and the database term (chen2 de2 cheng2) will be selected, so the database term (chen2 de2 cheng2) can be found as the target vocabulary sample. In addition, the selection of the threshold is not limited to the maximum value of the phonetic score. It is also possible to select the terms with the first and second highest phonetic scores, or to set a value so that every database term with a phonetic score greater than that value will be taken as a target vocabulary sample.
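The threshold comparison of step S340 can be sketched as below. The maximum-value threshold policy follows the example in the text, while the database terms and their score values are illustrative assumptions:

```python
# Pick the target vocabulary samples whose phonetic score meets the
# threshold.  By default the threshold is the maximum phonetic score,
# which selects the single most suitable database term, as in the
# embodiment above; the scores here are illustrative assumptions.
def select_targets(phonetic_scores, threshold=None):
    """phonetic_scores: {database_term: score}; higher means closer."""
    if threshold is None:
        threshold = max(phonetic_scores.values())
    return [term for term, score in phonetic_scores.items()
            if score >= threshold]

scores = {"chen2 de2 cheng2": 95, "chen2 de2 sheng1": 70}
print(select_targets(scores))  # ['chen2 de2 cheng2']
```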
As shown in
Afterward, the voice controlling method 300 further executes the step S360 of executing an operation corresponding to the at least one command keyword for the at least one audience information. Reference is also made to
In other embodiments, if the voice controlling system 100 has two or more sets of keywords for identification and search, it is able to generate more accurate results. For example, the user can ask a question such as “I want to deliver a package to Wang Xiao-ming in the management department; how can I contact him?” The object keywords are “management department” and “Wang Xiao-ming”, and the voice controlling system 100 is able to find the related information of “management department” and “Wang Xiao-ming”, such as the phone number, email or department, etc.
In other embodiments, if the voice controlling system 100 merely has a single set of keywords for identification and search, it may find more than one target vocabulary sample. For example, if there is only the one set of object keywords “Wang Xiao-ming”, there may be several people named Wang Xiao-ming in different departments. In this case, it is possible to add new keywords to search again, or the voice controlling system 100 is able to list the multiple audience information of “Wang Xiao-ming” for the user to select. Of course, it is also able to utilize the most frequently used keywords to perform the further operation automatically. For example, if Wang Xiao-ming in the administration department is most often used as the object keyword, the voice controlling system 100 is able to help the user directly contact Wang Xiao-ming in the administration department according to the common list.
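The disambiguation behaviour described in the two preceding paragraphs, narrowing by several object keywords and listing all candidates when only one keyword is given, can be sketched as follows with an illustrative contact table:

```python
# Narrow audience information by matching every object keyword; when more
# than one contact remains, the caller can list them for the user to
# choose.  The contact records below are illustrative assumptions.
CONTACTS = [
    {"name": "Wang Xiao-ming", "department": "management department"},
    {"name": "Wang Xiao-ming", "department": "administration department"},
]

def find_audience(keywords):
    """Return the contacts whose fields cover all given object keywords."""
    return [c for c in CONTACTS
            if all(any(k == v for v in c.values()) for k in keywords)]

# Two keywords give a unique result; one keyword returns both candidates.
print(len(find_audience(["Wang Xiao-ming", "management department"])))  # 1
print(len(find_audience(["Wang Xiao-ming"])))  # 2
```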
Based on the aforesaid embodiments, the voice controlling method and system thereof are capable of improving the inaccurate recognition of specific vocabularies by a speech recognition system. The method mainly utilizes a deep neural network algorithm to find the keywords of the input sentence, analyzes the relationship between the initials, vowels and tones of the keywords, and then performs the operation according to the related information. It is capable of recognizing specific vocabularies without establishing the user's voiceprint information and lexical database in advance. The disclosure overcomes the limitation that a speech recognition system cannot identify words properly due to different accents.
The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
106138180 | Nov 2017 | TW | national |