This application relates to voice recognition in general, and to a phoneme sound based controller, in particular.
There are many applications for technology that recognizes the spoken word, and voice activated devices are expected to be the next big disruptor in consumer technology. One specific area that may be in need of improvement, however, is voice control, for example in noisy environments and/or in the presence of false positives. There are reports of voice activated speakers and smart devices performing unexpected actions because they heard sound that was incorrectly interpreted as a voice control command. In some circumstances, traditional voice controls can mistakenly hear control phrases. One known solution may be to change the control phrase to another control phrase that is less likely to produce false positives, and/or to disable the action that was originally to be voice controlled. Yet another solution may be to change the response from simply executing the command to asking the user to confirm the command before executing it. For at least these reasons, there may be a need for improvements in voice activated devices generally, and in voice control specifically.
According to one aspect of the present application, there is provided a phoneme sound based controller. A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a phoneme sound based controller apparatus, the apparatus including: a sound input for receiving a sound signal; a phoneme sound detection module connected to the sound input to determine if at least one phoneme is detected in the sound signal; a dictionary containing at least one word, the word including at least one syllable, the syllable including the at least one phoneme; and a grammar containing at least one rule, the at least one rule containing the at least one word, the at least one rule further containing at least one control action. In the phoneme sound based controller apparatus, the at least one control action is taken if the at least one phoneme is detected in the sound input signal by the phoneme sound detection module. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Implementations may include one or more of the following features. The apparatus further including a detection output for providing a signal representing the determination by the phoneme sound detection module. The apparatus further including a speech recognition engine connected to the sound input, the speech recognition engine providing a speech recognition context including the at least one word if the speech recognition engine recognizes the presence of the at least one word in the sound input. The apparatus further including a result output, the result output including the at least one word if the detection output indicates that the at least one phoneme is detected in the input signal and the at least one word is recognized in the sound input. The apparatus further including a result output, the result output including the at least one word if the detection output indicates that the at least one phoneme is detected in the input signal. The apparatus where the phoneme sound detection module includes at least one phoneme sound attribute detection module to detect the presence of a predetermined phoneme sound attribute of the at least one phoneme in the sound signal. The apparatus where the at least one phoneme sound attribute includes a frequency signature corresponding to the at least one phoneme. The apparatus where the frequency signature includes an impulse frequency phoneme sound attribute. The apparatus where the frequency signature includes a wideband frequency phoneme sound attribute. The apparatus where the frequency signature includes a narrowband frequency phoneme sound attribute. The apparatus where the at least one phoneme sound attribute includes at least one sound amplitude corresponding to the at least one phoneme. The apparatus where the at least one phoneme sound attribute includes at least one sound phase corresponding to the at least one phoneme. The apparatus further including at least one calibration profile including at least one phoneme attribute threshold value relative to which the at least one phoneme sound attribute detection module detects the presence of the predetermined phoneme sound attribute of the at least one phoneme in the sound signal. The apparatus where the at least one phoneme sound attribute detection module determines that the predetermined phoneme sound attribute is greater than the at least one phoneme attribute threshold value. The apparatus where the at least one phoneme sound attribute detection module determines that the predetermined phoneme sound attribute is less than the at least one phoneme attribute threshold value. The apparatus where the at least one phoneme sound attribute detection module determines that the predetermined phoneme sound attribute is within a predetermined range relative to the at least one phoneme attribute threshold value. The apparatus where the phoneme sound detection module is a composite phoneme sound detection module including at least two phoneme sound detection modules. The apparatus where the phoneme sound detection module is a monolithic phoneme sound detection module. The apparatus where the sound input includes at least one sound file. The apparatus where the sound input includes at least one microphone. The apparatus where the at least one phoneme includes a consonant sound phoneme.
The apparatus where the at least one phoneme includes a vowel sound phoneme. The apparatus where the at least one phoneme includes a consonant digraph sound phoneme. The apparatus where the at least one phoneme includes a short vowel sound phoneme. The apparatus where the at least one phoneme includes a long vowel sound phoneme. The apparatus where the at least one phoneme includes an other vowel sound phoneme. The apparatus where the at least one phoneme includes a diphthong vowel sound phoneme. The apparatus where the at least one phoneme includes an r-influenced vowel sound phoneme. The apparatus where the dictionary includes at least one word selected from the following group of words: fast, slow, start, or stop. The apparatus where the at least one phoneme includes the /s/ phoneme. The apparatus where the at least one control action includes an action to affect the speed of a metronome. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
Other aspects and features of the present application will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of a phoneme sound based controller in conjunction with the accompanying drawing figures.
Embodiments of the present application will now be described, by way of example only, with reference to the accompanying drawing figures, wherein:
Like reference numerals are used in different figures to denote similar elements.
Referring to the drawings, reference is now made to the figure illustrating an Application Specific Machine 100.
Where Application Specific Machine 100 is enabled for two-way communication, it will incorporate communication subsystem 140, including both a receiver 146 and a transmitter 144, as well as associated components such as one or more, preferably embedded or internal, antenna elements (not shown) if wireless communications are desired, and a processing module such as a digital signal processor (DSP) 142. As will be apparent to those skilled in the field of communications, the particular design of the communication subsystem 140 will be dependent upon the communications medium 180 in which the machine is intended to operate. For example, Application Specific Machine 100 may include communication subsystems 140 designed to operate within an 802.11 network, a Bluetooth™ network or an LTE network, these networks being examples of a communications medium 180 including location services, such as GPS.
Communications subsystem 140 not only provides for communications over communications medium 180, but also for application specific communications 147. An application specific processor 117 may be provided, for example to process application specific data, instructions and signals, such as for GPS, near field, or other application specific functions such as digital sound processing. Depending on the application, the application specific processor 117 may be provided by the DSP 142, by the communications subsystem 140, or by the processor 110, instead of by a separate unit.
Network access requirements will also vary depending upon the type of communications medium 180. For example, in some networks, Application Specific Machine 100 is registered on the network using a unique identification number associated with each machine. In other networks, however, network access is associated with a subscriber or user of Application Specific Machine 100. Some Application Specific Machines 100 therefore require other subsystems 127 in order to support communications subsystem 140, and some Application Specific Machines 100 further require application specific subsystems 127. Local or non-network communication functions, as well as some functions (if any) such as configuration, may be available, but Application Specific Machine 100 will be unable to carry out any other functions involving communications over the communications medium 180 unless it is provisioned. In the case of LTE, a SIM interface is normally provided and is similar to a card slot into which a SIM card can be inserted and ejected, much like a persistent memory card such as an SD card. More generally, persistent memory 120 can hold many key application specific persistent memory data or instructions 127, as well as other instructions 122 and data structures 125 such as identification and subscriber related information. Although not expressly shown in the drawing, such instructions 122 and data structures 125 may be arranged in a class hierarchy so as to benefit from re-use, whereby some instructions and data are at the class level of the hierarchy, and some instructions and data are at an object instance level of the hierarchy, as would be known to a person of ordinary skill in the art of object oriented programming and design.
When required network registration or activation procedures have been completed, Application Specific Machine 100 may send and receive communication signals over the communications medium 180. Signals received by receiver 146 through communications medium 180 may be subject to such common receiver functions as signal amplification, frequency down conversion, filtering, channel selection and the like, as well as analog to digital (A/D) conversion. A/D conversion of a received signal allows more complex communication functions such as demodulation and decoding to be performed in the DSP 142. In a similar manner, signals to be transmitted are processed, including modulation and encoding for example, by DSP 142 and input to transmitter 144 for digital to analog conversion, frequency up conversion, filtering, amplification and transmission over the communication medium 180. DSP 142 not only processes communication signals, but also provides for receiver and transmitter control. For example, the gains applied to communication signals in receiver 146 and transmitter 144 may be adaptively controlled through automatic gain control algorithms implemented in DSP 142.
In the example system shown in the drawing, communications medium 180 may further serve to communicate with multiple systems, including an other machine 190 and an application specific other machine 197, such as a server (not shown), GPS satellite (not shown) and other elements (not shown). For example, communications medium 180 may communicate with both cloud based systems and web client based systems in order to accommodate various communications with various service levels. Other machine 190 and application specific other machine 197 can each be provided by another embodiment of Application Specific Machine 100, wherein the application specific portions are configured to be specific to the application at the other machine 190 or at the application specific other machine 197, as would be apparent to a person having ordinary skill in the art to which the other machine 190 and the application specific other machine 197 pertain.
Application Specific Machine 100 preferably includes a processor 110 which controls the overall operation of the machine. Communication functions, including at least data communications and, where present, application specific communications 147, are performed through communication subsystem 140. Processor 110 also interacts with further machine subsystems such as the machine-human interface 160, including for example display 162, digitizer/buttons 164 (e.g. a keyboard that can be provided with display 162 as a touch screen), speaker 165, microphone 166 and application specific HMI 167. Processor 110 also interacts with the machine-machine interface 150, including for example auxiliary I/O 152, serial port 155 (such as a USB port, not shown), and application specific MMI 157. Processor 110 also interacts with persistent memory 120 (such as flash memory) and volatile memory 130 (such as random access memory (RAM)). A short-range communications subsystem (not shown), and any other machine subsystems generally designated as other subsystems 170, may be provided, including an application specific subsystem 127. In some embodiments, an application specific processor 117 is provided in order to process application specific data or instructions 127, 137, to communicate application specific communications 147, or to make use of application specific subsystems 127.
Some of the subsystems shown in
Operating system software used by the processor 110 is preferably stored in a persistent store such as persistent memory 120 (for example flash memory), which may instead be a read-only memory (ROM) or similar storage element (not shown). Those skilled in the art will appreciate that the operating system instructions 132 and data 135, application specific data or instructions 137, or parts thereof, may be temporarily loaded into volatile memory 130 (such as RAM). Received or transmitted communication signals may also be stored in volatile memory 130 or persistent memory 120. Further, one or more unique identifiers (not shown) are also preferably stored in read-only memory, such as persistent memory 120.
As shown, persistent memory 120 can be segregated into different areas for both computer instructions 122 and application specific PM instructions 127, as well as program data storage 125 and application specific PM data 127. These different storage types indicate that each program can allocate a portion of persistent memory 120 for its own data storage requirements. Processor 110, and when present application specific processor 117, in addition to operating system functions, preferably enable execution of software applications on the Application Specific Machine 100. A predetermined set of applications that control basic operations, including at least data communication applications for example, will normally be installed on Application Specific Machine 100 during manufacturing. A preferred software application may be a specific application embodying aspects of the present application. Naturally, one or more memory stores would be available on the Application Specific Machine 100 to facilitate storage of application specific data items. Such a specific application would preferably have the ability to send and receive data items via the communications medium 180. In a preferred embodiment, the application specific data items are seamlessly integrated, synchronized and updated, via the communications medium 180, with the machine 100 user's corresponding data items stored or associated with an other machine 190 or an application specific other machine 197. Further applications may also be loaded onto the Application Specific Machine 100 through the communications subsystem 140, the machine-machine interface 150, or any other suitable subsystem 170, and installed by a user in the volatile memory 130 or preferably in the persistent memory 120 for execution by the processor 110. Such flexibility in application installation increases the functionality of the machine and may provide enhanced on-machine functions, communication-related functions, or both. For example, secure communication applications may enable electronic commerce functions and other such financial transactions to be performed using the Application Specific Machine 100.
In a data communication mode, a received signal such as a text message or web page download will be processed by the communication subsystem 140 and input to the processor 110, which preferably further processes the received signal for output to the machine-human interface 160, or alternatively to a machine-machine interface 150. A user of Application Specific Machine 100 may also compose data items, such as messages for example, using the machine-human interface 160, which preferably includes digitizer/buttons 164 that may be provided on a touch screen, in conjunction with the display 162 and possibly a machine-machine interface 150. Such composed data items may then be transmitted over a communication network through the communication subsystem 140. Although not expressly shown, a camera can be used both as a machine-machine interface 150, by capturing coded images such as QR codes and barcodes or by reading and recognizing images by machine vision, and as a machine-human interface 160 for capturing a picture of a scene or a user.
For audio/video communications, overall operation of Application Specific Machine 100 is similar, except that received signals would preferably be output to speaker 165 and display 162, and signals for transmission would be generated by microphone 166 and a camera (not shown). Alternative voice or audio I/O subsystems, such as a voice message recording subsystem, may also be implemented on Application Specific Machine 100. Although voice or audio signal output is preferably accomplished primarily through the speaker 165, display 162 and application specific HMI 167 may also be used to provide other related information.
Serial port 155 in
Communications subsystem 140 may include a short-range communications subsystem (not shown) as a further optional component, which may provide for communication between Application Specific Machine 100 and different systems or machines, which need not necessarily be similar machines. For example, the other subsystems 170 may include low energy, near field, or other short-range communication circuits and components, or a Bluetooth™ communication module, to provide for communication with similarly enabled systems and machines.
The exemplary machine of
Each component in
Having described the environment in which the specific techniques of the present application can operate, application specific aspects will be further described by way of example only.
Words are the smallest meaningful units of a language, and are made of syllables. Each syllable, in turn, includes only one vowel phoneme. Words are therefore clusters of syllables, each syllable including at least one vowel phoneme, and possibly one or more consonant phonemes. For example, the words start, stop, fast and slow have only one syllable each because they each have only one vowel phoneme. They each also have at least one consonant phoneme, specifically the /s/ phoneme. The fact that all of these words have at least one phoneme in common can be used advantageously to help differentiate between the voice recognition of each of these words in non-ideal (e.g. noisy) environments, as will be explained in greater detail below. Phonemes are units of sound used by a language speaking community, and their pronunciation varies from community to community, and even between individuals within a community. Such variations can be mitigated through calibration of the specific phonemes that are relevant to the words for a given application, as will also be explained in greater detail below.
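By way of a non-limiting illustration only, and not as part of the described embodiments, the relationship between words, syllables, phonemes and control actions described above could be captured in simple data structures. The Python sketch below uses letters as a crude stand-in for phonemes, and the action names are hypothetical:

```python
# Illustrative sketch only: letters stand in for phonemes, and the action
# names are hypothetical. Each command word here has a single syllable,
# represented directly as its list of (approximate) phonemes.
DICTIONARY = {
    "start": ["s", "t", "a", "r", "t"],
    "stop":  ["s", "t", "o", "p"],
    "fast":  ["f", "a", "s", "t"],
    "slow":  ["s", "l", "o", "w"],
}

# Grammar rules: each word maps to a control action for, e.g., a metronome.
GRAMMAR = {
    "start": "start_metronome",
    "stop":  "stop_metronome",
    "fast":  "increase_tempo",
    "slow":  "decrease_tempo",
}

# All four command words share the /s/ phoneme, which is what the phoneme
# sound detection exploits to reject false positives in noisy environments.
shared = set.intersection(*(set(p) for p in DICTIONARY.values()))
print(shared)  # prints {'s'}
```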
The Specific Application 704 provides the end goal that is to be realized by the framework, such as, for example, a voice controlled metronome, as will be described later. The Speech Recognition Engine 620 uses the Application Specific Grammar 726 abstractly and implements the necessary calls to a Platform Specific API 724, such as for example the Speech Recognition Engine in Microsoft Windows™, Android™, iOS™, or the like. The Phonemes 712 and their optional Calibration 728 are used by the Frequency Bandwidth Detection 610 in order to detect a pulse corresponding to the Phonemes 712. Syllables 718 relate Phonemes 712 to Words 722 and the Application Specific Grammar 726. Thus, even though Speech Recognition Engine 620 recognizes one of the Words 722, this is not considered a valid recognition by the controller unless the Frequency Bandwidth Detection 610 detects an impulse corresponding to at least one of the Phonemes 712 that is related to said one of the Words 722, thereby advantageously avoiding false positives in, e.g., a noisy environment, such as for example during a music session for a metronome Specific Application 704. Operationally, Sound Capture 708 captures Sound Data 714 that is used by DSP Calculator 710 (which uses the DSP 142 or Application Specific Processor 117, for example) and FFT 716 to detect an impulse corresponding to the Phonemes 712. The Platform Specific Wrapper 720, similarly to the Platform Specific API 724, ensures that the Generic Application 700, and the Specific Application 704, can be easily ported to different platforms. A preferred way of achieving this is to realize these Platform Specific elements using Wrapper classes that abstract away the platform specific dependencies.
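Although the application does not prescribe a particular implementation, the gating behaviour described above, in which a word recognized by the Speech Recognition Engine 620 is only accepted when the Frequency Bandwidth Detection 610 also detects a corresponding phoneme, could be sketched roughly as follows. The sketch assumes Python with NumPy, a mono audio frame at a known sample rate, and an illustrative 4 kHz band edge and threshold ratio standing in for values that would in practice come from the Calibration 728; the function names are hypothetical:

```python
import numpy as np

def detect_s_phoneme(frame, sample_rate, threshold_ratio=0.6):
    """Crude frequency-signature test for the /s/ phoneme: checks whether most
    of the spectral energy in the frame lies in a high-frequency band. The
    4 kHz band edge and threshold_ratio are illustrative values that would,
    in practice, be taken from a calibration profile."""
    windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    high_energy = spectrum[freqs >= 4000.0].sum()
    total_energy = spectrum.sum() + 1e-12   # avoid division by zero on silence
    return (high_energy / total_energy) >= threshold_ratio

def validated_result(recognized_word, frame, sample_rate, dictionary):
    """Gate the speech recognition result: accept the recognized word only if
    it contains /s/ and the /s/ frequency signature is also detected."""
    if recognized_word not in dictionary:
        return None
    if "s" in dictionary[recognized_word] and detect_s_phoneme(frame, sample_rate):
        return recognized_word
    return None
```

In practice, the band edge and threshold would be calibrated per user or per environment rather than hard-coded, which is the role of the Calibration 728 described above.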
For example, consider the commands Start, Stop, Fast and Slow (in the metronome example the speed commands can be optional, as long as speeds are limited to the values shown and the initial speed has a default starting value). “Slow” is provided as a result if the /s/ Phoneme is detected and the /t/, /p/ and /f/ Phonemes are determined to be absent by the Phoneme Sound Detection 610, as follows. If an /s/ Phoneme is detected, then the possible results include Start, Stop, Fast or Slow. If an /s/ and an /f/ phoneme are detected, then the result is Fast. If an /s/ and a /p/ phoneme are detected, then the result is Stop. If an /s/ and a /t/ phoneme are detected, and neither a /p/ nor an /f/ phoneme is detected, then the result is Start. If the /s/ Phoneme is detected and none of the /t/, /p/ and /f/ phonemes are detected, then the result is Slow. The table below illustrates this Boolean logic:
Word | /s/ detected | /t/ detected | /p/ detected | /f/ detected
---|---|---|---|---
Slow | yes | no | no | no
Start | yes | yes | no | no
Fast | yes | yes | no | yes
Stop | yes | yes | yes | no
As this example shows, by combining the right phoneme signature detection blocks and Boolean logic in the Phoneme Sound Detection block, the need for the speech recognition engine 1550 can be eliminated. If, however, the speech recognition engine 1550 is used, that is, the optional blocks 1570 are present, then an additional condition, established at step 1560, that the command word is detected can be considered at step 1530. Although not expressly shown in the drawing, sound input 1510 could be redirected from any sound source, including a microphone, sound file or sound track. At Result 1540, a corresponding action can be taken.
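A minimal sketch of this Boolean logic, assuming Python, is given below; the function name is hypothetical, and the Start branch reflects the case inferred by elimination (/s/ and /t/ detected without /p/ or /f/):

```python
def select_command(s_detected, t_detected, p_detected, f_detected):
    """Map Boolean phoneme detections for /s/, /t/, /p/ and /f/ to a command
    word, following the logic described above. Returns None when the shared
    /s/ phoneme is absent, rejecting the input outright."""
    if not s_detected:
        return None
    if f_detected:
        return "fast"   # /s/ and /f/ detected
    if p_detected:
        return "stop"   # /s/ and /p/ detected
    if t_detected:
        return "start"  # /s/ and /t/ without /p/ or /f/ (inferred case)
    return "slow"       # /s/ alone, none of /t/, /p/, /f/

# Example: speaking "stop" should yield detections of /s/, /t/ and /p/.
assert select_command(True, True, True, False) == "stop"
assert select_command(True, False, False, False) == "slow"
```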
Although not expressly shown in the drawings, in alternative embodiments the speech recognition engine uses an ASR (Automatic Speech Recognition) system that relies on ML (machine learning) to improve its accuracy, by adapting the ASR with the following steps: (1) providing a welcome message to the user, to explain that their recordings will be used to improve the ASR's acoustic model; (2) providing a confirmation button, check box or the like to enable the user to give their consent; (3) looking up the next speech occurrence that has not been captured yet and presenting it to the user; (4) recording as the occurrence is being spoken by the user; (5) automatically sending the audio data to a predetermined directory; (6) enabling a person to review the audio data manually before including it in the ASR's ML mechanism; and (7) marking the recording for this occurrence for this user as processed.
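As a rough illustration only, and not a description of any particular platform's API, the opt-in adaptation workflow enumerated above might be sketched as follows; the recording and storage callbacks are hypothetical placeholders:

```python
def collect_asr_training_sample(prompts, record_audio, save_for_review):
    """Sketch of the opt-in recording loop enumerated above. The record_audio
    and save_for_review callbacks are hypothetical placeholders."""
    # (1) Welcome message explaining how recordings will be used.
    print("Your recordings will be used to improve the recognizer's acoustic model.")
    # (2) Obtain explicit consent before capturing anything.
    if input("Do you consent? [y/N] ").strip().lower() != "y":
        return
    # (3) Look up the next speech occurrence not yet captured.
    for prompt in prompts:
        if prompt.get("captured"):
            continue
        print(f"Please say: {prompt['text']}")
        # (4) Record while the user speaks the occurrence.
        audio = record_audio()
        # (5)-(6) Store the audio where a person can review it manually
        # before it is added to the ASR's machine learning data.
        save_for_review(audio, prompt)
        # (7) Mark this occurrence as processed for this user.
        prompt["captured"] = True
        break
```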
The embodiments described herein are examples of structures, systems or methods having elements corresponding to elements of the techniques of this application. This written description may enable those skilled in the art to make and use embodiments having alternative elements that likewise correspond to the elements of the techniques of this application. The intended scope of the techniques of this application thus includes other structures, systems or methods that do not differ from the techniques of this application as described herein, and further includes other structures, systems or methods with insubstantial differences from the techniques of this application as described herein. Those of skill in the art may effect alterations, modifications and variations to the particular embodiments without departing from the scope of the application, which is set forth in the claims.
This application is related to, and claims the benefit of priority from, U.S. Patent Application No. 62/910,313, filed on Oct. 3, 2019, entitled “PHONEME SOUND BASED CONTROLLER”, by Frédéric Borgeat.