Traditional communication modalities and interactive systems require user inputs including voiced speech, typing and/or selection of various system inputs during use. Many of these interactive systems use various input methods and devices, such as microphones, keyboard/mouse devices and other devices and methods for receiving inputs from users. It is often desirable for these systems to support extended use, for example over periods of multiple hours, such that users may perform tasks without interruption from the system.
The inventors have recognized and appreciated that conventional interactive systems are unable to meet the real-world needs of users. For example, it is not always practical for a user to enter text with a keyboard. Also, some existing systems accept a user's voice as input. However, voice-based systems may not always be practical when the environment is noisy (e.g., in a public place, in an office, etc.) or when privacy is a concern.
According to one aspect a wearable device is provided. The wearable device includes a plurality of sensors configured to measure signals at a face, head, or neck of a user, the signals being indicative of facial activity associated with an act performed by the user; a signal analysis component configured to analyze the signals and generate an activation signal responsive to results of the analysis of the signals; and a control component configured to activate the wearable device in response to the activation signal.
In some embodiments, the signal analysis component is configured to generate the activation signal in response to determining the user is silently speaking. In some embodiments, the signal analysis component is configured to generate the activation signal in response to: determining the user is silently speaking or determining the user has performed a particular action. In some embodiments, the particular action includes the user clenching their teeth, tapping a cheek, or an action associated with the user preparing to speak.
In some embodiments, the control component is configured to activate the wearable device by activating a communication component of the device and transmitting the signals to a connected device. In some embodiments, the control component is configured to activate the wearable device by activating a processor of the device to determine one or more words or phrases from the measured signals.
In some embodiments, the signal analysis component is configured to generate the activation signal in response to determining a signal of the recorded signals is above a respective threshold. In some embodiments, the plurality of sensors comprises: a plurality of EMG electrodes configured to record signals at a cheek of the user associated with movement of facial muscles of the user; and the signal analysis component is configured to generate the activation signal in response to determining a signal from the EMG electrodes is above a respective threshold. In some embodiments, the plurality of sensors further comprises: a microphone, and the signal analysis component is configured to generate the activation signal in response to determining the signal from the EMG electrodes is above the respective threshold and a voice level of the user in a signal recorded by the microphone is below a respective threshold voice level. In some embodiments, the plurality of sensors further comprises: an IMU configured to measure vibrations associated with voiced speech of the user, and the signal analysis component is configured to generate the activation signal in response to determining a signal from the EMG electrodes is above a respective threshold and a signal recorded by the IMU is below a respective threshold level.
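By way of illustration only, the following sketch shows one way such threshold-based activation logic might be expressed; the function names and threshold values are hypothetical and not part of any described embodiment:

```python
import numpy as np

# Illustrative threshold values only; actual values would depend on the
# sensors, gain stages, and calibration of a particular device.
EMG_THRESHOLD = 0.05     # RMS level indicating facial muscle activity
VOICE_THRESHOLD = 0.02   # RMS voice level from the microphone
IMU_THRESHOLD = 0.01     # RMS vibration level from the IMU

def rms(signal):
    return float(np.sqrt(np.mean(np.square(signal))))

def should_activate(emg, microphone, imu):
    """Generate an activation signal when the EMG level indicates facial
    muscle activity while the microphone and IMU levels indicate no voiced
    speech, i.e., the user appears to be silently speaking."""
    return (rms(emg) > EMG_THRESHOLD
            and rms(microphone) < VOICE_THRESHOLD
            and rms(imu) < IMU_THRESHOLD)
```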
In some embodiments, the signal analysis component is configured to: in response to determining the signal is above the respective threshold, perform template matching of parts of the signal to parts of a known speech signal; and in response to determining, based on the template matching, a measure of similarity between the signal and the known signal is above a threshold level of similarity, generate the activation signal. In some embodiments, the signal analysis component is further configured to: in response to determining the signal is above the respective threshold, perform template matching of parts of the signal to parts of a known action signal. In some embodiments, the signal analysis component is further configured to: in response to determining, based on the template matching, a measure of similarity between the signal and the known signal is below the threshold level of similarity and above a second level of similarity, analyze the signal using a trained machine learning model to determine whether the user is speaking; and generate the activation signal in response to determining the user is speaking.
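A minimal sketch of such a tiered analysis is shown below, assuming a stored set of known-speech templates and a trained model exposing a hypothetical `predict` method; the amplitude and similarity thresholds are illustrative only:

```python
import numpy as np

def normalized_similarity(signal, template):
    """Peak of the normalized cross-correlation between a recorded segment
    and a stored template (assumes the segment is at least as long as the
    template); values near 1.0 indicate a close match."""
    s = (signal - signal.mean()) / (signal.std() + 1e-9)
    t = (template - template.mean()) / (template.std() + 1e-9)
    corr = np.correlate(s, t, mode="valid") / len(t)
    return float(corr.max())

def analyze(signal, templates, model, amp_threshold=0.05,
            high_sim=0.8, low_sim=0.5):
    """Tiered analysis: amplitude gate, then template matching, then a
    trained model only for ambiguous cases (all thresholds illustrative)."""
    if np.sqrt(np.mean(signal ** 2)) <= amp_threshold:
        return False                        # below threshold: no activation
    best = max(normalized_similarity(signal, t) for t in templates)
    if best >= high_sim:
        return True                         # strong match to known speech
    if best >= low_sim:
        return bool(model.predict(signal))  # ambiguous: defer to the model
    return False
```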
In some embodiments, the plurality of sensors comprises: a plurality of EMG electrodes configured to record signals at a cheek of the user associated with movement of facial muscles of the user; and the signal analysis component is further configured to: determine whether the wearable device has proper placement on the user by analyzing an impedance of a subset of EMG electrodes of the plurality of EMG electrodes; and in response to determining the wearable device has proper placement on the user, analyze the signals.
In some embodiments, the signal analysis component is further configured to: in response to determining the signal is above the respective threshold, perform analog to digital conversion of the signals at a first sampling frequency to generate first digital signals; analyze the first digital signals to determine a probability of whether the user is speaking; and generate the activation signal in response to determining the probability is above a first threshold probability. In some embodiments, the signal analysis component is further configured to perform analog to digital conversion of the signals at a second sampling frequency, higher than the first sampling frequency in response to determining the probability is below the first threshold probability and above a second threshold probability.
In some embodiments, the signal analysis component is further configured to: in response to determining the signal is above the respective threshold, analyze signals from a first subset of the sensors to determine a probability of whether the user is speaking; generate the activation signal in response to determining the probability is above a first threshold probability; and in response to determining the probability is below the first threshold probability and above a second threshold probability, analyze signals from a second subset of the sensors comprising more sensors than the first subset of sensors to determine whether the user is speaking.
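The escalation from a small sensor subset to a larger one based on intermediate probabilities might be sketched as follows; the sensor names, the `classify` callable, and the probability thresholds are assumptions for illustration:

```python
def tiered_speech_check(signals, classify, p_high=0.8, p_low=0.4):
    """Analyze a small sensor subset first; escalate to the full sensor set
    only when the first result is ambiguous. `signals` maps sensor names to
    recorded data and `classify` is a hypothetical callable returning the
    probability that the user is speaking."""
    p = classify({name: signals[name] for name in ("emg_ch0", "emg_ch1")})
    if p >= p_high:
        return True                    # confident: generate activation signal
    if p >= p_low:
        # Ambiguous: re-run the analysis with every available sensor channel.
        return classify(signals) >= p_high
    return False
```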
According to another aspect, a method is provided. The method is for controlling a wearable device and includes: recording signals indicative of facial activity associated with an act performed by a user, at a face, head, or neck of the user; analyzing the recorded signals; and activating the wearable device based on the analyzing.
In some embodiments, the analyzing comprises determining a probability the user is silently speaking and the activating is performed in response to determining the probability is greater than a threshold probability. In some embodiments, the analyzing comprises determining a probability the user is silently speaking and determining a probability the user has performed a particular action and the activating is performed in response to determining the probability the user is silently speaking or the probability the user has performed a particular action is greater than a respective threshold probability.
Various aspects of at least one embodiment are discussed herein with reference to the accompanying figures, which are not intended to be drawn to scale. The figures are included to provide illustration and a further understanding of the various aspects and embodiments and are incorporated in and constitute a part of this specification but are not intended as a definition of the limits of the invention. Where technical features in the figures, detailed description or any claim are followed by reference signs, the reference signs have been included for the sole purpose of increasing the intelligibility of the figures, detailed description, and/or claims. Accordingly, neither the reference signs nor their absence is intended to have any limiting effect on the scope of any claim elements. In the figures, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every figure. In the figures:
To solve the above-described technical problems and/or other technical problems, the inventors have recognized and appreciated that silent speech or sub-vocalized speech may be particularly useful in communication and may be implemented in interactive systems. In these systems, for example, users may talk to the system via silent speech or whisper to the system in a low voice for the purposes of providing input to the system and/or controlling the system. It is important for such systems to be available to users over extended periods of time in order to allow for effective and continued use and communication.
In at least some embodiments as discussed herein, silent speech is speech in which the speaker does not vocalize their words out loud, but instead mouths the words as if they were speaking with vocalization. Such systems may enable users to enter a prompt by speech or communicate silently, without the aforementioned drawbacks associated with voice-based systems. In some embodiments, as discussed herein, silent speech may include the speaker speaking without vocalization, whispering and/or speaking softly.
Accordingly, the inventors have developed new technologies for silent speech devices that allow the devices to automatically adapt to the current actions of the wearer, enabling continued interaction with mobile devices, smart devices, communication systems and interactive systems. In some embodiments, the techniques may include a wearable device configured to recognize speech signals of a user, including electrical signals indicative of a user's facial muscle movement when the user is speaking (e.g., silently or with voice), motion signals associated with the movement of a wearer's face, vibration signals associated with voiced speech, and/or audio signals, and to change its operation in response to the signals. In some examples, the wearable device may additionally or alternatively measure a position of a user's tongue, blood flow of the user, muscle strain of the user, muscle frequencies of the user, temperatures of the user, and magnetic fields of the user, among other signals. Any such signal may be used with the technologies described herein.
The wearable device may comprise a sensor arm 110, supported by the ear hook 116. The sensor arm 110 may contain one or more sensors for recording speech signals from the user 101. The one or more sensors supported by the sensor arm may include EMG electrodes 111 configured to detect EMG signals associated with speech of the user. The EMG electrodes may be configured as an electrode array or may be configured as one or more electrode arrays supported by the sensor arm 110 of the wearable device 100.
In some examples, the wearable device may be connected to an external device which may provide inputs to the wearable device. For example, the wearable device may be connected to a smartphone, tablet, computer, laptop computer, desktop computer, a presentation device, smart watch, smart ring, another smart wearable device, among other devices. The external devices may have one or more input sensors which allow the user 101 to provide an input to the external device. The external device may then transmit a signal indicative of the input to the wearable device. The wearable device may then perform one or more actions in response to the signals indicative of the input to the external device.
In some examples, the EMG electrodes 111 may be configured as a differential amplifier, wherein the electrical signals represent a difference between a first voltage measured by a first subset of electrodes of the plurality of electrodes and a second voltage measured by a second subset of electrodes of the plurality of electrodes. Circuitry for the differential amplifier may be contained within the wearable device 100.
The sensor arm may support additional sensors 112. The additional sensors 112 may include a microphone for recording voiced or whispered speech, and an accelerometer or IMU for recording vibrations associated with speech such as glottal vibrations produced during voiced speech. In some examples the IMU may additionally or alternatively be used to measure facial movements. In some examples the wearable device includes multiple IMUs including at least one IMU configured to measure vibrations associated with speech and at least one IMU configured to measure facial movements. In some examples IMU signals may be filtered at different frequencies, depending on whether they are measuring speech vibrations or facial motion. For example, IMU filtering at a lower frequency, for example 5-50 Hz, may measure facial motion related to speech and IMU filtering at a higher frequency, for example 100 Hz or higher, may measure vibrations associated with speech. The additional sensors 112 may include sensors configured to measure a position of a user's tongue, blood flow of the user, muscle strain of the user, muscle frequencies of the user, temperatures of the user, and magnetic fields of the user, among other signals. The additional sensors 112 may include photoplethysmogram sensors, photodiodes, optical sensors, laser Doppler imaging sensors, mechanomyography sensors, sonomyography sensors, ultrasound sensors, infrared sensors, functional near-infrared spectroscopy (fNIRS) sensors, capacitive sensors, electroglottography sensors, electroencephalogram (EEG) sensors, and magnetoencephalography (MEG) sensors, among other sensors.
In some examples, the wearable device may comprise one or more sensors which detect whether the wearable device 100 is properly positioned on the user. In some examples, the wearable device may include an optical sensor which detects whether the device is properly positioned on the user, for example the optical sensor may determine placement within the ear of the user, or on the cheek of the user, among other locations. In some examples, the wearable device may determine the impedance levels of the EMG electrodes 111 and may determine if the wearable device 100 is properly positioned on the user based on the determined impedance levels. The EMG electrodes 111 may exhibit low impedance if the sensors are properly positioned on the user and high impedance if the sensors are not properly positioned. The wearable device 100 may determine the impedance level of any suitable number of EMG electrodes 111, for example, a single electrode may be checked, 2 electrodes may be checked, 3 electrodes may be checked, any subset of the electrodes may be checked, or all electrodes may be checked. The wearable device 100 may determine it is properly positioned when a threshold number of electrodes are determined to have low impedance. The threshold number of electrodes may be any suitable number of electrodes and may be based on the number of electrodes checked for impedance levels. For example, the threshold number of electrodes may be one electrode, 2 electrodes, 3 electrodes, or any subset of electrodes determined to have low impedance. In some examples, the wearable device may not record, analyze and/or process signals from sensors of the wearable device until it is determined the wearable device is properly positioned on the user 101.
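A simple sketch of this placement check, with an illustrative impedance cutoff and electrode count, is shown below (not a required implementation):

```python
LOW_IMPEDANCE_OHMS = 50_000   # illustrative cutoff for "good" skin contact

def is_properly_positioned(electrode_impedances, required_good=3):
    """Declare proper placement when at least `required_good` of the checked
    EMG electrodes show low impedance (both values are examples only)."""
    good = sum(1 for z in electrode_impedances if z < LOW_IMPEDANCE_OHMS)
    return good >= required_good
```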
In some examples, the wearable device 100 may determine whether the electrodes are properly placed by applying a known electrical current to the EMG electrodes 111. If the device is not properly placed on the user, one or more of the EMG electrodes may not be contacting the skin of the user. When the known current is applied to EMG electrodes which are not contacting the skin of the user, a voltage may develop on the inputs of the EMG electrodes, which modulates the input values and is measurable at the EMG electrode outputs. Based on the measured voltage, it may be determined that the impedance at the EMG electrode inputs is high. In some examples, analog to digital conversion of the voltage may be performed on the signal measured at the EMG electrode outputs to determine the impedance. The wearable device may then determine, based on the measured voltage and/or the determined impedance, that the wearable device 100 is not properly placed on the user.
In some examples, when the known current is applied to EMG electrodes which are properly placed on the user (contacting the skin of the user), the current may flow through the EMG electrodes to the skin of the user and back to the device via another connection to the body, for example the reference electrode 114. This path for the known current is in parallel with the EMG electrode inputs but has a much lower impedance than the EMG electrode inputs. The wearable device may determine the impedance based on the measured signals. The wearable device may therefore, based on the determined lower impedance, recognize that the EMG electrodes are properly placed on the user.
In some examples, the known current is a DC current, which generates a constant voltage. The voltage may vary based on the impedance of the current path, which may be recognized by the wearable device 100. In some examples, the known current is an AC current, which is applied at a known frequency. In some examples, the known AC current may be characterized by its peak current. The known AC current creates a voltage which may be recognized by the wearable device 100 through extraction from sensor data using frequency decomposition techniques such as Fourier or quadrature methods. When a known AC current is used, the wearable device 100 may perform EMG measurement and impedance measurement through the same sensor, because the bandwidth of the recorded signal can accommodate both the EMG content and the impedance signal at the known frequency.
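For illustration, one possible way to extract the impedance from the voltage component at the known AC drive frequency, using a Fourier-based decomposition, is sketched below; the windowing and scaling choices are assumptions, not requirements of the described device:

```python
import numpy as np

def impedance_at_drive_frequency(voltage, fs, drive_freq, drive_current_peak):
    """Estimate electrode-skin impedance magnitude by extracting the voltage
    component at the known AC drive frequency and dividing by the known peak
    drive current. Assumes the drive frequency is well separated from the
    dominant EMG content in the recorded signal."""
    window = np.hanning(len(voltage))
    spectrum = np.fft.rfft(voltage * window)
    freqs = np.fft.rfftfreq(len(voltage), d=1.0 / fs)
    bin_idx = int(np.argmin(np.abs(freqs - drive_freq)))
    # Scale the windowed FFT bin magnitude back to a peak-voltage estimate.
    v_peak = 2.0 * np.abs(spectrum[bin_idx]) / np.sum(window)
    return v_peak / drive_current_peak
```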
The ear hook 116 may additionally support one or more reference electrodes 114. The reference electrode 114 may be located on a side of the ear hook 116, facing the user 101. In some examples reference electrode 114 may be configured to bias the body of the user such that the body is in an optimal range for sensors of the system including sensors 112 and EMG electrodes 111. In some examples, the reference electrode 114 is configured to statically bias the body of the user. In some examples, the reference electrode 114 is configured to dynamically bias the body of the user.
The wearable device 100 may include a speaker 113 positioned at an end of the sensor arm. The speaker 113 is positioned at the end of the sensor arm 110 closest to the user's ear. The speaker 113 may be inserted into the user's ear to play sounds, may use bone conduction to play sounds to the user, or may play sounds aloud adjacent to the user's ear. The speaker 113 may be used to play outputs of silent speech processing or communication signals as discussed herein. In addition, the speaker 113 may be used to play one or more outputs from a connected external device, or the wearable device, such as music, audio associated with video or other audio output signals.
The wearable device 100 may include other components which are not pictured. These components may include a battery, a charging port, a data transfer port, among other components.
Wearable device 100 is an example of a wearable device which may be used in some embodiments of the technology described herein. Additional examples of wearable devices which may be used in some embodiments of the technology described herein are described in U.S. patent application Ser. No. 18/338,827, titled “WEARABLE SILENT SPEECH DEVICE, SYSTEMS, AND METHODS”, filed Jun. 21, 2023, and with attorney docket number W1123.70003US00, the entirety of which is incorporated by reference herein.
With further reference to
In some embodiments, various sensors may be positioned at the first target zone 120. For example, electrodes (e.g., 111 in
In some embodiments, a second target zone 121 is shown along the jawline of the user. The second target zone 121 may include portions of the user's face above and under the chin of the user. The second target zone 121 may include portions of the user's face under the jawline of the user. The second target zone 121 may be used to measure electrical signals associated with muscles in the face, lips, jaw and neck of the user, including the depressor labii inferioris of the user, the depressor anguli oris of the user, the mentalis of the user, the orbicularis oris of the user, the depressor septi of the user, the platysma of the user and/or the risorius of the user. Various sensors may be placed at the second target zone 121. For example, electrodes (e.g., 111 in
In some embodiments, a third target zone 122 is shown at the neck of the user. The third target zone 122 may be used to measure electrical signals associated with muscles in the neck of the user, e.g., the sternal head of sternocleidomastoid of the user, or the clavicular head of sternocleidomastoid. Various sensors may be positioned at the third target zone 122. For example, accelerometers may be supported at the third target zone to measure vibrations and movement generated by the user's glottis during speech, as well as other vibrations and motion at the neck of user 101 produced during speech.
In some embodiments, a reference zone 123 may be located behind the ear of the user at the mastoid of the user. In some embodiments, reference electrodes (e.g., 114 in
With reference to
Specific modules are shown within the external device 220 and server 230; however, these modules may be located within any of the wearable device 210, external device 220 and server 230. In some examples, the external device 220 may contain the modules of the server 230 and the wearable device 210 will communicate directly with the external device 220. In some examples, the server 230 may contain the modules of the external device 220 and the wearable device 210 will communicate directly with the server 230. In some examples, the wearable device 210 may contain the modules of both the external device 220 and the server 230 and therefore the wearable device 210 will not communicate with the external device 220 or the server 230 to determine one or more words or phrases from the signals 202 recorded by the wearable device. In some examples, some modules of the server 230 and external device 220 may be included in the server 230, external device 220 and/or the wearable device 210. Any combination of modules of the server 230 or external device 220 may be contained within the server 230, the external device 220 and/or the wearable device 210.
The wearable device 210 may include one or more sensors 211 which are used to record signals 202 from a user. The sensors 211 may include EMG electrodes for recording muscle activity associated with speech, a microphone for recording voiced and/or whispered speech, an accelerometer or IMU for recording vibrations associated with speech and other sensors for recording signals associated with speech. These other sensors may measure a position of a user's tongue, blood flow of the user, muscle strain of the user, muscle frequencies of the user, temperatures of the user, and magnetic fields of the user, among other signals, and may include: photoplethysmogram sensors, photodiodes, optical sensors, laser Doppler imaging sensors, mechanomyography sensors, sonomyography sensors, ultrasound sensors, infrared sensors, functional near-infrared spectroscopy (fNIRS) sensors, capacitive sensors, electroglottography sensors, electroencephalogram (EEG) sensors, and magnetoencephalography (MEG) sensors, among other sensors.
The sensors 211 may be supported by the wearable device to record signals 202 associated with speech, either silent or voiced, at or near the head, face and/or neck of the user 201. Once recorded, the signals may be sent to a signal processing module 212 of the wearable device 210. The signal processing module 212 may perform one or more operations on the signals including filtering, thresholding, and analog to digital conversion, among other operations.
The signal processing module 212 may then pass the signals to one or more processors 213 of the wearable device 210. The processors 213 may perform additional processing on the signals including preprocessing and digital processing. In addition, the processors may utilize one or more machine learning models 214 stored within the wearable device 210 to process the signals. The machine learning models 214 may be used to perform operations including feature extraction and downsampling, as well as other processes for recognizing one or more words or phrases from signals 202. The processors 213 may process the signals to determine if the user is speaking silently or voiced and may determine one or more words or phrases from the signals and compare these words or phrases to known commands to determine an action the user wishes to perform. In some examples, the processors may additionally process the signals to determine if the user is preparing to speak.
After processing, signals may be sent to communication module 216, which may transmit the signals to one or more external devices or systems. The communication module 216 may perform one or more operations on the processed signals to prepare the signals for transmission to one or more external devices or systems. The signals may be transmitted using one or more modalities, including but not limited to wired connection, Bluetooth, Wi-Fi, cellular network, Ant, Ant+, NFMI and SRW, among other modalities. The signals may be communicated to an external processing device and/or to a server for further processing and/or actions.
The one or more external devices or systems may be any device suitable for processing silent speech signals including smartphones, tablets, computers, purpose-built processing devices, wearable electronic devices, and cloud computing servers, among others. In some examples, the communication module 216 may transmit speech signals directly to a server 230, which is configured to process the speech signals. In some examples, the communication module 216 may transmit speech signals to an external device 220 which processes the signals directly. In some examples, the communication module 216 may transmit speech signals to an external device 220 which, in turn, transmits the signals to server 230 which is configured to process the speech signals. In some examples, the communication module 216 may transmit speech signals to an external device 220 which is configured to partially process the speech signals and transmit the processed speech signals to a server 230 which is configured to complete the processing of the speech signals. The wearable device 210 may receive one or more signals from the external device or the cloud computing system in response to any transmitted signals.
Wearable device 210 may also include control component 215, which is configured to control the other components within the wearable device 210 based at least in part on the signals 202 and the processing performed by signal processing module 212 and processors 213. For example, the control component 215 may function to select sensors 211 for recording of signals, select subsets of a particular sensor such as electrodes for recording, change one or more modes of the wearable device 210, change one or more states of the wearable device 210, activate one or more modules of the wearable device 210 in response to an activation signal or signals, and control the recording of signals, among other functions. In some examples the control component 215 may be configured to activate the signal processing module and processors in response to signals recorded by one or more of sensors 211. The control component may activate one or more device components based on signals indicative of one or more particular words, phrases or actions, which are recognized by the wearable device 210. Such signals may be recognized by the signal processing module 212, processors 213, and/or by using the machine learning models 214.
The external device may contain one or more trained ML models 221 and processors 222. The processors may be configured to recognize one or more words or phrases from the signals received from wearable device 210. The processors may then execute one or more actions on the external device based on the determined words or phrases. For example, applications and functions of the wearable device may be controlled, text inputs may be provided to the external device and communication may be supported by the external device using the signals received from the wearable device, among other actions.
In some examples, the processors 213 of the wearable device 210 may be configured to recognize one or more words or phrases from the recorded signals. The processors may use ML models 214 to recognize the one or more words or phrases. The processors of the wearable device 210 may then execute one or more actions on the wearable device or communicate actions to the external device based on the determined words or phrases. For example, applications and functions of the wearable device may be controlled, text inputs may be provided to the external device and communication may be supported by the external device using the signals received from the wearable device, among other actions. The wearable device may be powered by a battery such as battery 217.
The processors may utilize trained ML models 221 to recognize words or phrases from the signals received from the wearable device 210. The ML models may be structured in any suitable way, such as those described in U.S. patent application Ser. No. 18/338,827, titled “WEARABLE SILENT SPEECH DEVICE, SYSTEMS, AND METHODS”, filed Jun. 21, 2023, and with attorney docket number W1123.70003US00, the entirety of which is incorporated by reference herein.
The external device 220 may communicate with server 230 to perform one or more actions. The external device 220, and server 230 may be connected via a common network. The server may include cloud computing components 231 to facilitate the connection and communication and a large language model (LLM) 232. The LLM 232 may be used to process words or phrases identified from the speech signals and to determine an action to be performed based on the words or phrases.
Wearable device 300 also includes battery 314, control component 315 and communication component 316. The control component 315 may control the activation of processing components including signal processing component 320 and processors 330, as well as the activity of communication component 316.
Processing components of the wearable device 300 include signal processing components 320 and processors 330. The processing of the device discussed with relation to
Signal processing components may include a threshold detection component 321 and an analog to digital converter (ADC) 322. The threshold detection component 321 may determine if signals received from the microphone 311, EMG sensor 312 or IMU 313 are above or below a threshold signal level and may output signals indicative of which signals are above and/or below the threshold. The threshold detection component 321 may be implemented in a variety of ways, for example as a high pass filter, as a bandpass filter, as a low pass filter, as a diode, or in any other suitable way. In some examples, combinations of different implementations may be used for the threshold detection component, for example a bandpass filter may be applied to signals to remove artifacts, followed by a high pass filter or a low pass filter to determine if the signal is above or below a respective threshold. The threshold detection component 321 may be connected to control component 315. Control component 315 may determine based on the output of the threshold detection component 321 whether additional processing should occur. If the control component 315 determines that additional processing is necessary, additional processing components may be activated.
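As one illustrative software analogue of such a threshold detection stage, a band-pass filter followed by an RMS comparison might be implemented as follows; the pass band and level are example values only, not required by the described device:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def above_threshold(signal, fs, band=(20.0, 450.0), level=0.05):
    """Band-pass the raw sensor signal to remove drift and artifacts, then
    compare its RMS against a per-sensor level. Assumes fs is high enough
    for the chosen band (fs > 2 * band[1])."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, signal)
    return float(np.sqrt(np.mean(filtered ** 2))) > level
```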
The threshold detection component 321 allows the wearable device to differentiate between signals with small magnitudes associated with minor movements of the face which may occur as a result of breathing, chewing, scratching, facial expressions, and blinking, among other actions, and signals with larger magnitudes which are more likely to indicate the user 301 is speaking.
The threshold detection component 321 may function to determine whether the user is silently speaking or speaking out loud based on the signals recorded by sensors including microphone 311, EMG sensor 312 and IMU 313. For example, if the signals received from the microphone 311, EMG sensor 312 and IMU 313 are all above their respective thresholds, the threshold detection component 321 may determine the user is speaking out loud, as the microphone 311 detects the voice of the user, the EMG sensor 312 detects the muscle activity associated with the user speaking and the IMU 313 detects vibrations generated during voiced speech. In situations where only the EMG sensor 312 is above its respective threshold, the device may determine the user is silently speaking, as the microphone 311 would not detect sounds associated with voiced speech and the user would not produce vibrations which would be detected by the IMU 313.
In some examples, the threshold detection component 321 may additionally function to determine whether the user is preparing to speak, based on whether signals recorded by the EMG sensor are above a threshold level.
In some examples, the microphone 311 may include multiple individual microphones. The wearable device may perform beamforming of the signals from the individual microphones to amplify or isolate sounds produced from a location corresponding to the user's mouth. The beamforming of signals produced by multiple microphones in order to generate the output signal for microphone 311 may reduce signal noise associated with sounds other than the user's voice. In some examples, the beamforming may determine a voice level within the microphone 311 signal, indicative of the level of user speech, based on the amplified or isolated sounds from around the user's mouth. In some examples, the user's voice level may be compared to a threshold voice level, such that the threshold detection component 321 evaluates the microphone threshold based on the determined voice level of the user rather than on sounds from the environment the user is in.
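A minimal delay-and-sum beamforming sketch is shown below, assuming the per-microphone sample delays toward the user's mouth are already known; the function names and interfaces are hypothetical:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each microphone channel by its integer sample delay toward the
    mouth and average, reinforcing the user's voice relative to other noise.
    `channels` is a list of 1-D arrays; `delays` the per-channel delays."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    aligned = np.stack([ch[d:d + n] for ch, d in zip(channels, delays)])
    return aligned.mean(axis=0)

def voice_level(channels, delays):
    """RMS of the beamformed signal, used as the user's voice level."""
    beam = delay_and_sum(channels, delays)
    return float(np.sqrt(np.mean(beam ** 2)))
```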
In some examples, beamforming may not be performed on signals from the microphone 311, or the environment of the user may be too loud to completely eliminate outside noise from the beamformed signals. In such cases, the IMU 313 may be used to determine if the user is silently speaking or speaking out loud. If the microphone 311, EMG sensor 312 and IMU 313 signals are all above their respective thresholds, it may be determined the user 301 is speaking out loud because the vibrations associated with voiced speech are recorded by the IMU 313. The IMU 313 may record signals associated with glottal vibrations produced during voiced speech, as discussed herein. If the microphone 311 and EMG sensor 312 signals are above their respective thresholds and the IMU 313 signal is below the respective threshold, it may be determined that the user is silently speaking because vibrations associated with voiced speech are not being produced and the signals recorded by the microphone may be associated with a noisy environment.
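The threshold combinations described above might be mapped to a coarse speech state as follows; this is illustrative logic only, and other combinations are possible:

```python
def classify_speech_state(mic_above, emg_above, imu_above):
    """Coarse speech state from per-sensor threshold results."""
    if mic_above and emg_above and imu_above:
        return "voiced"   # voice, muscle activity and glottal vibration
    if mic_above and emg_above and not imu_above:
        return "silent"   # microphone likely picking up environmental noise
    if emg_above and not mic_above and not imu_above:
        return "silent"   # muscle activity without sound or vibration
    return "none"
```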
In some examples, the threshold detection component 321 may analyze the magnitude of the received signals and the duration of the signals to determine if the signals are above a respective threshold. For example, a signal recorded by a sensor of the wearable device 300 may be analyzed by the threshold detection component 321. The threshold detection component 321 may determine the magnitude of the signal is above a set level associated with the specific signal type and/or sensor. The threshold detection component 321 may then monitor the magnitude of the signal and determine the signal is above the set level if the magnitude of the signal remains above the level for a set amount of time. For example, if the magnitude of the signal remains above the set level for 0.5 seconds, 1 second, 2 seconds, 3 seconds, 4 seconds, 5 seconds or any other suitable amount of time, the threshold detection component 321 may determine the signal is above the set level. In some examples, the magnitude of the signal may increase above the set level, however, may drop below the level before the set amount of time has passed. In such examples, the threshold detection component 321 may determine the signal is not above the threshold. In some examples, the threshold detection component 321 may determine whether the signal is above the respective threshold based on the average value of the signal. In such examples, if the average value of the signal remains above the set value for the set amount of time, the threshold detection component may determine the signal is above the respective threshold.
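By way of example, a magnitude-and-duration check of this kind might be sketched as follows, with illustrative window and duration values:

```python
import numpy as np

def sustained_above(signal, fs, level, min_duration_s=1.0, window_s=0.1):
    """Treat the signal as above threshold only if its short-window average
    magnitude stays above `level` for at least `min_duration_s` seconds."""
    win = max(1, int(window_s * fs))
    # Moving average of the signal magnitude.
    avg = np.convolve(np.abs(signal), np.ones(win) / win, mode="valid")
    needed = int(min_duration_s * fs)
    run = 0
    for value in avg:
        run = run + 1 if value > level else 0
        if run >= needed:
            return True
    return False
```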
These additional processing components may include processor 330 and ADC 322. The ADC may be activated and convert analog signals recorded from the sensors to digital signals for additional processing. The ADC may pass signals to processor 330. The processor 330 may be configured as one or more processors. The processor 330 may perform one or more levels of processing. For example, as shown the processor includes level 1 processing 331 and level 2 processing 332. The level 1 processing 331 may require less power than the level 2 processing 332 in order to operate.
The processor 330 may function to determine if the recorded signals are associated with the user speaking, either silently or voiced, based on if the user has spoken one or more known words or phrases, if the user has performed one or more known actions and/or the words or phrases spoken by the user. In some examples the processor 330 may additionally function to determine if the recorded signals are associated with the user preparing to speak, for example based on if the user has performed one or more known or recognized actions. In some examples, the different levels of processing within processor 330 may have different functions. For example, the level 1 processing 331 may function to determine if the user is speaking either silently or voiced and the level 2 processing may function to determine if the user spoke one or more known words or phrases, performed one or more known actions or the words or phrases spoken by the user. In some examples, the different levels of processing may have the same functions. For example, the level 1 processing 331 and level 2 processing 332 may both function to determine if the recorded signals are associated with the user speaking, either silently or voiced, if the user has spoken one or more known words or phrases, if the user has performed one or more known actions and/or the words or phrases spoken by the user. Any combination of processing functions may be performed by any level of processing within processor 330.
In some examples, the level 1 processing 331 and level 2 processing 332 may function to determine if the user is speaking or if the user is preparing to speak. The processing, including level 1 processing 331 and level 2 processing 332 may analyze the signals recorded by sensors of the device to determine if the user has performed one or more actions associated with them preparing to speak. Such actions may include the user positioning their tongue in a particular manner, the user positioning their jaw in a particular way, the user inhaling, the user positioning their head, the user opening their mouth or jaw, the user tensing speech articulator muscles, and the user clearing their throat among other actions indicating the user is preparing to speak. These actions may be recognized by level 1 processing 331 and/or by level 2 processing 332, as described herein.
In some examples, the level 1 processing 331 may involve simpler processing than the level 2 processing 332. For example, the first level of processing may include template matching of the received signals to one or more known signals. The one or more known signals may be recorded by the user 301 using the wearable device 300 and stored for later use. The known signals may be associated with a particular word, phrase, expression, or action the user performs. The first level of processing may additionally or alternatively involve using pattern recognition to analyze the signals. The pattern recognition may involve comparing the received signals to known signals or determining one or more features of the signals and comparing these features to known signal features associated with particular words, phrases, expressions and/or actions the user may perform. The known signals or features may be predetermined based on recordings from the user or others, using a wearable device such as wearable device 300. The level 1 processing 331 may output one or more signals indicative of a degree of matching between the processed signals recorded from the user 301 and the known signals or signal features. The control component 315 may analyze the outputs of the level 1 processing 331 to determine actions to perform, such as activating level 2 processing 332 or activating communication component 316.
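As an illustration of pattern recognition that outputs a degree of matching, a sketch using a few simple, assumed features (RMS energy, zero-crossing rate, spectral centroid) is shown below; a real implementation would likely normalize the features and use a richer feature set:

```python
import numpy as np

def features(signal, fs):
    """Illustrative features of a signal segment: RMS energy, zero-crossing
    rate, and spectral centroid."""
    rms = float(np.sqrt(np.mean(signal ** 2)))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal)))) / 2.0)
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    centroid = float(np.sum(freqs * spec) / (np.sum(spec) + 1e-9))
    return np.array([rms, zcr, centroid])

def degree_of_match(signal, fs, known_feature_sets):
    """Level 1-style output: similarity (inverse distance) between the
    segment's features and the closest stored word/phrase/action features."""
    f = features(signal, fs)
    distances = [np.linalg.norm(f - k) for k in known_feature_sets]
    return 1.0 / (1.0 + min(distances))
```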
The second level of processing may be more intensive than the first level. Level 2 may involve using one or more trained machine learning (ML) models 340 stored within the wearable device to analyze the signals. The one or more ML models may be used to determine one or more words or phrases spoken silently or voiced by the user. In some examples, the one or more trained ML models may include a trained neural network, such as a convolutional neural network. The trained neural network may be configured to recognize words or phrases or known actions from the recorded signals. The neural network may be a trained convolutional neural network. The ML models may be trained on speech data recorded using a wearable device such as wearable device 300, or other mobile devices as described herein. The training data may include data recorded from EMG sensors, microphones, and IMUs, and may be tagged with known words, phrases or actions which were spoken or performed during the recordings. In some examples, the level 1 processing 331 may implement a ML model.
In some examples, the level 2 processing 332 may perform natural language processing on determined words or phrases to determine an action to be performed by the wearable device. The natural language processing may involve comparing the words or phrases to a set of words or phrases maintained by the device which are associated with different actions. The level 2 processing 332 may provide outputs to the control component 315, which may determine one or more actions to perform, as discussed herein.
In some examples, the level 1 processing 331 involves processing fewer signals than the level 2 processing. For example, the level 1 processing may process signals from a subset of the sensors on the device, such as processing only the signals recorded from the microphone 311 and EMG sensors 312, while the level 2 processing may additionally process the signals from the IMU 313, and/or any other sensors on the wearable device 300. In some examples, the sensors of the mobile device may include multiple channels or multiple sensors. For example, the microphone 311 may include multiple microphones, such as at least 2, at least 3, at least 4, at least 5 or at least 10 microphones. Additionally, the EMG sensor 312 may include multiple electrodes capable of recording EMG signals from the user, for example the EMG sensor may include at least 5, at least 10, at least 20 or at least 50 electrodes. During the level 1 processing 331, a subset of the individual sensors of each of the sensors may be analyzed. For example, if the EMG sensor 312 includes 10 electrodes, 3 of the electrodes may be used in the level 1 processing 331. The level 2 processing 332 may analyze signals from a larger subset of individual sensors of each sensor than the level 1 processing 331 or may analyze signals from all of the individual sensors.
In some examples the ADC 322 may operate differently for the level 1 processing 331 and the level 2 processing 332. For example, the ADC 322 may operate at a lower sampling frequency for level 1 processing 331, and a higher sampling frequency for level 2 processing 332. The lower sampling frequency will result in less data being analyzed during the level 1 processing 331 and therefore will require less power to perform the analog to digital conversion and to perform the level 1 processing 331. The lower sampling frequency used for level 1 processing 331 may result in the loss of data, and therefore additional processing may be performed, such as level 2 processing 332. A portion of the data lost from the lower sampling frequency used for level 1 processing 331 is present in the data output from the ADC 322 when using the higher sampling frequency. In some examples, the lower sampling frequency may be below 50 Hz, between 50-100 Hz, between 50-150 Hz, or between 100-500 Hz. In some examples, the lower sampling frequency may be 50 Hz, 75 Hz, 100 Hz, 150 Hz or 200 Hz. In some examples, the higher sampling frequency may be above 500 Hz, between 500 Hz-1 kHz, between 750 Hz-1.5 kHz, between 1-5 kHz, between 1-10 kHz, between 1-20 kHz, between 1-50 kHz, between 16-44 kHz or above 50 kHz. In some examples, the higher sampling frequency may be 750 Hz, 1 kHz, 1.5 kHz, 2 kHz, 16 kHz, and/or 44 kHz. In some examples, signals from different sensors may be sampled at different frequencies. In such cases, different lower and higher sampling frequencies may be used for signals from different sensors. For example, signals from a microphone may be sampled at a first lower and a first higher sampling frequency, signals from an IMU may be sampled at a second lower and a second higher sampling frequency, different from the first sampling frequencies, and the EMG signals may be sampled at a third lower and a third higher sampling frequency, different from the first and second frequencies. In some examples, all signals may be sampled at the same frequencies. In some examples, some signals may be sampled at the same frequencies, while others are sampled at different frequencies. In some examples, signals from different sensors may be sampled at the same lower sampling frequency and different higher sampling frequencies. For example, the EMG signals, IMU signals and microphone signals may be sampled at the same lower sampling frequency, and the EMG signals and IMU signals are sampled at a first higher sampling frequency and the microphone signals are sampled at a second higher sampling frequency, different from the first higher sampling frequency. In some examples, the EMG signals, IMU signals and microphone signals are sampled at a lower sampling frequency of 100 Hz, and the EMG signals and IMU signals are sampled at a first higher sampling frequency of 1 kHz and the microphone signals are sampled at a second higher sampling frequency between 16-44 kHz.
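For illustration, the per-sensor lower and higher sampling frequencies from the last example above might be captured in a small configuration such as the following; the values are examples drawn from that passage, not required settings (the microphone's higher rate is shown at the low end of the 16-44 kHz range):

```python
# Illustrative per-sensor sampling-rate configuration for the two levels of
# processing; a real device could choose any of the ranges described above.
SAMPLE_RATES_HZ = {
    #             level 1 (lower)   level 2 (higher)
    "emg":        {"low": 100,      "high": 1_000},
    "imu":        {"low": 100,      "high": 1_000},
    "microphone": {"low": 100,      "high": 16_000},
}

def adc_rate(sensor, level):
    """Pick the ADC sampling rate for a sensor at a given processing level."""
    return SAMPLE_RATES_HZ[sensor]["low" if level == 1 else "high"]
```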
In some examples the level 1 processing 331 may involve less intensive processing of signals, processing signals from fewer individual sensors, and processing of signals sampled at lower frequencies than the level 2 processing 332. In some examples, the level 1 processing 331 may involve only less intensive processing of signals than the level 2 processing 332. In some examples, the level 1 processing 331 may involve only processing signals from fewer individual sensors than the level 2 processing 332. In some examples, the level 1 processing 331 may involve processing of signals sampled at lower frequencies than the level 2 processing 332. In some examples, the level 1 processing 331 and level 2 processing 332 involve the same intensity of processing, such as both involving template matching, pattern recognition, and/or the use of machine learning models, on data collected from different numbers of sensors or data collected at different sampling frequencies. In some examples, any combinations of processing intensity, number of sensor signals processed, and sampling frequencies of processed signals may vary between different levels of processing within processor 330.
The control component 315 may function to lower power consumption of the wearable device by preventing the device from activating components when the components are not needed by the user. For example, the control component 315 may prevent components such as the ADC 322 or processor 330 from activating when the user is not intending to use the speech recognition or recording functionality of the wearable device. This will prevent unnecessary draining of battery 314 and will allow the user to continue the use of the device over extended periods of time without needing to remove the device for charging.
Components of the wearable device 300 including signal processing component 320, processor 330, control component 315, communication component 316 and ML models 340, among other components of the wearable device may be implemented in any suitable way. For example, these components may be implemented as separate components within the device. In some examples, these components may be implemented as separate hardware components within the device. In some examples, these components may be implemented as multiple software modules stored within storage of the wearable device. In some examples, these components may be implemented within a single software module within storage of the wearable device. In some examples, these components may be implemented as a combination of hardware and software within the wearable device.
The control component 415 may activate certain components of the devices in response to signals recorded by the sensors, including microphone 411, EMG sensor 412 and IMU 413. The components of
The wearable device may begin in a low power state with the sensors, including microphone 411, EMG sensor 412 and IMU 413, the control component 415 and the threshold detection component 421 being activated. During silent speech, the user 401 may move their mouth as indicated by arrow 402. The EMG sensor 412 may record signals associated with this movement, while the microphone 411 and IMU 413 may not record significant sound or vibration, as the user is silently speaking. The signals recorded by the sensors may be sent to threshold detection component 421, which may determine the signals from the EMG sensor 412 are above a threshold level. The control component 415 may determine the user 401 is likely silently speaking based on the output of the threshold detection component 421. In some examples, the control component may change the state of the device directly to an activated state, by activating the communication component to begin transmitting the recorded signals to external devices, immediately after the signals are determined to be above the threshold. However, in the example shown in
As discussed herein, a processor may have multiple levels of processing which it may perform on the received signals, for example level 1 processing 431 and level 2 processing 432, which may perform the functions of level 1 processing and level 2 processing as discussed in
The first level of processing 431 may involve less intensive processing than level 2 processing 432 or may involve processing which requires less power than level 2 processing 432. Level 1 processing 431 may include template matching of the recorded signals to known signals or pattern recognition of the recorded signals against known signals. This processing may determine whether the user 401 is speaking one or more known words or phrases or is performing one or more actions which can be recognized by the wearable device. The processing at level 1 431 may also confirm whether the user is silently speaking, speaking out loud, preparing to speak or performing a known action, or if the signals above threshold are due to another action of the user such as sneezing, coughing, chewing, and/or laughing, among other actions. As shown, the level 1 processing may recognize that the signals from the user are associated with silent speech 403; however, the words or phrases spoken by the user are not among the known words or phrases and therefore are not recognized by the level 1 processing. Additionally, or alternatively, the level 1 processing 431 may not be able to determine the source of the signals and therefore, additional processing may be performed to determine if the user is silently speaking or not. The control component 415 may control the transition from level 1 processing 431 to level 2 processing 432, as discussed herein.
In some examples, the level 1 processing may determine a probability of whether the user is silently speaking or has performed a known action. If the probability is above a first threshold, for example above 70%, above 75%, above 80%, above 90%, or above 95%, the control component may directly transition the wearable device to the activated state. In some examples, if the level 1 processing determines the probability of the user silently speaking or having performed a known action is below the first probability, but above a second probability, for example above 30%, above 40%, above 50% or above 60%, the control component may activate the level 2 processing 432.
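One way to sketch the control component's state transitions out of level 1 processing, using the illustrative probability thresholds above, is shown below; the state names and threshold values are assumptions for illustration only:

```python
from enum import Enum, auto

class DeviceState(Enum):
    LOW_POWER = auto()
    STANDBY_1 = auto()   # level 1 processing active
    STANDBY_2 = auto()   # level 2 processing active
    ACTIVATED = auto()   # communication / transcription active

def next_state(state, p_speaking, p_high=0.8, p_low=0.4):
    """Illustrative control transition out of level 1 processing: a
    confident result activates the device, an ambiguous result escalates
    to level 2 processing, and anything else stays in standby 1."""
    if state is DeviceState.STANDBY_1:
        if p_speaking >= p_high:
            return DeviceState.ACTIVATED
        if p_speaking >= p_low:
            return DeviceState.STANDBY_2
    return state
```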
The control component 415 may analyze the results of the level 1 processing 431 and determine one or more actions to be performed. For example, the control component 415 may change the state of the device to the activated state by activating the communication component 416 to begin transmitting signals if a particular word, phrase or action is identified by level 1 processing 431. Alternatively, and as shown, the control component may activate level 2 processing 432 to further confirm the silent speech from the user, for example if the level 1 processing 431 recognizes silent speech, but not a particular word or phrase, or if the level 1 processing 431 determines a probability of the user silently speaking or having performed a known action is between a first and second probability. If level 1 processing 431 determines the output is not indicative of a recognized word, phrase, action or silent speech, the control component 415 may not perform any action and will keep level 1 processing activated. Additionally, or alternatively, after a threshold amount of time has passed, the control component 415 may deactivate level 1 processing 431, and ADC 422.
Level 2 processing 432 may be more intensive than level 1 processing and may involve the use of one or more ML models, as discussed herein. As shown, level 2 processing is performed when the device is in the standby 2 state. The level 2 processing 432 may function to determine one or more words or phrases from the signals recorded by the wearable device 400. As shown, the level 2 processing analyzes the signals and determines the silent speech 403 to state “Activate Silent Speech.” Based on this recognition of silently spoken words, the level 2 processing may additionally include performing natural language processing on the words to determine an action intended by the user. The level 2 processing may additionally or alternatively function to determine whether the signals are related to silent speech or are the result of another activity of the user. For example, continued coughing, sneezing and chewing, among other activities, may generate signals above a threshold, which cannot be recognized by the level 1 processing. Therefore, level 2 processing may be used to determine the source of the signals and an output may indicate whether the user is silently speaking or not. The level 2 processing may determine the signals are related to speech if transcription of words or phrases is output based on the signals. The determined words and phrases and/or action may be output to the control component which may perform an action based on the received outputs.
In some examples, the level 1 processing 431 may involve analyzing signals from a subset of the sensors of the wearable device. For example, the level 1 processing 431 may involve analyzing signals from a subset of the EMG electrodes of EMG sensor 412 or may involve analyzing signals recorded from only EMG sensor 412 and not from the microphone 411 or IMU 413. The level 2 processing 432 may then involve analyzing signals recorded from all electrodes of EMG sensor 412 or signals from additional sensors such as microphone 411 or IMU 413. In some examples, the level 1 processing 431 may analyze signals which have been sampled at a first frequency by ADC 422 and the level 2 processing 432 may analyze signals which have been sampled at a second frequency by the ADC 422, the second frequency being higher than the first frequency. Sampling at the higher frequency may require more power than sampling at the lower frequency. The control component 415 may control the sampling rate of the ADC when changing between level 1 and level 2 processing.
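The sensor-subset and sampling-rate differences between the two processing levels could be captured in a small configuration, sketched below. The field names, channel counts, sampling rates, and the ADC method calls are illustrative assumptions rather than the actual device interface.

```python
# Illustrative configuration only: level 1 analyzes a subset of EMG channels at
# a lower ADC rate, level 2 uses all sensors at a higher rate, as described above.
LEVEL_1_CONFIG = {
    "sensors": ["emg"],                       # EMG only; microphone and IMU ignored
    "emg_channels": [0, 1],                   # subset of the EMG electrodes
    "adc_sample_rate_hz": 250,                # lower first sampling frequency
}
LEVEL_2_CONFIG = {
    "sensors": ["emg", "microphone", "imu"],  # all sensors analyzed
    "emg_channels": list(range(8)),           # all electrodes of the EMG sensor
    "adc_sample_rate_hz": 1000,               # higher second sampling frequency
}

def apply_processing_level(adc, level_config: dict) -> None:
    """The control component could reconfigure the ADC when switching levels.
    The adc methods here are hypothetical placeholders."""
    adc.set_sample_rate(level_config["adc_sample_rate_hz"])
    adc.enable_channels(level_config["emg_channels"])
```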
The level 1 processing and level 2 processing may involve any of the combinations discussed herein. For example, the level 1 processing may involve processing a subset of electrodes of EMG sensor 412 at a lower sampling rate, while the level 2 processing may involve processing signals from all electrodes of EMG sensor 412 at a higher sampling rate.
As shown, the control component activates the communication component when the wearable device transitions to the activated state, such that the wearable device can transmit silent speech signals to external devices. In some examples, the control component may activate one or more other components of the wearable device when the device transitions to the activated state. For example, the control component may activate a processor of the wearable device to begin transcribing the one or more words or phrases spoken by the user. Transcribing the one or more words or phrases may include using one or more machine learning models to analyze signals recorded by the sensors of the wearable device. In some examples, the device may transcribe the words or phrases from the recorded signals and, in the activated state, transmit the transcription to a connected device. In some examples, the wearable device may not transcribe the recorded signals and may instead, in the activated state, transmit the recorded signals to a connected device for transcription.
The control component 415 may determine from the output of level 2 processing 432 that the words or phrases are related to activating the silent speech function of the device, and therefore the wearable device is activated to begin transmitting data such that the silent speech functionalities may be used. The control component may also activate the wearable device if it is determined in the level 2 processing that the user is silently speaking. If the output of the level 2 processing 432 does not indicate that silent speech is occurring, or indicates that the user did not speak words or phrases indicative of one or more actions, the control component may not activate the device or communication component 416 and may instead keep the level 2 processing active until an output indicates the user is silently speaking or silently spoke words or phrases associated with a particular action. Additionally, or alternatively, the control component 415 may deactivate level 2 processing 432, for example after a threshold amount of time has passed.
During voiced speech, the user 401 may move their mouth as indicated by arrow 402 and produce sounds indicated by sound 404. The microphone 411 may record the sound produced, EMG sensor 412 may record signals associated with the movement of the user's face, and the IMU 413 may record signals associated with vibrations produced during speech. The signals recorded by the sensors may be sent to threshold detection component 421, which may determine the signals from the microphone 411, EMG sensor 412 and IMU 413 are all above a threshold level. The control component 415 may determine the user 401 is likely speaking out loud based on the output of the threshold detection component 421. Based on this output, the control component 415 may directly transition the device to the activated state by activating the communication component to transmit the voiced speech signals to external devices or transcribing the words or phrases spoken by the user, or may, as shown, activate additional signal processing and transition the wearable device 400 to the standby 1 state.
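A simple sketch of the threshold detection logic just described follows. The threshold values and return labels are illustrative assumptions; the actual component 421 may operate on raw or filtered signal levels in any suitable way.

```python
# Sketch of coarse source classification from per-sensor signal levels, as
# described above; threshold values and names are assumptions for illustration.
EMG_THRESHOLD = 0.05   # arbitrary units
MIC_THRESHOLD = 0.02
IMU_THRESHOLD = 0.01

def classify_activity(emg_level: float, mic_level: float, imu_level: float) -> str:
    """Return a coarse guess at the signal source from per-sensor levels."""
    if emg_level > EMG_THRESHOLD and mic_level > MIC_THRESHOLD and imu_level > IMU_THRESHOLD:
        return "likely voiced speech"   # all three sensors above threshold
    if emg_level > EMG_THRESHOLD and mic_level <= MIC_THRESHOLD and imu_level <= IMU_THRESHOLD:
        return "likely silent speech"   # facial muscle activity without sound or vibration
    return "no speech detected"
```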
In response to the signals recorded by the microphone 411, EMG sensor 412, and IMU 413 being over the respective thresholds, the control component may activate the level 1 processing 431, as discussed herein. As shown, the level 1 processing 431 recognizes sound 404 to be voiced speech; however, particular words and phrases were not recognized, and therefore the control component 415 activates level 2 processing 432 in the standby 2 state.
Level 2 processing 432 analyzes the recorded signals and determines that the words the user spoke were “activate voiced speech.” The processing performed in level 2 processing may include analysis by a machine learning model and/or use of natural language processing to determine an action to perform based on the identified words and phrases, and other processes as discussed herein. The control component may determine based on the recognized words and phrases that the user intends to use the voiced speech functionality of the wearable device 400. In response, the control component 415 transitions the wearable device to the activated state by activating the communication component 416, which begins transmitting the recorded signals to external devices. In addition, the control component may put the wearable device into a voiced speech mode, in which components associated with voiced speech are activated and components not used for voiced speech are deactivated. As shown in
In some examples, the wearable device may be configured to switch modes (e.g., to a silent speech mode or a voiced speech mode) in response to the user changing their method of speaking. For example, if a device is in voiced speech mode and determines the user has begun to speak silently the device may change to the silent speech mode. Similarly, if the device is in silent speech mode and determines the user has begun to speak out loud, the device may switch to the voiced speech mode.
The low power state may have only the sensors or a subset of the sensors, such as the microphone, EMG sensors, IMUs, and any other sensors, and a threshold detection component such as threshold detection component 421 activated, as discussed regarding
The wearable device may switch to a standby state 520 in response to one or more of the recorded signals increasing to a level above a threshold level. As shown, standby state 520 includes two states, standby 1 state 521 and standby 2 state 522; however, in other examples, greater or fewer standby states may be used. For example, 1 standby state, 3 standby states, 4 standby states, 5 standby states, or a greater number of standby states may be used. The standby states 520 may involve a greater level of processing than the low power state 510 and therefore will require more power than the low power state 510; however, they do not require as much power as an activated state 530.
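For illustration, the states described in this example could be represented with a simple enumeration, as sketched below. The state names and their characterizations are assumptions drawn from the surrounding description, not a definitive implementation.

```python
# Minimal sketch of the device states described above, assuming a simple enum.
from enum import Enum, auto

class DeviceState(Enum):
    LOW_POWER = auto()   # sensors plus threshold detection only
    STANDBY_1 = auto()   # template matching / pattern recognition
    STANDBY_2 = auto()   # ML-based word/phrase recognition
    ACTIVATED = auto()   # transmission and/or transcription enabled

# Approximate ordering by power draw, lowest to highest.
POWER_ORDER = [DeviceState.LOW_POWER, DeviceState.STANDBY_1,
               DeviceState.STANDBY_2, DeviceState.ACTIVATED]
```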
The standby 1 state 521 may involve processing such as template matching and pattern recognition, as discussed with reference to processing level 1 of
If the processing in the standby 1 state 521 only recognizes the signals as speech signals, or recognizes the signals are likely to be speech signals above a threshold probability or between threshold probabilities, the device may then proceed to standby 2 state, as discussed herein. If the signals are not recognized as speech or as being associated with a known word, phrase or action, the device may remain in standby 1 state 521 until a threshold amount of time has passed, and then return to the low power state 510.
The standby 2 state 522 may involve higher level processing than the standby 1 state 521 and therefore may use more power than the standby 1 state 521. For example, the standby 2 state may perform processing associated with the level 2 processing of
In the activated state 530, the device may record signals and perform processing of the signals. In some examples, the wearable device may transcribe the words or phrases spoken by the user in the activated state. In some examples, the wearable device may transmit the recorded signals or transcribed signals to one or more connected devices in the activated state. In some examples, the processing performed in the activated state is different from the processing which controls the states of the device. In some examples, the recorded signals are transmitted directly to the external devices. In some examples, the wearable device remains paired or connected to the external devices in the low power state 510 and/or the standby state 520, but does not transmit data to the external devices. The activated state 530 requires more power than any of the low power state 510, standby 1 state 521, and/or standby 2 state 522, and it is therefore desirable to limit the amount of time the wearable device is in the activated state 530 when the user is not intending to use the functions of the wearable device. This will ensure the battery of the wearable device is conserved and will allow the wearable device to be worn continuously for longer periods of time without the need for charging.
In some examples, the wearable device may provide an indication to users that a change in state or mode of the wearable device has occurred. For example, the wearable device may play a particular sound over the speaker such as a chime or tone or may play words over the speaker such as “device activated.” In some examples, the wearable device may vibrate to indicate a change in mode or state of the wearable device has occurred. In some examples, the wearable device may activate one or more lights to visibly indicate a change in state or mode has occurred, for example a light may turn on or change color when the device is in the activated state. In some examples, the wearable device may transmit a signal to a connected external device, which results in a notification being displayed or otherwise provided by the external device that a change in state or mode of the wearable device has occurred. In some examples, an indication may be provided for all changes in state or mode of the device. In some examples, an indication may be provided only for changes which relate to the functions of the device, such as a change into the activated state, a change into the low power state, or a change to the voiced speech mode or silent speech mode.
In some examples, the wearable device may be configured to recognize a signal to cancel or stop an action. For example, a user may accidentally activate the activated state of the device or may decide they no longer want to change a state or a mode of the device. In such examples, the user may say a particular word or phrase such as "cancel," "undo," or "I didn't mean to do that," among other words or phrases, and/or may perform an action such as tapping their cheek or clenching their teeth, among other actions. In response, the wearable device may stop the change in state or mode of the device or may revert the device to the previous state or mode.
In some examples, the threshold for changing states varies according to the state the device is in, in order to reduce latency of the device and ensure swift activation of the device when the user is intending to use the device. For example, the threshold for advancing from the low power state 510 to the standby 1 state 521 may be relatively high compared to other thresholds. This threshold may be set at a relatively high level to prevent false positive activations from occurring, or instances where the user is not intending to activate the device but the device still activates. This threshold may be determined by recording, with the wearable device, the user when speaking out loud and silently and when performing actions such as chewing, smiling, and frowning, among other actions. A comparison may be made between the recordings made during speaking and the recordings made during other activities, and the threshold may be determined based on the comparison. For example, a threshold may be selected that ensures all speech recordings or activities associated with speech would result in a state change, while few or none of the other activities would result in a state change. In some examples, a threshold may be chosen such that most or a high percentage, such as at least 65%, at least 70%, at least 80%, at least 90%, at least 95%, or at least 99%, of the speech recordings result in a state change while few or none of the other activities result in a state change, for example at most 0%, at most 5%, at most 10%, at most 20%, at most 30%, or at most 40% of the other activities.
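One way to pick such a threshold from labeled recordings is sketched below. The target rates, the simple grid search over candidate levels, and the example data are illustrative assumptions only.

```python
# Sketch of choosing the low-power-state threshold from labeled recordings, as
# described above; data, target rates, and the grid search are illustrative.
def choose_threshold(speech_levels, other_levels,
                     min_speech_rate=0.95, max_other_rate=0.05):
    """Pick the highest threshold that still triggers on >=95% of speech
    recordings while triggering on <=5% of the other activities."""
    candidates = sorted(set(speech_levels) | set(other_levels), reverse=True)
    for threshold in candidates:
        speech_rate = sum(level > threshold for level in speech_levels) / len(speech_levels)
        other_rate = sum(level > threshold for level in other_levels) / len(other_levels)
        if speech_rate >= min_speech_rate and other_rate <= max_other_rate:
            return threshold
    return None  # no threshold satisfies both targets; relax the rates

# Example with made-up signal levels (speech tends to exceed chewing, smiling, etc.).
threshold = choose_threshold([0.9, 0.8, 0.85, 0.7], [0.2, 0.3, 0.25, 0.4])
```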
Relative to the threshold for a change from the low power state, a change from the standby 1 state may have a lower threshold. Because the threshold for activation to the standby 1 state is relatively high, it is more likely that the user is intending to use the wearable device. Therefore, when determining whether the user is silently speaking or has spoken one or more known words or phrases or performed one or more known actions, the transition to standby 2 state may occur more often than the transition from the low power state to the standby 1 state. In some examples, the wearable device may utilize a machine learning model to analyze the signals recorded from the wearable device. In some examples, the machine learning model may output a probability distribution including a probability indicative of the likelihood the user has spoken one or more known words or phrases or performed one or more known actions. The state may change to the standby 2 state or the activated state if the probability the user has spoken one or more known words or phrases or performed one or more known actions is above a threshold probability. The threshold probability may be set such that many or most signals result in a state change. For example, the state may change to standby 2 state if the probability is above 10%, above 20%, above 30%, above 40%, above 50%, above 60%, or above 70%, while the state may change to the activated state if the probability is above 50%, above 60%, above 70%, above 80%, or above 90%.
In the standby 2 state, the machine learning model may also output a probability distribution of the likelihood the user is speaking words or phrases or has performed a speech related action. The state may change to the activated state if the probability of the user speaking words or phrases or having performed a speech related action is above a threshold probability. The threshold probability may be lower relative to the thresholds used for the transition to standby 1 state and/or standby 2 state, because the signals have already been analyzed; for example, the system may transition to the activated state if the probability is above 10%, above 20%, above 30%, above 40%, above 50%, above 60%, or above 70%.
The relative decrease in thresholds for the state changes to states which require additional power or processing, for example from the low power state to standby 1 state, from standby 1 state to standby 2 state or the activated state, and/or from standby 2 state to the activated state, ensures the wearable device responds quickly to user inputs and avoids unnecessary delays in activation due to increased processing times. The relative thresholds may decrease because the previous processing indicates the signals are likely associated with speech, and therefore the additional processing is used to confirm the results of the previous processing.
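The decreasing per-transition thresholds could be tabulated as sketched below. The state labels and numeric values are assumptions consistent with the example percentages above, not the actual configuration.

```python
# Illustrative per-transition probability thresholds, decreasing as the device
# moves toward higher-power states, as described above.
TRANSITION_THRESHOLDS = {
    ("standby_1", "standby_2"): 0.40,   # level 1 output probability
    ("standby_1", "activated"): 0.80,   # direct activation needs higher confidence
    ("standby_2", "activated"): 0.30,   # earlier processing already suggested speech
}

def should_transition(current_state: str, next_state: str, probability: float) -> bool:
    """Compare a model probability against the threshold for this transition."""
    return probability >= TRANSITION_THRESHOLDS[(current_state, next_state)]
```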
In some examples, the wearable device may comprise one or more input sensors, separate from the sensors used for recording signals associated with silent speech, such as inputs 115 of
In some examples, the wearable device may include one or more components such as memory buffers to temporarily store signals recorded by the wearable device. The memory buffers may be implemented as hardware modules or may be implemented as software programs which store the data in a particular location within memory of the wearable device. The memory buffers may store data including signals, processed signals, and filtered signals. The memory buffers may always be activated and may store the signals recorded by the device for a set amount of time. For example, the memory buffers may store the last 5 seconds of recorded signals, the last 10 seconds of recorded signals, the last 20 seconds of recorded signals, the last 30 seconds of recorded signals, or the last minute of recorded signals. The memory buffer ensures the content of signals, previously or currently being recorded, is not lost when the device is processing recorded signals. Therefore, if the device determines the user is speaking through processing in the low power state, standby 1 state, or standby 2 state, the signals recorded before and during the processing may be transmitted to connected devices for processing and analysis. The memory buffers may temporarily store all recorded signals, recorded signals above a threshold level, or signals recorded during the standby state.
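A software memory buffer holding the most recent samples can be sketched as below. The class name, buffer length, and sampling rate are illustrative assumptions; a hardware buffer would behave similarly.

```python
# Minimal sketch of a memory buffer keeping the last N seconds of samples,
# assuming a fixed sampling rate; a deque gives the rolling-window behavior.
from collections import deque

class SignalBuffer:
    def __init__(self, seconds: float = 10.0, sample_rate_hz: int = 250):
        self._samples = deque(maxlen=int(seconds * sample_rate_hz))

    def append(self, sample: float) -> None:
        """Oldest samples are discarded automatically once the buffer is full."""
        self._samples.append(sample)

    def snapshot(self) -> list:
        """Return the buffered samples, e.g., to transmit once the device activates."""
        return list(self._samples)
```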
The wearable device may also automatically change states from states requiring higher power to states requiring lower power. These transitions may occur from the recognition of known signals, after a threshold amount of time has passed, or from a user input to input sensors. The wearable device may change states to a lower power state after a threshold amount of time has passed since a particular action has occurred. For example, the threshold amount of time may be measured from when the wearable device last detected signals from the user and/or last detected speech signals from the user. Any suitable time threshold may be used, for example, up to 5 minutes, up to 4 minutes, up to 3 minutes, up to 2 minutes, up to 1 minute, up to 30 seconds, or up to 20 seconds. The wearable device may transition from the activated state to the standby 2 state, from the standby 2 state to the standby 1 state, and from the standby 1 state to the low power state based on a threshold amount of time passing. The wearable device may also transition from the low power state to a fully off state based on a threshold amount of time passing. The threshold amount of time may vary based on the state the device is transitioning to or from.
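The timeout-driven demotion to lower-power states could look like the sketch below. The timeout values, state names, and demotion order are illustrative assumptions drawn from the examples above.

```python
# Sketch of timeout-driven transitions to lower-power states; values and state
# names are illustrative assumptions.
import time

DEMOTION_TIMEOUTS_S = {
    "activated": 60,    # activated -> standby 2 after 1 minute without speech
    "standby_2": 30,    # standby 2 -> standby 1
    "standby_1": 20,    # standby 1 -> low power
    "low_power": 300,   # low power -> fully off
}
NEXT_LOWER_STATE = {
    "activated": "standby_2",
    "standby_2": "standby_1",
    "standby_1": "low_power",
    "low_power": "off",
}

def maybe_demote(state: str, last_activity_time: float, now: float = None) -> str:
    """Return the next lower state if the per-state timeout has elapsed."""
    now = time.monotonic() if now is None else now
    if now - last_activity_time >= DEMOTION_TIMEOUTS_S[state]:
        return NEXT_LOWER_STATE[state]
    return state
```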
The wearable device may also transition to lower power states based on known signals recognized from the signals recorded by the sensors of the wearable device. For example, the wearable device may store one or more known signals which are indicative of the user desiring to transition the device to the low power state. In some examples, the wearable device may store signals indicative of the user desiring to power off the device. The wearable device may recognize the signal when in one of the transmitting, standby 2, or standby 1 states, and may directly enter the low power state or power off in response to recognizing the signals. The signals may be associated with one or more known words or phrases or with one or more known actions. For example, the user may speak either out loud or silently “power off.” “low power state,” “switch to low power.” and “turn off,” among other words or phrases. The user may also perform a known action such as tapping their cheek a certain number of times or clenching their jaw, among other actions discussed herein. In some examples, the external device may recognize the known words, phrases or actions from the recorded signals when the device is in the activated state. The external device may transmit a signal to the wearable device that a known word, phrase or action was identified and the wearable device may transition states in response to the signal from the external device.
Process 600 begins at step 601, which involves recording signals indicative of facial movement and actions performed by a user, at a face, head or neck of the user. The signals may be recorded using different sensors as described herein. For example, EMG sensors, microphones or IMUs may be used, as described with reference to
Process 600 then proceeds to step 602, in which the signals recorded in step 601 are analyzed. The analysis of the signals may include analysis as described in
The analyzing may be performed in one or more states of the device, as discussed herein. The states of the device may correspond to the different types of processing and analysis which occurs in the state. For example, initial analyzing may occur in a low power state, while subsequent analyzing may occur in a standby state of the device.
Process 600 then proceeds to step 603, in which the state of the device is changed to an activated state in response to the results of the analyzing of step 602. Changing the device to the activated state may involve activating one or more components of the device such that the recorded signals may be transmitted to an external device which is connected to the wearable device. Changing to the activated state may involve using a communication component of the wearable device, as described with reference to
Other actions may be performed by the wearable device, including changing the mode of the wearable device to a voiced speech mode or a silent speech mode. The wearable device may change to a silent speech mode when it is detected that the user is speaking silently based on the recorded sensor signals, such as when the microphone and IMUs are below respective thresholds and the EMG sensor is above the respective threshold. The wearable device may change to a voiced speech mode when it is detected that the user is speaking out loud, for example when it is determined the EMG sensor, microphone, and IMU are above respective thresholds. The wearable device may also change modes after determining the user has performed a known action, has spoken a known word or phrase associated with changing the mode of the device, and/or has spoken an instruction to change the mode of the device.
Changing the mode of the device in response to the results of the analyzing may involve activating or deactivating certain device components. For example, if the device is in the silent speech mode, the EMG sensors may be activated and the microphone and IMU may be deactivated. In some examples, when in the silent speech mode, the microphone and IMU may be activated while the signals recorded by the microphone and IMU are not analyzed or are partially analyzed. For example, the signals from the microphone and IMU may be analyzed by a threshold detection component, such as discussed with regard to
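A per-mode component configuration along these lines is sketched below. The mode names, component settings, and the device method are hypothetical placeholders used only to illustrate the activation and deactivation just described.

```python
# Illustrative per-mode component configuration; names and the configure() call
# are assumptions, not the actual device API.
MODE_COMPONENTS = {
    "silent_speech": {"emg": "active", "microphone": "threshold_only", "imu": "threshold_only"},
    "voiced_speech": {"emg": "active", "microphone": "active", "imu": "active"},
}

def configure_mode(device, mode: str) -> None:
    """Apply the component settings for the selected speech mode."""
    for component, setting in MODE_COMPONENTS[mode].items():
        device.configure(component, setting)  # hypothetical device interface
```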
In some embodiments, wearable device 700 may record silent and/or voiced speech signals of the user from the one or more sensors and transmit the text or encoded features of the user's speech (e.g., obtained from a speech model on the wearable device) to the external device, where the wearable device 700 has a built-in speech model. Alternatively, and/or additionally, the wearable device 700 may record silent and/or voiced speech signals of the user from the one or more sensors and transmit the signals (sensor data) to the external device, where the external device has a speech model to predict text or encoded features using the sensor data and further provide the predicted text or encoded features to an application to take one or more actions. For example, the external device 710A or 710B may use the text or encoded features from the user's speech (e.g., via the speech model) to control one or more aspects of the connected external device 710A or 710B. For example, the signals obtained from the one or more sensors (e.g., 711, 706) associated with the user's speech may be used to control a user interface of the connected external device, to control an application of the device, to provide an input to the device, to retrieve information from the device, or to access or control one or more additional functions of the device, as discussed herein.
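The two transmission options just described (on-device decoding versus sending raw sensor data) could be expressed as in the sketch below. The function and payload field names are assumptions for illustration only.

```python
# Sketch of the two options described above: run an on-device speech model and
# send text/encoded features, or send raw sensor data for external decoding.
def prepare_payload(sensor_data, on_device_model=None):
    """Build the payload the wearable would transmit to the external device."""
    if on_device_model is not None:
        decoded = on_device_model.predict(sensor_data)   # text or encoded features
        return {"type": "decoded_speech", "content": decoded}
    return {"type": "raw_sensor_data", "content": sensor_data}
```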
In some embodiments, the sensor data indicating the user's speech muscle activation patterns, e.g., EMG signals, may be collected using a wearable device. The speech model 802 may be trained to use the sensor data to predict text or encoded features. Although it is shown that the EMG signals are associated with the user speaking silently, it is appreciated that the EMG signals may also be associated with the user speaking loudly, or in a whisper, and may be used to train the speech model to predict the text or encoded features. Thus, the domain of the signals used for inference (target domain) and the domain of the signals used for training the speech model (source domain) may vary, as will be further described.
In some embodiments, training data for the speech model 802 may be associated with a source domain (collection domain). In some embodiments, the source domain may be a voiced domain, where the signals indicating the user's speech muscle activation patterns are collected from voiced speech of training subject(s). In some embodiments, the source domain may be a whispered domain, where the signals indicating the user's speech muscle activation patterns are collected from whispered speech of training subject(s). In some embodiments, the source domain may be a silent domain, where the signals indicating the user's speech muscle activation patterns are collected from silent speech of training subject(s).
As described herein in the present disclosure, voiced (vocal) speech may refer to a vocal mode of phonation in which the vocal cords vibrate during at least part of the speech for vocal phonemes, creating audible turbulence during speech. In a non-limiting example, vocal speech may have a volume above a volume threshold (e.g., 40 dB when measured 10 cm from the user's mouth). In some examples, silent speech may refer to an unvoiced mode of phonation in which the vocal cords are abducted so that they do not vibrate, and no audible turbulence is created during speech. Silent speech may occur at least in part while the user is inhaling and/or exhaling. Silent speech may occur in a minimally articulated manner, for example, with visible movement of the speech articulator muscles, or with limited to no visible movement, even if some muscles such as the tongue are contracting. In a non-limiting example, silent speech has a volume below a volume threshold (e.g., 30 dB when measured about 10 cm from the user's mouth). In some examples, whispered speech may refer to an unvoiced mode of phonation in which the vocal cords are abducted so that they do not vibrate, where air passes between the arytenoid cartilages to create audible turbulence during speech.
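Using the example volume thresholds above (40 dB for voiced, 30 dB for silent, measured near the mouth), a rough volume-based classification could be sketched as follows. Treating the in-between range as whispered is an assumption for illustration, not a definition from the disclosure.

```python
# Rough phonation-mode guess from measured volume at ~10 cm, using the example
# thresholds above; the "whispered" band in between is an illustrative assumption.
def phonation_mode(volume_db: float) -> str:
    if volume_db >= 40:
        return "voiced"
    if volume_db <= 30:
        return "silent"
    return "whispered"
```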
In some embodiments, the target domain (e.g., a domain used for inference) may preferably be the silent domain. In some embodiments, the target domain may be the whispered domain. It is appreciated that the target domain may also be the voiced domain or any other domain. In some embodiments, the source domain may be the voiced domain, the whispered domain, the silent domain, or a combination thereof. For example, the training data for the speech model may be collected from both voiced speech and silent speech, each contributing to a respective portion of the training data.
In some embodiments, act 1002 may be performed for an individual user, for a group of users, for one or more collection domains (as described above and further herein), and/or otherwise performed. In some embodiments, training data may be generated in one or more sampling contexts at act 1002. A sampling context may refer to an environment in which the training data is generated. For example, a sampling context may include the training subject being presented with a prompt (e.g., in a data collection center), and speaking the prompt in the source (collection) domain (e.g., voiced, whispered, silent, etc.). The prompt may be text (e.g., a script), audio prompt, and/or any other prompt. In some embodiments, a training system may output the prompt (e.g., display a phrase on a screen, or play an audio prompt in an audio device) to a training subject and ask the training subject to repeat the phrase using voiced speech, whispered speech, and/or silent speech.
In non-limiting examples, the training system may ask the training subject to use voiced speech in one or more voiced speech trials, to use silent speech in one or more silent speech trials, and/or to use whispered speech in one or more whispered speech trials, where each trial corresponds to a single prompt or a set of prompts. In some embodiments, voiced speech trials may be arranged between sets of silent speech trials. For example, a voiced speech trial may be used every K silent speech trials, where K may be in a range of 1-1000 or 5-100, or may be greater than a threshold value, e.g., greater than 1000.
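The interleaving of one voiced trial per K silent trials could be scheduled as in the sketch below. The function name, default K, and the returned structure are illustrative assumptions.

```python
# Sketch of interleaving voiced trials between silent trials, using the
# "one voiced trial every K silent trials" arrangement described above.
def build_trial_schedule(prompts, k: int = 10):
    """Label each prompt with a speech domain: every (k+1)-th trial is voiced."""
    schedule = []
    for i, prompt in enumerate(prompts):
        domain = "voiced" if (i + 1) % (k + 1) == 0 else "silent"
        schedule.append((prompt, domain))
    return schedule

# Example: with k=2, every third prompt would be spoken out loud.
schedule = build_trial_schedule(["hello", "open mail", "call home", "stop"], k=2)
```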
In some embodiments, the training system may provide auditory feedback to improve the accuracy of training data collection, training data labeling, and/or otherwise improve the model training. For example, the auditory feedback may include voice converted from the inferred text from the silent or whispered speech, where the training system may play back the auditory feedback to the training subject during the training data collection.
In some embodiments, prompts in collecting the training data may be segmented. For example, the training subject and/or another person may optionally delineate the start and/or end of each prompt, sentence within the prompt, word within the prompt, syllable within the prompt, and/or any other segment of the prompt. Additionally, and/or alternatively, auxiliary measurements (e.g., video of the training subject while speaking, inertial measurements, audio, etc.) sampled while the training subject is speaking may be used to determine the prompt segmentation (e.g., each segment's start and end timestamps).
In some embodiments, a sampling context for generating training data may not include a prompt. Rather, training data may be collected during spontaneous speech. For example, the training data may be sampled while the training subject speaks (e.g., voiced, whispered, silent, etc.) and/or performs other actions in their usual environment (e.g., attending meetings, taking phone calls, etc.). In such a context, background training data can be collected, where the background training data includes the user's speech responsive to operation mode selection by the user (e.g., turning on the device, user indication to interpret the signals, etc.) and/or without operation mode selection by the user (e.g., continuous data collection, automatic data collection responsive to a sensed event, etc.). In some embodiments, background training data collected without explicit prompts may enable training and/or calibrating a personalized speech model, enable continual training and/or calibration (e.g., outside of data collection centers; while all or parts of the system are not in active use for silent speech decoding and/or for controlling a device based on decoded silent speech; etc.), decrease silent speech decoding errors, and/or provide other advantages.
In some embodiments, a sampling context for generating training data may include other scenarios, e.g., the user's actions associated with speaking. For example, the sampling context may include the user sitting, walking, jumping up and down, or taking other actions while speaking.
In some embodiments, training data may be collected by using one or more measurement systems containing one or more sensors such as described herein, for example using a wearable device as described herein. In some embodiments, the measurement systems may include an electrophysiology measurement system including one or more sensors configured to capture one or more types of signals that indicate the user's speech muscle activation patterns associated with the user's speech, e.g., EMG signals, EEG signals, EOG signals, ECG signals, EKG signals, etc., or other suitable biometric measurement systems. In some embodiments, the measurement systems may include one or more of: motion sensors (e.g., IMUs), microphones, optical sensors configured to detect the movement of the user's skin (e.g., infrared cameras with a dot matrix projector), video cameras configured to capture images, videos, motion capture data, etc., sensors configured to detect blood flow (e.g., PPG, fNIRS), thermal cameras, depth/distance sensors (e.g., Time of Flight sensors), and/or any other measurement systems. Data collected from a measurement system can correspond to a measurement modality.
In some embodiments, EMG sensors may be placed on a training subject to capture the training data. For example, EMG sensors may be placed at or near any target zones, such as shown in
In some embodiments, training data may be synthetically generated. In some embodiments, training data captured in one domain may be used to generate training data in another domain. For example, synthetic silent domain measurements may be generated by sampling voiced domain measurements and subtracting the glottal vibrations (e.g., determined using an accelerometer, a microphone, etc.). In another example, a model may be trained to generate synthetic silent domain measurements based on voiced domain measurements (e.g., using paired silent and voiced measurements for the same training subject, for the same prompt, etc.). For example, the model can be trained using generative and/or de-noising methods (e.g., Stable Diffusion).
In some embodiments, a relationship between sets of source domain training data generated in different sampling contexts may be used to augment target domain training data. For example, voiced speech training data may include paired examples of a training subject using voiced speech across two or more sampling contexts (e.g., sitting, walking, jumping up and down, other actions, etc.). A mapping function may be inferred between two sampling contexts (e.g., sitting to walking), where the mapping function can be applied to silent speech training data sampled in the first sampling context to generate synthetic silent speech training data in the second sampling context. In some embodiments, synthetic training data may be generated by introducing artifacts and/or otherwise altering sampled training data.
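One simple realization of the context-mapping augmentation just described is sketched below, assuming paired voiced examples across two contexts. The per-channel linear fit is purely illustrative; the disclosure does not prescribe a specific mapping function.

```python
# Sketch of context-mapping augmentation: fit a mapping from context A to
# context B on paired voiced data, then apply it to silent data from context A.
import numpy as np

def fit_context_mapping(voiced_ctx_a: np.ndarray, voiced_ctx_b: np.ndarray):
    """Fit y = a*x + b per channel from paired voiced data (samples x channels)."""
    slopes, intercepts = [], []
    for ch in range(voiced_ctx_a.shape[1]):
        slope, intercept = np.polyfit(voiced_ctx_a[:, ch], voiced_ctx_b[:, ch], deg=1)
        slopes.append(slope)
        intercepts.append(intercept)
    return np.array(slopes), np.array(intercepts)

def apply_context_mapping(silent_ctx_a: np.ndarray, mapping) -> np.ndarray:
    """Generate synthetic silent-speech data in the second sampling context."""
    slopes, intercepts = mapping
    return silent_ctx_a * slopes + intercepts
```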
With further reference to
In some examples, ground truth audio signals (e.g., captured from a microphone or a video camera) may be converted to a text speech label (e.g., using ASR or converted manually). In other examples, ground truth videos may be converted to a text speech label (e.g., using automated lip reading or converted manually). For example, facial kinematics may be extracted from a ground truth video of a training subject when speaking during the training data collection. Lip reading may use the extracted facial kinematics to convert the video to a text speech label. Additionally, and/or alternatively, ground truth measurements may be used to validate, correct, and/or otherwise adjust another speech label. For example, a speech label including a prompt text may be corrected based on a ground truth measurement as will be further described in detail with reference to
As shown in
In some embodiments, labeled training data generated in one domain may be corrected by ground truth measurements collected in another domain. For example, as shown in
Returning to
In non-limiting examples, automatic speech recognition (ASR) may be used on sampled speech audio to detect the start/end time for each voiced segment (e.g., word, phrase, etc.), where the start/end time for each voiced segment may be used to determine the training data segment (e.g., EMG measurement) associated with the voiced segment. The ASR may be used concurrently while the speech audio is sampled. Alternatively, the ASR may be used after the speech audio is collected. In other non-limiting examples, lip reading (e.g., extracting facial kinematics from videos captured during the user speaking) may be used to detect the start/end time for each training data segment. The video may be captured using a speech input device having a camera integrated therein, such as on the sensor arm.
It is appreciated that the video may be captured in any other suitable manner, for example, from a camera on a desktop computer facing the user while the user is speaking. In other non-limiting examples, pause detection may be used to detect the start/end time of a training data segment. Pause detection may be applied to sensor data (e.g., speech audio from a microphone, EMG data from an EMG sensor, sensor data from an inertial sensor, etc., collected during a user's speech) to delineate a start/end time of a training data segment. It is appreciated that the training data segments, which are temporally aligned with speech labels, may be used to train the speech model to predict text from segmented signals associated with the user speaking (e.g., EMG signals), such as described in embodiments in
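The alignment of sensor-data segments to word-level timestamps could proceed as in the sketch below, assuming an earlier ASR or lip-reading step has already produced (word, start, end) tuples. The names and data format are illustrative, not any particular ASR library's output.

```python
# Sketch of slicing a sensor stream (e.g., EMG) into per-word training segments
# using word-level start/end times obtained from ASR, as described above.
def segment_sensor_data(sensor_samples, sample_rate_hz, word_timings):
    """Return one labeled segment per (word, start_s, end_s) timing tuple."""
    segments = []
    for word, start_s, end_s in word_timings:
        start_idx = int(start_s * sample_rate_hz)
        end_idx = int(end_s * sample_rate_hz)
        segments.append({"label": word, "samples": sensor_samples[start_idx:end_idx]})
    return segments

# Example: EMG sampled at 1000 Hz with hypothetical timings for two words.
segments = segment_sensor_data(list(range(5000)), 1000,
                               [("activate", 0.2, 0.9), ("speech", 1.1, 1.6)])
```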
Although embodiments are described for training a speech model using segmented training data, it is appreciated that segmentation of training data may be optional. For example, the speech label may be a text prompt of a phrase, where the training data associated with the user speaking (e.g., voiced, whispered, silently, etc.) may be labeled with the entire text prompt.
With further reference to
Although embodiments of dividing training data into target domain training data and source domain training data are shown in
Having thus described several aspects of at least one embodiment of the technology described herein, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of this disclosure and are intended to be within the spirit and scope of the disclosure. Further, though advantages of the technology described herein are indicated, it should be appreciated that not every embodiment of the technology described herein will include every described advantage. Some embodiments may not implement any features described as advantageous herein and in some instances one or more of the described features may be implemented to achieve further embodiments. Accordingly, the foregoing description and drawings are by way of example only.
The above-described embodiments of the technology described herein can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component, including commercially available integrated circuit components known in the art by names such as CPU chips, GPU chips, microprocessor, microcontroller, or co-processor. Alternatively, a processor may be implemented in custom circuitry, such as an ASIC, or semicustom circuitry resulting from configuring a programmable logic device. As yet a further alternative, a processor may be a portion of a larger circuit or semiconductor device, whether commercially available, semi-custom or custom. As a specific example, some commercially available microprocessors have multiple cores such that one or a subset of those cores may constitute a processor. However, a processor may be implemented using circuitry in any suitable format.
Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
In this respect, aspects of the technology described herein may be embodied as a computer readable storage medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments described above. As is apparent from the foregoing examples, a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the technology as described above. A computer-readable storage medium includes any computer memory configured to store software, for example, the memory of any computing device such as a smart phone, a laptop, a desktop, a rack-mounted computer, or a server (e.g., a server storing software distributed by downloading over a network, such as an app store). As used herein, the term "computer-readable storage medium" encompasses only a non-transitory computer-readable medium that can be considered to be a manufacture (i.e., article of manufacture) or a machine. Alternatively, or additionally, aspects of the technology described herein may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of processor-executable instructions that can be employed to program a computer or other processor to implement various aspects of the technology as described above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the technology described herein need not reside on a single computer or processor but may be distributed in a modular fashion among a number of different computers or processors to implement various aspects of the technology described herein.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, modules, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
Various aspects of the technology described herein may be used alone, in combination, or in a variety of arrangements not specifically described in the embodiments described in the foregoing and is therefore not limited in its application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.
Also, the technology described herein may be embodied as a method, of which examples are provided herein including with reference to
All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively.
The terms “approximately” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and within ±2% of a target value in some embodiments. The terms “approximately” and “about” may include the target value.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
This application is a Continuation-in-part of U.S. application Ser. No. 18/338,827, filed Jun. 21, 2023, entitled “WEARABLE SILENT SPEECH DEVICE, SYSTEMS, AND METHODS”, which is a Non-Provisional of Provisional (35 USC 119(e)) of U.S. Application Ser. No. 63/437,088, filed Jan. 4, 2023, entitled “SYSTEM AND METHOD FOR SILENT SPEECH DECODING”. The entire contents of these applications are incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
63437088 | Jan. 2023 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 18338827 | Jun. 2023 | US
Child | 18487627 | | US