Speech detection from facial skin movements

Information

  • Patent Grant
  • Patent Number
    12,254,882
  • Date Filed
    Friday, February 2, 2024
  • Date Issued
    Tuesday, March 18, 2025
Abstract
Systems and methods are disclosed for speech detection from facial skin movements. In one implementation, a system may include at least one light source, at least one sensor configured to receive light reflections from the at least one light source, and a processor configured to control the at least one light source to illuminate a region of a face of a user in a time interval. The processor may receive reflection signals indicative of light reflected from the face in the time interval. The reflection signals may be analyzed to determine facial skin movements in the time interval. Then, based on the determined facial skin movements, the processor may determine a sequence of words associated with the facial skin movements, wherein determining the sequence of words includes using an artificial neural network and a motion-to-language analysis, and output a textual transcription corresponding with the determined sequence of words.
Description
FIELD OF THE INVENTION

The present invention relates generally to physiological sensing, and particularly to algorithms, methods and systems for sensing silent human speech.


BACKGROUND

The process of speech activates nerves and muscles in the chest, neck, and face. Thus, for example, electromyography (EMG) has been used to capture muscle impulses for purposes of silent speech sensing.


SUMMARY

An embodiment of the present invention that is described hereinafter provides a method for generating speech, which includes uploading a reference set of features that were extracted from sensed movements of one or more target regions of skin on faces of one or more reference human subjects in response to words articulated by the subjects and without contacting the one or more target regions. A test set of features is extracted from the sensed movements of at least one of the target regions of skin on a face of a test subject in response to words articulated silently by the test subject and without contacting the one or more target regions. The extracted test set of features is compared to the reference set of features, and, based on the comparison, a speech output is generated that includes the articulated words of the test subject.


In some embodiments, extracting the test features includes extracting the test features without vocalization of the words by the test subject.


In some embodiments, the test subject and at least one of the reference subjects are the same.


In an embodiment, extracting the test set of features includes irradiating the one or more target regions of the skin of the test subject with coherent light, and detecting changes in a sensed secondary coherent light pattern due to reflection of the coherent light from the one or more target regions.


In another embodiment, the uploaded reference set of features and the extracted test set of features each includes a respective waveform calculated for a respective location in a set of locations within the one or more target regions of the skin from a respective time sequence of an energy metric of the sensed secondary coherent light pattern that corresponds to the location.


In some embodiments, comparing the extracted features includes training and applying a machine learning (ML) algorithm to generate the speech output.


In some embodiments, generating the speech output includes synthesizing an audio signal corresponding to the speech output.


In some embodiments, using the speech output, background audio signals are cleaned from a voiced audio signal.


In an embodiment, generating the speech output includes generating text.


In another embodiment, generating the speech output includes, upon failing to distinguish in a given time interval between multiple candidate words with at least a predefined confidence level, generating the speech output for the given time interval by mixing audio of two or more of the candidate words.


In some embodiments, comparing the extracted test set of features to the reference set of features is performed using a trained artificial neural network (ANN), wherein the ANN was trained on a data set collected from a cohort of reference human subjects.


In some embodiments, the method further includes retraining the ANN using a data set collected from test subjects.


In some embodiments, the method further includes, using the sensed movements of at least one of the target regions of skin on a face of a test subject, indicating an intent of speech by the test subject.


In an embodiment, the sensed movements are acquired using an acquisition rate lower than 200 samples (e.g., frames) per second. In another embodiment, the sensed movements are acquired using an acquisition rate between 60 and 140 samples per second.


In general, the acquisition sample rate is lower than 200 samples per second, whatever type of signal is being sampled (e.g., coherent light, microwaves, ultrasound waves, etc.).


There is additionally provided, in accordance with another embodiment of the present invention, a method for synthesizing speech, the method including receiving input signals from a human subject that are indicative of intended speech by the human subject. The signals are analyzed to extract words corresponding to the intended speech, such that in at least some time intervals of the intended speech, multiple candidate phonemes are extracted together with respective probabilities that each of the candidate phonemes corresponds to the intended speech in a given time interval. Audible speech is synthesized responsively to the extracted phonemes, such that in the at least some of the time intervals, the audible speech is synthesized by mixing the multiple candidate phonemes responsively to the respective probabilities.


In some embodiments, the input signals include sensed movements of one or more target regions of skin on a face of the human subject in response to phonemes articulated by the subject and without contacting the one or more target regions.


In some embodiments, the input signals include at least one of signals received by irradiating the one or more target regions of the skin of the test subject with coherent light, with changes being detected in a sensed secondary coherent light pattern due to reflection of the coherent light from the one or more target regions, one or more optical lip-reading signals, EMG signals, EEG signals, and noisy audio signals.


There is further provided, in accordance with another embodiment of the present invention, a system for generating speech, the system including a memory and a processor. The memory is configured to store a reference set of features that were extracted from sensed movements of one or more target regions of skin on faces of one or more reference human subjects in response to words articulated by the subjects and without contacting the one or more target regions. The processor is configured to (i) upload from the memory the reference set of features, (ii) extract a test set of features from the sensed movements of at least one of the target regions of skin on a face of a test subject in response to words articulated silently by the test subject and without contacting the one or more target regions, and (iii) compare the extracted test set of features to the reference set of features, and, based on the comparison, generate a speech output including the articulated words of the test subject.


In some embodiments, the sensed movements are acquired by an optical sensing head and processing circuitry that are fitted inside a stem of wireless headphones.


There is furthermore provided, in accordance with yet another embodiment of the present invention, a system for synthesizing speech, the system including a sensor and a processor. The sensor is configured to receive input signals from a human subject that are indicative of intended speech by the human subject. The processor is configured to (a) analyze the signals to extract words corresponding to the intended speech, such that in at least some time intervals of the intended speech, multiple candidate phonemes are extracted by the processor together with respective probabilities that each of the candidate phonemes corresponds to the intended speech in a given time interval, and (b) synthesize audible speech responsively to the extracted phonemes, such that in the at least some of the time intervals, the audible speech is synthesized by the processor by mixing the multiple candidate phonemes responsively to the respective probabilities.


The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a schematic pictorial illustration of a system for silent speech sensing, in accordance with an embodiment of the invention;



FIG. 2 is a schematic pictorial illustration of a silent speech sensing device, in accordance with another embodiment of the invention;



FIG. 3 is a block diagram that schematically illustrates functional components of a system for silent speech sensing, in accordance with an embodiment of the invention;



FIG. 4 is a flow chart that schematically illustrates a method for silent speech sensing, in accordance with an embodiment of the invention;



FIG. 5 is a flow chart that schematically illustrates a method for training an Artificial Neural Network (ANN) to perform silent speech deciphering, in accordance with an embodiment of the invention;



FIG. 6 is a flow chart that schematically illustrates a method of using a trained ANN in inference to perform silent speech deciphering, in accordance with an embodiment of the invention;



FIG. 7 is a flow chart that schematically illustrates a method of preprocessing silent speech sensed data in preparation for speech deciphering, in accordance with an embodiment of the invention; and



FIG. 8 is a flow chart that schematically illustrates a method of generating an ambiguous speech output for an ambiguous silent speech input, in accordance with an embodiment of the invention.





DETAILED DESCRIPTION OF EMBODIMENTS

The widespread use of mobile telephones in public spaces creates audio quality issues. For example, when one of the parties in a telephone conversation is in a noisy location, the other party or parties may have difficulty in understanding what they are hearing due to background noise. Moreover, use in public spaces often raises privacy concerns, since conversations are easily overheard by passersby.


The human brain and neural activity are complex phenomena that involve many human subsystems. One of those subsystems is the facial region, which humans use to communicate with others. It is an innate activity that conveys several levels of meaning. At the core, humans communicate with language. The formation of concepts is closely related to the formation of words and then their language-dependent sounds. Humans train to articulate sounds from birth. Even before full language ability evolves, babies use facial expressions, including micro-expressions, to convey deeper information about themselves. The combined interaction with a person provides another value, which is trust. While trust in someone begins with their appearance, so that we know who we are talking with, their reactions can provide further assurance that the person is not incapacitated.


In the normal process of vocalization, motor neurons activate muscle groups in the face, larynx, and mouth in preparation for propulsion of air flow out of the lungs, and these muscles continue moving during speech to create words and sentences. Without this air flow, no sounds are emitted from the mouth. Silent speech occurs when the air flow from the lungs is absent, while the muscles in the face, larynx, and mouth continue to articulate the desired sounds. Silent speech can thus be intentional, for example when one articulates words but does not wish to be heard by others. This articulation can occur even when one conceptualizes spoken words without opening one's mouth. The resulting activation of the facial muscles gives rise to minute movements of the skin surface. The present disclosure builds on a system for sensing neural activity, with detection focused on the facial region, which allows the readout of residual muscular activation in that region. These muscles are involved in inter-human communication, such as the production of sounds, facial expressions (including micro-expressions), breathing, and other signs humans use for inter-person communication.


Embodiments of the present invention that are described herein enable users to articulate words and sentences without actually vocalizing the words or uttering any sounds at all. The inventors have found that by properly sensing and decoding these movements, it is possible to reconstruct reliably the actual sequence of words articulated by the user.


In some embodiments, a system comprising a wearable device and dedicated software tools deciphers data sensed from fine movements of the skin and subcutaneous nerves and muscles on a subject's face, occurring in response to words articulated by the subject with or without vocalization, and uses the deciphered words in generating a speech output including the articulated words. Details of devices and methods used in sensing the data from fine movements of the skin are described in the above-mentioned International Patent Application PCT/IB2022/054527.


The disclosed deciphering techniques enable users to communicate with others or to record their own thoughts silently, in a manner that is substantially imperceptible to other parties and is also insensitive to ambient noise.


Some embodiments use sensing devices having the form of common consumer wearable items, such as a clip-on headphone or spectacles. In these embodiments, an optical sensing head is held in a location in proximity to the user's face by a bracket that fits in or over the user's ear. The optical sensing head senses coherent light reflected from the face, for example by directing coherent light toward an area of the face, such as the cheek, and sensing changes in the coherent light pattern that arises due to reflection of the coherent light from the face. Processing circuitry in the device processes the signal output by the optical sensing head due to the reflected light to generate a corresponding speech output. In one embodiment, the optical sensing head and processing circuitry are fitted inside a stem of wireless headphones, such as AirPods. In that embodiment, sensing is performed slightly further away from the sensed skin location, and the sensor's viewing angle is typically narrow.


Alternatively, the disclosed deciphering technique can be used with a silent speech sensing module, including a coherent light source and sensors, that is integrated into a mobile communication device, such as a smartphone. This integrated sensing module senses silent speech when the user holds the mobile communication device in a suitable location in proximity to the user's face.


In one example, deciphering of silent speech is performed using a machine learning (ML) algorithm, such as a trained artificial neural network (ANN). In this example, image processing software converts acquired signals into preprocessed signals, and the trained ANN specifies speech words contained in the preprocessed signals. Different types of ANNs may be used, such as a classification NN that eventually outputs words, and a sequence-to-sequence NN which outputs a sentence (word sequence). To train the ANNs, at least several thousand examples should typically be gathered and augmented, as described below. This "global" training, which relies on a large group of persons (e.g., a cohort of reference human subjects), later allows the device of a specific user to perform fine adjustments of its deciphering software. In this manner, within minutes or less of wearing the device and turning on the application, the system (e.g., mobile phone and the wearable device) is ready for deciphering.


In many cases, speech recognition algorithms output some ambiguous results, as described below. In the case of a human-human real-time communication implementation, waiting for the sentence to complete before synthesizing the text to speech to mitigate the ambiguity would result in a significant delay that might not be acceptable. To solve this issue, the disclosed speech synthesizer is configured to quickly generate an ambiguous output for an ambiguous input, so as not to disrupt the natural flow of conversation. The ambiguity itself may still be resolved at a later stage. In some examples, a processor is used for synthesizing speech, by performing the steps of (i) receiving input signals from a human subject that are indicative of intended speech by the human subject, (ii) analyzing the signals to extract words corresponding to the intended speech, such that in at least some time intervals of the intended speech, multiple candidate phonemes are extracted together with respective probabilities that each of the candidate phonemes corresponds to the intended speech in a given time interval, and (iii) synthesizing audible speech responsively to the extracted phonemes, such that in the at least some of the time intervals, the audible speech is synthesized by mixing the multiple candidate phonemes responsively to the respective probabilities.


As described below, to perform step (i) the processor may run image processing software, and to perform step (ii) the processor may run a neural network. To perform step (iii) the processor may use a voice synthesizer.


The disclosed technique can also be used to measure the amount of neural activity from the sensed movements of at least one of the target regions of skin on a face of a test subject, and thereby to indicate an intent of speech by the test subject even before any speech occurs. Finally, in another embodiment, the disclosed technique improves the audio quality of conversations made by mobile telephones in loud public spaces, for example, by removing background signals from the audio.


System Description



FIG. 1 is a schematic pictorial illustration of a system 18 for silent speech sensing, in accordance with an embodiment of the invention. System 18 is based on a sensing device 20, in which a bracket, in the form of an ear clip 22, fits over the ear of a user 24 of the device. An earphone 26 attached to ear clip 22 fits into the user's ear. An optical sensing head 28 is connected by a short arm 30 to ear clip 22 (e.g., an AirPod) and thus is held in a location in proximity to the user's face. In the pictured embodiment, device 20 has the form and appearance of a clip-on headphone, with the optical sensing head in place of (or in addition to) the microphone.


Details of device 20, such as of interface and processing circuitries comprised in device 20, are described in the above-mentioned International Patent Application PCT/IB2022/054527.


Optical sensing head 28 directs one or more beams of coherent light toward different, respective locations on the face of user 24, thus creating an array of spots 32 extending over an area 34 of the face (and specifically over the user's cheek). In the present embodiment, optical sensing head 28 does not contact the user's skin at all, but rather is held at a certain distance from the skin surface. Typically, this distance is at least 5 mm, and it may be even greater, for example at least 1 cm or even 2 cm or more from the skin surface. To enable sensing the motion of different parts of the facial muscles, the area 34 covered by spots 32 and sensed by optical sensing head 28 typically has an extent of at least 1 cm2; and larger areas, for example at least 2 cm2 or even greater than 4 cm2, can be advantageous.


Optical sensing head 28 senses the coherent light that is reflected from spots 32 on the face and outputs a signal in response to the detected light. Specifically, optical sensing head 28 senses the secondary coherent light patterns that arise due to reflection of the coherent light from each of spots 32 within its field of view. To cover a sufficiently large area 34, this field of view typically has a wide angular extent, typically with an angular width of at least 60°, or possibly 70° or even 90° or more. Within this field of view, device 20 may sense and process the signals due to the secondary coherent light patterns of all of spots 32 or of only a certain subset of spots 32. For example, device 20 may select a subset of the spots that is found to give the largest amount of useful and reliable information with respect to the relevant movements of the skin surface of user 24.
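By way of illustration only, one simple way such a subset could be chosen is to rank the per-spot temporal signals by how much motion information they carry. The variance criterion, the function name, and the array layout below are assumptions made for this sketch, not a selection rule specified in this disclosure.

```python
import numpy as np

def select_spots(spot_signals: np.ndarray, k: int) -> np.ndarray:
    """Rank spots by the temporal variance of their energy traces and keep the
    k most informative ones. spot_signals has shape (n_spots, n_frames)."""
    variances = spot_signals.var(axis=1)      # proxy for useful skin motion
    return np.argsort(variances)[::-1][:k]    # indices of the k highest-variance spots
```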


Within system 18, processing circuitry processes the signal that is output by optical sensing head 28 to generate a speech output. As noted earlier, the processing circuitry is capable of sensing movements of the skin of user 24 and generating the speech output, even without vocalization of the speech or utterance of any other sounds by user 24. The speech output may take the form of a synthesized audio signal or a textual transcription, or both. In that regard, the silent speech detection can readily be implemented as a nerve-to-text application, such as, for example, directly transcribing silent speech into an email draft. The synthesized audio signal may be played back via the speaker in earphone 26 (and is useful in giving user 24 feedback with respect to the speech output). Additionally or alternatively, the synthesized audio signal may be transmitted over a network, for example via a communication link with a mobile communication device, such as a smartphone 36. Typically, the synthesis is completed at a different time than the voiced utterance would have happened. This timing can be shorter or longer, and the processor can find the timing difference. Such a timing difference may be utilized, as an example, when the synthesized voice is ready earlier than the voiced utterance would have happened, to provide a translation of the synthesized voice into another language, with the translated utterance output at the time the voiced utterance would have occurred.


The functions of the processing circuitry in system 18 may be carried out entirely within device 20, or they may alternatively be distributed between device 20 and an external processor, such as a processor in smartphone 36 running suitable application software. For example, the processing circuitry within device 20 may digitize and encode the signals output by optical sensing head 28 and transmit the encoded signals over the communication link to smartphone 36. This communication link may be wired or wireless, for example using the Bluetooth™ wireless interface provided by the smartphone. The processor in smartphone 36 processes the encoded signal in order to generate the speech output. Smartphone 36 may also access a server 38 over a data network, such as the Internet, in order to upload data and download software updates, for example. Details of the design and operation of the processing circuitry are described hereinbelow with reference to FIG. 3.


In the pictured embodiment, device 20 also comprises a user control 35, for example in the form of a push-button or proximity sensor, which is connected to ear clip 22. User control 35 senses gestures performed by the user, such as pressing on user control 35 or otherwise bringing the user's finger or hand into proximity with the user control. In response to the appropriate user gesture, the processing circuitry changes the operational state of device 20. For example, user 24 may switch device 20 from an idle mode to an active mode in this fashion, and thus signal that the device should begin sensing and generating a speech output. This sort of switching is useful in conserving battery power in device 20. Alternatively or additionally, other means may be applied in controlling the operational state of device 20 and reducing unnecessary power consumption, for example as described below with reference to FIG. 5. Moreover, a processor of device 20 can automatically switch from the idle mode to a high power consumption mode based on differing trigger types, such as a sensed input (e.g., eye blinks, a slightly open mouth, or a pre-set sequence of motions like tongue movement). Also, the user may activate the device using, for example, a touch button on the device or an application on a mobile phone.


In an optional embodiment, a microphone (not shown) may be included to sense sounds uttered by user 24, enabling user 24 to use device 20 as a conventional headphone when desired. Additionally or alternatively, the microphone may be used in conjunction with the silent speech sensing capabilities of device 20. For example, the microphone may be used in a calibration procedure, in which optical sensing head 28 senses movement of the skin while user 24 utters certain phonemes or words. The processing circuitry may then compare the signal output by optical sensing head 28 to the sounds sensed by the microphone in order to calibrate the optical sensing head. This calibration may include prompting user 24 to shift the position of optical sensing head 28 in order to align the optical components in the desired position relative to the user's cheek.



FIG. 2 is a schematic pictorial illustration of a silent speech sensing device 60, in accordance with another embodiment of the invention. In this embodiment, ear clip 22 is integrated with or otherwise attached to a spectacle frame 62. Nasal electrodes 64 and temporal electrodes 66 are attached to frame 62 and contact the user's skin surface. Electrodes 64 and 66 receive body surface electromyogram (sEMG) signals, which provide additional information regarding the activation of the user's facial muscles. The processing circuitry in device 60 uses the electrical activity sensed by electrodes 64 and 66 together with the output signal from optical sensing head 28 in generating the speech output from device 60.


Additionally or alternatively, device 60 includes one or more additional optical sensing heads 68, similar to optical sensing head 28, for sensing skin movements in other areas of the user's face, such as eye movement. These additional optical sensing heads may be used together with or instead of optical sensing head 28.



FIG. 3 is a block diagram that schematically illustrates functional components of system 18 for silent speech sensing, in accordance with an embodiment of the invention. The pictured system is built around the components shown in FIG. 1, including sensing device 20, smartphone 36, and server 38. Alternatively, the functions illustrated in FIG. 3 and described below may be implemented and distributed differently among the components of the system. For example, some or all of the processing capabilities attributed to smartphone 36 may be implemented in sensing device 20; or the sensing capabilities of device 20 may be implemented in smartphone 36.


Sensing device 20 transmits the encoded signals via a communication interface of the device, such as a Bluetooth interface, to a corresponding communication interface 77 in smartphone 36. In the present embodiment, the encoded output signals from sensing device 20 are received in a memory 78 of smartphone 36 and processed by a speech generation application 80 running on the processor in smartphone 36. Speech generation application 80 converts the features in the output signal to a sequence of words, in the form of text and/or an audio output signal. Communication interface 77 passes the audio output signal back to speaker 26 of sensing device 20 for playback to the user. The text and/or audio output from speech generation application 80 is also input to other applications 84, such as voice and/or text communication applications, as well as a recording application. The communication applications communicate over a cellular or Wi-Fi network, for example, via a data communication interface 86.


The encoding operations of device 20 and speech generation application 80 are controlled by a local training interface 82. For example, interface 82 may indicate to a processor of device 20 which temporal and spectral features to extract from the signals output by receiver module 48 and may provide speech generation application 80 with coefficients of a neural network, which converts the features to words. In the present example, speech generation application 80 implements an inference network, which finds the sequence of words having the highest probability of corresponding to the encoded signal features received from sensing device 20. Local training interface 82 receives the coefficients of the inference network from server 38, which may also update the coefficients periodically.


To generate the local training instructions provided by training interface 82, server 38 uses a data repository 88 containing coherent light (e.g., speckle) images and corresponding ground truth spoken words from a collection of training data 90. Repository 88 also receives training data collected from sensing devices 20 in the field. For example, the training data may comprise signals collected from sensing devices 20 while users articulate certain sounds and words (possibly including both silent and vocalized speech). This combination of general training data 90 with personal training data received from the user of each sensing device 20 enables server 38 to derive optimal inference network coefficients for each user.


Server 38 applies image analysis tools 94 to extract features from the coherent light images in repository 88. These image features are input as training data to a neural network 96, together with a corresponding dictionary 104 of words and a language model 100, which defines both the phonetic structure and syntactical rules of the specific language used in the training data. Neural network 96 generates optimal coefficients for an inference network 102, which converts an input sequence of feature sets, which have been extracted from a corresponding sequence of coherent light measurements, into corresponding phonemes and ultimately into an output sequence of words. Server 38 downloads the coefficients of inference network 102 to smartphone 36 for use in speech generation application 80.


Method For Speech Sensing



FIG. 4 is a flow chart that schematically illustrates a method for silent speech sensing, in accordance with an embodiment of the invention. This method is described, for the sake of convenience and clarity, with reference to the elements of system 18, as shown in FIGS. 1 and 4 and described above. Alternatively, the principles of this method may be applied in other system configurations, for example using sensing device 60 (FIG. 2) or a sensing device that is integrated in a mobile communication device.


As long as user 24 is not speaking, sensing device 20 operates in a low-power idle mode in order to conserve the power of its battery, at an idling step 410. This mode may use a low frame rate, for example twenty frames/sec. While device 20 operates at this low frame rate, it processes the images to detect a movement of the face that is indicative of speech, at a motion detection step 112. When such movement is detected, a processor of device 20 instructs an increase of the frame rate, for example to the range of 100-200 frames/sec, to enable detection of changes in the secondary coherent light (e.g., speckle) patterns that occur due to silent speech, at an active capture step 414. Alternatively or additionally, the increase in the frame rate may follow instructions received from smartphone 36.
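By way of illustration, the following is a minimal sketch of this idle/active switching logic, not the device firmware itself. The camera object with its set_frame_rate and read_frame methods, the frame rates, and the motion threshold are all assumptions made for the example.

```python
import numpy as np

IDLE_FPS, ACTIVE_FPS = 20, 150        # low-power idle rate vs. active capture rate
MOTION_THRESHOLD = 4.0                # assumed tuning constant for the speech trigger

def speech_motion_detected(prev_frame: np.ndarray, frame: np.ndarray) -> bool:
    """Crude trigger: mean absolute difference between consecutive frames."""
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    return float(diff.mean()) > MOTION_THRESHOLD

def capture_loop(camera):
    camera.set_frame_rate(IDLE_FPS)   # idle mode conserves battery power
    prev = camera.read_frame()
    while True:
        frame = camera.read_frame()
        if speech_motion_detected(prev, frame):
            camera.set_frame_rate(ACTIVE_FPS)   # active capture of speckle changes
            # ... stream high-rate frames to the feature-extraction stage ...
        prev = frame
```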


A processor of device 20 then extracts features of optical coherent light pattern motion, at a feature extraction step 420. Additionally or alternatively, the processor may extract other temporal and/or spectral features of the coherent light in the selected subset of spots. Device 20 conveys these features to speech generation application 80 (running on smartphone 36), which inputs vectors of the feature values to the inference network 102 that was downloaded from server 38, at a feature input step 422.


Based on the sequence of feature vectors that is input to the inference network over time, speech generation application 80 outputs a stream of words, which are concatenated together into sentences, at a speech output step 424. As noted earlier, the speech output is used to synthesize an audio signal, for playback via speaker 26. Other applications 84 running on smartphone 36 post-process the speech and/or audio signal to record the corresponding text and/or to transmit speech or text data over a network, at a post-processing step 426.


Deciphering of Detected Silent Speech


As described above, the deciphering of silent speech (i.e., analyzing acquired signals to extract words corresponding to intended speech) is performed by a chain of software tools, such as image processing software (e.g., tool 94) and an artificial neural network (ANN), such as NN 96. The image processing software converts acquired signals into preprocessed signals, and the ANN specifies intended speech words contained in the preprocessed signals. This section provides examples of deciphering methods and software tools that the disclosed technique may use. It covers the training and inference phases of an ANN (FIGS. 5 and 6, respectively), as well as the preprocessing phase (FIG. 7).



FIG. 5 is a flow chart that schematically illustrates a method for training an ANN to perform silent speech deciphering, in accordance with an embodiment of the invention. This method can be used to train, for example, two different ANN types: a classification neural network that eventually outputs words, and a sequence-to-sequence neural network which outputs a sentence (word sequence). The process begins at a data uploading step 502, with uploading, from a memory of server 38, pre-processed training data, such as the output of image analysis tool 94, that was gathered from multiple reference human subjects, e.g., during development.


The silent speech data is collected from a wide variety of people (people of varying ages, genders, ethnicities, physical disabilities, etc.). The number of examples required for learning and generalization is task dependent. For word/utterance prediction (within a closed group), at least several thousand examples were gathered. For the task of word/phoneme sequence prediction, the dataset size is measured in hours, and several thousand hours were gathered for transcription.


In a data augmentation step 504, a processor augments the image-processed training data to obtain more artificial data for the training process. In particular, the input here is an image-processed secondary coherent light pattern, with some of the image processing steps described below. Step 504 of data augmentation may include the following sub-steps: (i) time dropout, where amplitudes at random time points are replaced by zeros; (ii) frequency dropout, where the signal is transformed into the frequency domain and random frequency chunks are filtered out; (iii) clipping, where the maximum amplitude of the signal at random time points is clamped, adding a saturation effect to the data; (iv) noise addition, where Gaussian noise is added to the signal; and (v) speed change, where the signal is resampled to achieve a slightly slower or slightly faster signal.
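By way of illustration, the following is a minimal NumPy sketch of the five augmentation sub-steps listed above. All parameter values (dropout probabilities, chunk sizes, noise level, and resampling factor) are illustrative assumptions rather than values specified in this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_dropout(x, p=0.05):
    mask = rng.random(x.shape) > p            # zero out random time points
    return x * mask

def frequency_dropout(x, n_chunks=2, chunk=4):
    X = np.fft.rfft(x)
    for _ in range(n_chunks):                 # filter out random frequency chunks
        start = rng.integers(0, max(1, len(X) - chunk))
        X[start:start + chunk] = 0
    return np.fft.irfft(X, n=len(x))

def clipping(x, p=0.02, level=0.8):
    out = x.copy()
    idx = rng.random(x.shape) < p             # clamp the amplitude at random time points
    limit = level * np.abs(x).max()
    out[idx] = np.clip(out[idx], -limit, limit)
    return out

def add_noise(x, sigma=0.01):
    return x + rng.normal(0.0, sigma, x.shape)   # additive Gaussian noise

def speed_change(x, factor=1.05):
    n = int(len(x) / factor)                  # resample to a slightly faster/slower signal
    return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)
```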


At a feature extraction step 506, the augmented dataset goes through the feature extraction module. In this step the processor computes time-domain silent speech features. For this purpose, for example, each signal is split into low and high frequency components, x_low and x_high, and windowed to create time frames, using a frame length of 27 ms and a shift of 10 ms. For each frame, five time-domain features and nine frequency-domain features are computed, a total of 14 features per signal. The time-domain features are as follows:






$$\left[\;\frac{1}{n}\sum_{i}\bigl(x_{\mathrm{low}}[i]\bigr)^{2},\quad\frac{1}{n}\sum_{i}x_{\mathrm{low}}[i],\quad\frac{1}{n}\sum_{i}\bigl(x_{\mathrm{high}}[i]\bigr)^{2},\quad\frac{1}{n}\sum_{i}\bigl|x_{\mathrm{high}}[i]\bigr|,\quad\mathrm{ZCR}\bigl(x_{\mathrm{high}}\bigr)\;\right]$$





where ZCR is the zero-crossing rate. In addition, magnitude values from a 16-point short-time Fourier transform are used as the nine frequency-domain features. All features are normalized to zero mean and unit variance.
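By way of illustration, the per-frame feature computation described above may be sketched as follows. The sampling rate, the low/high band split at an assumed cutoff frequency (using SciPy for the filters), and the choice to apply the 16-point transform to the high-frequency component are assumptions made for this example.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 500                                   # assumed sample rate of the 1D skin-motion signal
FRAME = int(0.027 * FS)                    # 27 ms frame length
SHIFT = int(0.010 * FS)                    # 10 ms frame shift

def split_bands(x, cutoff_hz=40.0):
    """Split a signal into low and high frequency components (cutoff is assumed)."""
    sos_lo = butter(4, cutoff_hz, btype="low", fs=FS, output="sos")
    sos_hi = butter(4, cutoff_hz, btype="high", fs=FS, output="sos")
    return sosfilt(sos_lo, x), sosfilt(sos_hi, x)

def zcr(frame):
    """Zero-crossing rate of one frame."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def frame_features(x_low, x_high):
    feats = []
    for start in range(0, len(x_low) - FRAME + 1, SHIFT):
        lo, hi = x_low[start:start + FRAME], x_high[start:start + FRAME]
        time_feats = [np.mean(lo**2), np.mean(lo), np.mean(hi**2),
                      np.mean(np.abs(hi)), zcr(hi)]           # five time-domain features
        freq_feats = np.abs(np.fft.rfft(hi, n=16))            # nine magnitude values
        feats.append(np.concatenate([time_feats, freq_feats]))
    feats = np.asarray(feats)                                 # shape (n_frames, 14)
    return (feats - feats.mean(0)) / (feats.std(0) + 1e-8)    # zero mean, unit variance
```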


For an ANN training step 508, the processor splits the data into training, validation, and test sets. The training set is the data used to train the model. Hyperparameter tuning is done using the validation set, and final evaluation is done using the test set.


The model architecture is task dependent. Two different examples describe training two networks for two conceptually different tasks. The first is signal transcription, i.e., translating silent speech to text by word/phoneme/letter generation. This task is addressed by using a sequence-to-sequence model. The second task is word/utterance prediction, i.e., categorizing utterances uttered by users into a single category within a closed group. It is addressed by using a classification model.


The disclosed sequence-to-sequence model is composed of an encoder, which transforms the input signal into high-level representations (embeddings), and a decoder, which produces linguistic outputs (i.e., characters or words) from the encoded representations. The input to the encoder is a sequence of feature vectors, as described in the "feature extraction" module. It enters the first layer of the encoder, a temporal convolution layer, which downsamples the data to achieve good performance. The model may use on the order of a hundred such convolution layers.


Outputs from the temporal convolution layer at each time step are passed to three layers of bidirectional recurrent neural networks (RNN). The processor employs long short-term memory (LSTM) units in each RNN layer. Each RNN state is a concatenation of the state of the forward RNN with the state of the backward RNN. The decoder RNN is initialized with the final state of the encoder RNN (a concatenation of the final state of the forward encoder RNN with the first state of the backward encoder RNN). At each time step, it gets as input the preceding word, encoded one-hot and embedded in a 150-dimensional space with a fully connected layer. Its output is projected through a matrix into the space of words or phonemes (depending on the training data).
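The encoder/decoder structure described above can be sketched, for example, in PyTorch as below. The layer widths, vocabulary size, and use of a single convolution layer (the text contemplates on the order of a hundred) are illustrative simplifications and assumptions, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class SilentSpeechSeq2Seq(nn.Module):
    def __init__(self, n_features=14, hidden=256, vocab_size=1000, embed_dim=150):
        super().__init__()
        # temporal convolution that downsamples the feature sequence
        self.conv = nn.Conv1d(n_features, hidden, kernel_size=4, stride=2)
        # three bidirectional LSTM layers as the encoder
        self.encoder = nn.LSTM(hidden, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # decoder consumes the previously predicted word, embedded in a 150-D space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, 2 * hidden, num_layers=1, batch_first=True)
        self.project = nn.Linear(2 * hidden, vocab_size)   # into the word/phoneme space

    def forward(self, features, prev_tokens):
        # features: (batch, time, n_features); prev_tokens: (batch, out_len)
        x = self.conv(features.transpose(1, 2)).transpose(1, 2)
        enc_out, (h, c) = self.encoder(x)
        # concatenate forward/backward final encoder states to initialize the decoder
        h0 = torch.cat([h[-2], h[-1]], dim=-1).unsqueeze(0)
        c0 = torch.cat([c[-2], c[-1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(prev_tokens), (h0, c0))
        return self.project(dec_out)   # logits used to maximize log P(y_i | x, y*_<i)
```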


The sequence-to-sequence model conditions the next step prediction on the previous prediction. During learning, a log probability is maximized:







$$\max_{\theta}\;\sum_{i}\log P\bigl(y_{i}\,\bigm|\,x,\;y^{*}_{<i};\,\theta\bigr)$$






where y*<i is the ground truth of the previous predictions. The classification neural network is composed of the encoder as in the sequence-to-sequence network and an additional fully connected classification layer on top of the encoder output. The output is projected into the space of the closed word set, and the scores are translated into probabilities for each word in the dictionary.
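A corresponding sketch of the closed-group classification variant is shown below, again with illustrative sizes; the mean pooling over time and the softmax at the output are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn

class SilentSpeechClassifier(nn.Module):
    """Closed-group word/utterance classifier: encoder plus a fully connected layer."""
    def __init__(self, n_features=14, hidden=256, n_words=50):
        super().__init__()
        self.conv = nn.Conv1d(n_features, hidden, kernel_size=4, stride=2)
        self.encoder = nn.LSTM(hidden, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.classify = nn.Linear(2 * hidden, n_words)    # project onto the closed word set

    def forward(self, features):                          # features: (batch, time, n_features)
        x = self.conv(features.transpose(1, 2)).transpose(1, 2)
        enc_out, _ = self.encoder(x)                      # (batch, time', 2*hidden)
        logits = self.classify(enc_out.mean(dim=1))       # pool over time, then classify
        return torch.softmax(logits, dim=-1)              # probability per dictionary word
```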


The results of the entire procedure above are two types of trained ANNs, expressed as computed coefficients for an inference network 102. The coefficients are stored (step 510) in a memory of server 38.


In day-to-day use, training interface 82 receives up-to-date coefficients of inference network 102 from server 38, which may also update the coefficients periodically. The coefficients of inference network 102 are stored in a memory of earpiece device 20 or in memory 78 of smartphone 36. The first ANN task is signal transcription, i.e., translating silent speech to text by word/phoneme/letter generation. The second ANN task is word/utterance prediction, i.e., categorizing utterances uttered by users into a single category within a closed group. These networks are plugged into the system to work as part of it, as described below with reference to FIG. 6.


Finally, the training session is used for optimizing the selection and priority of the locations of secondary coherent light on the face of user 24 to analyze. In a selection updating step 512, the processor updates the list and order of use of such locations.



FIG. 6 is a flow chart that schematically illustrates a method of using a trained ANN in inference to perform silent speech deciphering, in accordance with an embodiment of the invention. Such a trained ANN may be inference network 102. The process begins with, for example, the processor in smartphone 36 running suitable application software that uploads inference network 102, at a trained ANN uploading step 602.


At a silent speech preprocessing step 604, a processor of sensing device 20 receives silent speech signals and preprocesses them using, for example, image processing software included in device 20.


At a silent speech feature extraction step 606, the processor of sensing device 20 extracts silent speech features from the preprocessed silent speech signals, as described with reference to FIG. 7.


At a silent speech feature receiving step 608, smartphone 36 receives the encoded signals via communication interface 77. In the present embodiment, the encoded signals of step 606 from sensing device 20 are received in memory 78 of smartphone 36.


At a silent speech inference step 610, the extracted features are processed by speech generation application 80 running on the processor in smartphone 36. Speech generation application 80 runs inference network 102 to convert the features in the output signal to a sequence (612) of words. These words may subsequently be output in the form of text and/or an audio output signal (e.g., using a voice synthesizer).



FIG. 7 is a flow chart that schematically illustrates a method of preprocessing silent speech sensed data in preparation for speech deciphering, in accordance with an embodiment of the invention. The process begins with a processor of device 20 receiving a frame from a camera of device 20 that captures secondary coherent light reflections from the cheek skin area at a high frame rate (e.g., 500 fps), at a frame receiving step 702.


For each frame, the raw image is transferred to an image processing algorithm that extracts the skin motion at a set of pre-selected locations on the user's face. The number of locations to inspect is an input to the algorithm. The locations on the skin that are extracted for coherent light processing are taken from a predetermined list that a processor uploads (704) from memory. The list specifies anatomical locations, for example: cheek above mouth, chin, mid-jaw, cheek below mouth, high cheek, and back of cheek. Furthermore, the list is dynamically updated with more points on the face that are extracted during the training phase (in step 512 of FIG. 5). The entire set of locations is ordered in descending order such that any subset of the list (in order) minimizes the word error rate (WER) with respect to the chosen number of locations that are inspected.


At a coherent light spot selection step 706, the processor selects the locations to analyze according to the list provided in step 704.


At a cropping step 708, the processor crops the frame around each of the coherent light spots that were extracted, and the algorithm processes each spot. Typically, the process of coherent light spot processing involves reducing by two orders of magnitude the size of the full-frame image (of ˜1.5 MP) that is taken with the camera with a very short exposure. The exposure is dynamically set and adapted so as to capture only coherent light reflections and not skin segments; for daylight and a green laser this is found to be around 1/4000 seconds. The image is therefore mostly empty (the cheek skin appears as black regions) and includes the laser point that forms a secondary coherent light pattern. In the preprocessing phase, the laser point (e.g., speckle) region is identified, and the image is cropped, so that the algorithms run only on this region. For example, the processor reduces the full image (1.5 MP) to an 18 KP image, which immediately accelerates processing time for the remainder of the algorithm.
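By way of illustration, the cropping step could be sketched as follows. The intensity threshold and the window size (chosen here to give roughly the 18 KP patch mentioned above) are assumptions made for the example, not values specified in this disclosure.

```python
import numpy as np

def crop_speckle_spot(frame: np.ndarray, half_size: int = 67, thresh: int = 30):
    """Reduce a ~1.5 MP frame to a small patch (~18 KP) around the laser spot.
    The frame is mostly dark, so bright pixels are assumed to belong to the spot."""
    ys, xs = np.nonzero(frame > thresh)
    if len(ys) == 0:
        return None                              # no spot visible in this frame
    cy, cx = int(ys.mean()), int(xs.mean())      # centroid of the speckle region
    y0, x0 = max(cy - half_size, 0), max(cx - half_size, 0)
    return frame[y0:cy + half_size, x0:cx + half_size]
```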


The image processing algorithm may be executed on a CPU, GPU, or hardware logic embedded within or adjacent to the camera sensor chip, in order to eliminate the flow of high-bandwidth data in the system. This may reduce the total power consumption of the device. Moreover, the preprocessed data corresponds to the physical vibrations and movements of the user's facial skin (at each location) and is thus at a much lower bandwidth than the raw images, a few hundred samples per second for each location.


Once the image processor has identified the region of interest within the coherent light spot, it improves the image contrast by removing noise, using a threshold to determine black pixels, and then computes (710) a characteristic metric of the coherent light, such as a scalar speckle energy measure, e.g., an average intensity. Step 710 includes the steps described in box 711: after identifying the coherent light pattern at step 706 and cropping it at step 708, the image is further reduced to a predefined fraction (e.g., ⅓) of the radius of the coherent light spot (which amounts to reducing the aforementioned 18 KP image to only 2 KP), upon which the metric is calculated, e.g., as the average intensity of the 2 KP pixels.


Analyzing changes over time in the measure (e.g., in average speckle intensity) by the processor is one example of detection of changes in the secondary coherent light patterns. Alternatively, other metrics may be used, such as the detection of specific coherent light patterns. Finally, a sequence of values of this scalar energy metric is calculated frame-by-frame and aggregated (712), giving a 1D temporal signal.
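A sketch of steps 710-712 is shown below, under the same caveats: the noise threshold is an assumption, and the inner region is approximated by a square covering roughly one third of the spot radius (reducing an ~18 KP patch to roughly 2 KP, as in the example above).

```python
import numpy as np

def spot_energy(patch: np.ndarray, noise_thresh: int = 20, radius_frac: float = 1/3) -> float:
    """Scalar speckle energy metric for one cropped spot (steps 710-711)."""
    cleaned = np.where(patch > noise_thresh, patch, 0)    # suppress black/noise pixels
    h, w = cleaned.shape
    cy, cx = h // 2, w // 2
    r = int(min(h, w) * radius_frac / 2)                  # keep the inner fraction of the spot
    inner = cleaned[cy - r:cy + r, cx - r:cx + r]
    return float(inner.mean())                            # e.g., average intensity

def temporal_signal(patches) -> np.ndarray:
    """Aggregate the per-frame metric into the 1D signal used for deciphering (step 712)."""
    return np.array([spot_energy(p) for p in patches])
```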


The 1D temporal signals are the preprocessed signals that are stored for use in silent speech deciphering, as described above with reference to FIGS. 5 and 6.


The accuracy of the word detection process described above is optimized using a combination of the following concepts:


1) Personalized Algorithm Parameters.


During normal speech of the user, the system simultaneously samples the user's voice and the facial movements. Automatic speech recognition (ASR) and natural language processing (NLP) algorithms are applied to the actual voice, and the outcome of these algorithms is used for optimizing the parameters of the motion-to-language algorithms. These parameters include the weights of the various neural networks, as well as the spatial distribution of laser beams for optimal performance. For subjects with speech disorders who have intact nervous systems and muscle fibers, transfer learning techniques can be used in applying results acquired from subjects who are capable of speech.


2) Limiting the Word Set.


Limiting the output of the algorithms to a pre-defined word set significantly increases the accuracy of word detection in cases of ambiguity, where two different words result in similar movements of the skin.


The used word set can be personalized over time, adjusting the dictionary to the actual words used by the specific user, with their respective frequency and context.


3) Context Optimized Word Set.


Including the context of the conversation in the input of the words and sentences extraction algorithms increases the accuracy by eliminating out-of-context options. The context of the conversation is understood by applying Automatic speech recognition (ASR) and Natural Language Processing (NLP) algorithms on the other side's voice.


Voice Synthesis


The information that is extracted from the inner/silent speech can be used in various ways, e.g., (1) human-machine communication (e.g., personal assistant/"Alexa"-type devices) and (2) human-human communication (e.g., phone calls).


For human-human communication the system generates a synthetic voice of the user based on the inner speech and transmits this synthetic voice to the other side's device. Alternatively, human-human communication can be made via 3rd party applications such as instant messaging apps, in which case, the inner speech is converted into text and transmitted to the other side's device.


In many cases, speech recognition algorithms result in some ambiguous results. For example:


The user says the word "justice," and the classification algorithm predicts with 50% certainty that the articulated word was "justice," 30% "practice," and 20% "lattice." In other implementations of NLP algorithms, the algorithm selects the right word based on the context of the whole sentence, which in many cases is revealed only after the ambiguous word has been said.


In the case of a human-human real-time communication implementation, waiting for the sentence to complete before synthesizing the text to speech will result in a significant delay that might not be acceptable.


To solve this issue, the speech synthesizer is configured to generate an ambiguous output for an ambiguous input. Examples of ambiguous input are confusing words/phonemes. For example, the system may not fully determine whether the user said “down” or “town”. The unit of work in this case is therefore the sound (called “phoneme”) and not an entire word. In response, the system generates and transmits speech that is a mixture of the two candidate phonemes. In the above example, if the system is not certain whether the user said “down” or “town” then the resulting sound will be a mixture of “t” and “d” at the appropriate times.


To further illustrate the concept, for the above example consisting of the words "justice," "practice," and "lattice," the synthesizer will not send a clear "justice" word, although it is the option with the highest probability. Instead, the algorithm will create a sound that is a combination of the articulated words, weighted as 50% "justice," 30% "practice," and 20% "lattice," the same probabilities as the input. This implementation transfers the ambiguity to the other person's brain, to be resolved at a later time, after the sentence is complete.


The algorithm for generating ambiguous words comprises the following two steps, illustrated in the sketch after the list:

    • a) Time scaling to make all words the same length of time
    • b) Weighted average of the sound waveform
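By way of illustration, these two steps could be sketched as follows; the candidate waveforms are assumed to be 1D floating-point arrays produced by the synthesizer, and the function names are hypothetical.

```python
import numpy as np

def time_scale(wave: np.ndarray, target_len: int) -> np.ndarray:
    """Step (a): resample a waveform so every candidate word spans the same duration."""
    old_t = np.linspace(0.0, 1.0, len(wave))
    new_t = np.linspace(0.0, 1.0, target_len)
    return np.interp(new_t, old_t, wave)

def mix_candidates(waveforms, probabilities) -> np.ndarray:
    """Step (b): weighted average, e.g. 0.5*'justice' + 0.3*'practice' + 0.2*'lattice'."""
    target_len = max(len(w) for w in waveforms)
    scaled = np.stack([time_scale(w, target_len) for w in waveforms])
    weights = np.asarray(probabilities, dtype=float)
    weights = weights / weights.sum()                     # probabilities serve as weights
    return np.tensordot(weights, scaled, axes=1)          # ambiguous audio output
```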



FIG. 8 is a flow chart that schematically illustrates a method of generating an ambiguous speech output for an ambiguous silent speech input, in accordance with an embodiment of the invention. The process begins at a word generation step 802, at which multiple candidate words are extracted by speech generation application 80 together with respective probabilities that each of the candidate words corresponds to the intended speech in a given time interval.


Next, at a word synthesizing step 804, a processor synthesizes the extracted words into audio signals (e.g., 1D sound waveforms). At a time-scaling step 806, the processor, such as a processor of a voice synthesizer (e.g., one example of application 84), time-scales similar words so that all words sound over the same time duration within a given time interval. Finally, at a sound mixing step 808, the audible speech is synthesized into an ambiguous audio output by mixing the multiple words responsively to the respective probabilities (e.g., by the processor performing a weighted summation of the sound waveform amplitudes, with the probabilities serving as respective weights).


It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.

Claims
  • 1. A light sensing system for determining textual transcription from minute facial skin movements, the system comprising: at least one light source; at least one sensor configured to receive light reflections from the at least one light source; at least one processor configured to: control the at least one light source to illuminate a region of a face of a user; receive from the at least one sensor, reflection signals indicative of light reflected from the face in a time interval; analyze the reflection signals to determine minute facial skin movements in the time interval; based on the determined minute facial skin movements in the time interval, determine a sequence of words associated with the minute facial skin movements, wherein determining the sequence of words includes using an artificial neural network and a motion-to-language analysis; and output a textual transcription corresponding with the determined sequence of words.
  • 2. The light sensing system of claim 1, wherein the minute facial skin movements are associated with a vocalization of the sequence of words in a first language, and wherein the at least one processor is further configured to translate the textual transcription to a language other than the first language.
  • 3. The light sensing system of claim 1, wherein the at least one processor is further configured to generate speech output based on the facial skin movements.
  • 4. The light sensing system of claim 1, wherein the at least one processor is further configured to generate an email draft from the textual transcription.
  • 5. The light sensing system of claim 1, wherein the at least one processor is further configured to generate a synthesized audio signal based on the facial skin movements.
  • 6. The light sensing system of claim 1, wherein analyzing the reflection signals to determine minute facial skin movements includes detection of changes in speckle patterns that occur due to the minute facial skin movements.
  • 7. The light sensing system of claim 1, wherein the reflection signals were received during a conversation and determining the sequence of words involves determining a context of the conversation.
  • 8. The light sensing system of claim 1, wherein the system is associated with a microphone.
  • 9. A light sensing system for determining speech based on minute facial skin movements, the sensing system comprising: a housing configured to be worn on a head of a user and to be supported by an ear of the user; at least one light source associated with the housing and configured to direct light towards a facial region of the head; at least one sensor associated with the housing and configured to receive light source reflections from the facial region and to output associated reflection signals; at least one processor configured to: receive the reflection signals from the at least one sensor in a time interval; analyze the reflection signals to identify minute facial movements in the time interval, wherein analyzing the reflection signals to identify the minute facial skin movements includes detecting changes in speckle patterns that occur due to the minute facial skin movements; decipher the minute facial movements to determine associated speech; and generate output of the associated speech.
  • 10. The light sensing system of claim 9, wherein the housing is configured, when worn, to assume an aiming direction of the at least one light source for illuminating a portion of a cheek of the user.
  • 11. The light sensing system of claim 9, wherein the housing is associated with an earphone.
  • 12. The light sensing system of claim 11, wherein the earphone has an arm extending therefrom and wherein the at least one light source is located in the arm.
  • 13. The light sensing system of claim 9, further comprising a microphone configured to sense sounds uttered by the user.
  • 14. The light sensing system of claim 13, wherein the at least one processor is configured to calibrate spoken words detected by the microphone with the identified minute facial movements.
  • 15. A light sensing system for determining silent speech based on minute facial skin movements, the light sensing system comprising: at least one wearable light source configured to direct light towards a portion of a cheek of a user; at least one wearable sensor configured to receive light source reflections from the portion of the cheek and to output associated reflection signals; at least one processor operable in an idle mode and in a high power mode; wherein in the idle mode, the at least one processor is configured to receive the reflection signals from the at least one sensor, process the reflection signals to identify, from reflection signals associated with at least one movement in the portion of the cheek, at least one trigger, and automatically switch to the high power mode upon identification of the at least one trigger; wherein in the high power mode, the at least one processor is configured to analyze the reflection signals from the portion of the cheek to identify facial movements associated with silent speech; and wherein following the identification of the facial movements associated with the silent speech, the at least one processor is configured to decipher the facial movements in the portion of the cheek and generate an output associated with the silent speech.
  • 16. The light sensing system of claim 15, wherein the facial movements associated with silent speech include minute movements manifest in cheek skin of the portion of the cheek, and wherein the at least one processor is configured to generate the output associated with the silent speech by deciphering the movements manifest in the cheek skin.
  • 17. The light sensing system of claim 15, wherein the system is associated with a microphone.
  • 18. The light sensing system of claim 15, wherein a frame rate in the idle mode is lower than a frame rate in the high power mode.
  • 19. The light sensing system of claim 15, wherein the at least one processor is further configured to determine a sequence of words from the identified and deciphered facial movements.
  • 20. The light sensing system of claim 19, wherein determining the sequence of words involves using a neural network associated with personal training data. (An illustrative sketch of such a personalized network follows the claims.)
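
Claims 6 and 9 turn on detecting changes in speckle patterns caused by minute facial skin movements. The following is a minimal, hypothetical sketch (not the claimed implementation) of one way such decorrelation could be measured, assuming the sensor delivers a stack of grayscale speckle frames as a NumPy array; the names `speckle_motion_signal` and `detect_skin_movements` are illustrative only.

```python
import numpy as np

def speckle_motion_signal(frames: np.ndarray) -> np.ndarray:
    """Per-frame motion metric from a stack of speckle images.

    frames: array of shape (T, H, W), grayscale speckle frames from the sensor.
    The metric is 1 minus the Pearson correlation between consecutive frames,
    so it rises when skin movement decorrelates the speckle pattern.
    """
    flat = frames.reshape(frames.shape[0], -1).astype(np.float64)
    flat -= flat.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(flat, axis=1) + 1e-12
    corr = np.sum(flat[:-1] * flat[1:], axis=1) / (norms[:-1] * norms[1:])
    motion = 1.0 - corr
    return np.concatenate([[0.0], motion])   # pad so output length matches input

def detect_skin_movements(frames: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Boolean mask marking frames whose speckle decorrelation exceeds a threshold."""
    return speckle_motion_signal(frames) > threshold
```

A decorrelation value near zero indicates a static speckle pattern; larger values indicate that skin movement has shifted the pattern between consecutive frames.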
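Claims 15 and 18 describe an idle mode with a lower frame rate that switches to a high power mode when a cheek-movement trigger is identified. Below is a toy sketch of that switching logic; the `sensor`, `decoder`, and `emit` interfaces and all numeric settings are assumptions for illustration, not values taken from the patent.

```python
import numpy as np

IDLE_FPS = 5          # assumed: idle-mode frame rate (lower than the high-power rate, per claim 18)
HIGH_POWER_FPS = 200  # assumed: high-power-mode frame rate
TRIGGER_LEVEL = 0.05  # assumed: decorrelation level treated as a movement trigger
QUIET_FRAMES = 400    # assumed: consecutive still frames before returning to idle

def frame_decorrelation(prev: np.ndarray, frame: np.ndarray) -> float:
    """1 minus the Pearson correlation between two speckle frames (0 = identical pattern)."""
    a = prev.astype(np.float64).ravel() - prev.mean()
    b = frame.astype(np.float64).ravel() - frame.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(1.0 - np.dot(a, b) / denom)

def run_controller(sensor, decoder, emit) -> None:
    """Toy idle/high-power loop (interfaces are assumptions, not the patent's API).

    sensor.read(fps)  -> next speckle frame captured at the requested frame rate
    decoder.decode(f) -> text deciphered from a list of high-rate frames
    emit(text)        -> delivers the output associated with the silent speech
    """
    mode = "idle"
    prev = sensor.read(IDLE_FPS)
    buffer, quiet = [], 0
    while True:
        fps = IDLE_FPS if mode == "idle" else HIGH_POWER_FPS
        frame = sensor.read(fps)
        motion = frame_decorrelation(prev, frame)
        prev = frame
        if mode == "idle":
            if motion > TRIGGER_LEVEL:            # cheek movement acts as the trigger
                mode, buffer, quiet = "high_power", [frame], 0
        else:
            buffer.append(frame)
            quiet = quiet + 1 if motion <= TRIGGER_LEVEL else 0
            if quiet >= QUIET_FRAMES:             # speech segment appears to have ended
                emit(decoder.decode(buffer))
                mode, buffer = "idle", []
```

The point of the sketch is structural: frames are sampled sparsely until a movement trigger appears, and densely only while a silent-speech segment is being captured.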
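Claims 14, 19, and 20 involve a neural network, associated with personal training data, that turns deciphered facial movements into words, with a microphone available for calibration. The sketch below is a minimal illustration only, assuming PyTorch, per-frame motion features, and a small per-user command vocabulary; `MotionToWords`, `personalize`, and the GRU architecture are invented for the example and are not the claimed model.

```python
import torch
from torch import nn

class MotionToWords(nn.Module):
    """Tiny GRU classifier mapping a motion-feature sequence to one of N known words.

    Hypothetical stand-in for the claimed network; input features might be, e.g.,
    per-frame speckle-decorrelation values from several facial locations.
    """

    def __init__(self, n_features: int, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) -> logits over the personal vocabulary
        _, h = self.encoder(x)
        return self.head(h[-1])

def personalize(model: MotionToWords, pairs, epochs: int = 20, lr: float = 1e-3):
    """Fine-tune on personal calibration pairs: (motion-feature sequence, word index)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, word_idx in pairs:
            opt.zero_grad()
            logits = model(features.unsqueeze(0))        # add batch dimension
            loss = loss_fn(logits, torch.tensor([word_idx]))
            loss.backward()
            opt.step()
    return model
```

A sequence of words, as in claim 19, could then be produced by applying the classifier to successive motion segments, and the calibration pairs could be collected, in the spirit of claim 14, by labeling motion segments with words recognized from the microphone signal while the user speaks aloud.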
CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 18/181,787, filed on Mar. 10, 2023, which is a continuation of International Application No. PCT/IB2022/056418, filed on Jul. 12, 2022, which claims the benefit of U.S. Provisional Application No. 63/229,091, filed on Aug. 4, 2021, and is a continuation-in-part of International Application No. PCT/IB2022/054527, filed on May 16, 2022, all of which are incorporated herein by reference in their entirety.

US Referenced Citations (192)
Number Name Date Kind
5826234 Lyberg Oct 1998 A
5943171 Budd et al. Aug 1999 A
5995856 Mannheimer et al. Nov 1999 A
6219640 Basu et al. Apr 2001 B1
6272466 Harada et al. Aug 2001 B1
6598006 Honda et al. Jul 2003 B1
7027621 Prokoski Apr 2006 B1
7222360 Miller May 2007 B1
7859654 Hartog Dec 2010 B2
8082149 Schultz et al. Dec 2011 B2
8200486 Jorgensen et al. Jun 2012 B1
8410355 Meguro et al. Apr 2013 B2
8638991 Zalevsky et al. Jan 2014 B2
8792159 Zalevsky et al. Jul 2014 B2
8860948 Abdulhalim et al. Oct 2014 B2
8897500 Syrdal et al. Nov 2014 B2
8970348 Evans et al. Mar 2015 B1
9129595 Russell et al. Sep 2015 B2
9199081 Zalevsky et al. Dec 2015 B2
9263044 Cassidy et al. Feb 2016 B1
9288045 Sadot et al. Mar 2016 B2
9668672 Zalevsky et al. Jun 2017 B2
9680983 Schuster et al. Jun 2017 B1
9916433 Schwarz et al. Mar 2018 B2
10299008 Catalano et al. May 2019 B1
10335041 Fixler et al. Jul 2019 B2
10398314 Zalevsky et al. Sep 2019 B2
10431241 Cho et al. Oct 2019 B2
10489636 Chen et al. Nov 2019 B2
10529113 Sheikh Jan 2020 B1
10529355 Rakshit et al. Jan 2020 B2
10529360 Cho et al. Jan 2020 B2
10592734 Klett Mar 2020 B2
10614295 Kim et al. Apr 2020 B2
10679644 Rakshit et al. Jun 2020 B2
10838139 Zalevsky et al. Nov 2020 B2
10867460 Miller et al. Dec 2020 B1
10878818 Kapur et al. Dec 2020 B2
10931881 Zalevsky et al. Feb 2021 B2
11114101 Mossinkoff et al. Sep 2021 B2
11169176 Zalevsky et al. Nov 2021 B2
11257493 Vasconcelos et al. Feb 2022 B2
11341222 Caffey May 2022 B1
11343596 Chappell, III et al. May 2022 B2
11467659 Bikumandla et al. Oct 2022 B2
11538279 Nduka et al. Dec 2022 B2
11605376 Hoover Mar 2023 B1
11609633 Alcaide et al. Mar 2023 B2
11636652 Kaehler Apr 2023 B2
11682398 Im et al. Jun 2023 B2
11709548 Tadi et al. Jul 2023 B2
11744376 Schmidt et al. Sep 2023 B2
11893098 Lawrenson et al. Feb 2024 B2
20020065633 Levin May 2002 A1
20030123712 Dimitrova et al. Jul 2003 A1
20040240712 Rowe et al. Dec 2004 A1
20040243416 Gardos Dec 2004 A1
20040249510 Hanson Dec 2004 A1
20060206724 Schaufele Sep 2006 A1
20060287608 Dellacorna Dec 2006 A1
20070047768 Gordon et al. Mar 2007 A1
20080043025 Isabelle et al. Feb 2008 A1
20080103769 Schultz et al. May 2008 A1
20080177994 Mayer Jul 2008 A1
20080216171 Sano et al. Sep 2008 A1
20090082642 Fine Mar 2009 A1
20090233072 Harvey et al. Sep 2009 A1
20100141663 Becker et al. Jun 2010 A1
20100328433 Li Dec 2010 A1
20110160622 Mcardle et al. Jun 2011 A1
20110307241 Waibel et al. Dec 2011 A1
20120040747 Auterio et al. Feb 2012 A1
20120209603 Jing Aug 2012 A1
20120284022 Konchitsky Nov 2012 A1
20130177885 Kirkpatrick Jul 2013 A1
20130178287 Yahav Jul 2013 A1
20130300573 Brown et al. Nov 2013 A1
20140126743 Petit May 2014 A1
20140206980 Lee et al. Jul 2014 A1
20140375571 Hirata Dec 2014 A1
20150117830 Faaborg Apr 2015 A1
20150253502 Fish et al. Sep 2015 A1
20150356981 Johnson et al. Dec 2015 A1
20160004059 Menon et al. Jan 2016 A1
20160011063 Zhang et al. Jan 2016 A1
20160027441 Liu et al. Jan 2016 A1
20160034252 Chabrol Feb 2016 A1
20160086021 Grohman et al. Mar 2016 A1
20160093284 Begum et al. Mar 2016 A1
20160100787 Leung et al. Apr 2016 A1
20160116356 Goldstein Apr 2016 A1
20160150978 Yuen et al. Jun 2016 A1
20160314781 Schultz et al. Oct 2016 A1
20160374577 Baxi et al. Dec 2016 A1
20160379638 Basye et al. Dec 2016 A1
20160379683 Sandrew et al. Dec 2016 A1
20170068839 Fukuda Mar 2017 A1
20170084266 Bronakowski et al. Mar 2017 A1
20170209047 Zalevsky et al. Jul 2017 A1
20170222729 Sadot et al. Aug 2017 A1
20170231513 Presura et al. Aug 2017 A1
20170245796 Zalevsky et al. Aug 2017 A1
20170263237 Green et al. Sep 2017 A1
20170374074 Stuntebeck Dec 2017 A1
20180020285 Zass Jan 2018 A1
20180025750 Smith et al. Jan 2018 A1
20180048954 Förstner et al. Feb 2018 A1
20180070839 Ritscher et al. Mar 2018 A1
20180107275 Chen et al. Apr 2018 A1
20180132766 Lee et al. May 2018 A1
20180149448 Stolov May 2018 A1
20180158450 Tokiwa Jun 2018 A1
20180232511 Bakish Aug 2018 A1
20180292523 Orenstein et al. Oct 2018 A1
20180306568 Holman et al. Oct 2018 A1
20180333053 Verkruijsse et al. Nov 2018 A1
20190012528 Wilson et al. Jan 2019 A1
20190029528 Tzvieli et al. Jan 2019 A1
20190041642 Haddick Feb 2019 A1
20190074012 Kapur et al. Mar 2019 A1
20190074028 Howard Mar 2019 A1
20190080153 Kalscheur et al. Mar 2019 A1
20190086316 Rice et al. Mar 2019 A1
20190096147 Park et al. Mar 2019 A1
20190189145 Rakshit et al. Jun 2019 A1
20190197224 Smits Jun 2019 A1
20190198022 Varner et al. Jun 2019 A1
20190277694 Sadot et al. Sep 2019 A1
20190340421 Boenapalli et al. Nov 2019 A1
20190348041 Cella et al. Nov 2019 A1
20200013407 Chae Jan 2020 A1
20200020352 Ito et al. Jan 2020 A1
20200034608 Nduka et al. Jan 2020 A1
20200075007 Kawahara et al. Mar 2020 A1
20200081530 Greenberg Mar 2020 A1
20200126283 Van Vuuren et al. Apr 2020 A1
20200152197 Penilla May 2020 A1
20200205707 Sanyal et al. Jul 2020 A1
20200237290 Einfalt et al. Jul 2020 A1
20200257785 Li et al. Aug 2020 A1
20200284724 Dholakia Sep 2020 A1
20200300970 Nguyen et al. Sep 2020 A1
20200319301 Qiu et al. Oct 2020 A1
20200342201 Wilhelm Oct 2020 A1
20200370879 Mutlu et al. Nov 2020 A1
20200383628 Borremans et al. Dec 2020 A1
20210027154 Zalevsky et al. Jan 2021 A1
20210035585 Gupta Feb 2021 A1
20210052368 Smadja et al. Feb 2021 A1
20210063563 Zalevsky et al. Mar 2021 A1
20210072153 Zalevsky et al. Mar 2021 A1
20210169333 Zalevsky et al. Jun 2021 A1
20210172883 Zalevsky et al. Jun 2021 A1
20210195142 Mireles et al. Jun 2021 A1
20210209388 Ciftci et al. Jul 2021 A1
20210235202 Wexler et al. Jul 2021 A1
20210250696 Qi et al. Aug 2021 A1
20210255488 Piestun et al. Aug 2021 A1
20210256246 Dagdeviren et al. Aug 2021 A1
20210271861 Nduka et al. Sep 2021 A1
20210318558 Tzvieli et al. Oct 2021 A1
20210365533 Kaplan et al. Nov 2021 A1
20210386409 Clouse et al. Dec 2021 A1
20210407533 Cowburn et al. Dec 2021 A1
20220060230 Na et al. Feb 2022 A1
20220065617 Goodwin et al. Mar 2022 A1
20220067134 Wan Mar 2022 A1
20220078369 Bartha et al. Mar 2022 A1
20220084196 Ogawa et al. Mar 2022 A1
20220084529 He et al. Mar 2022 A1
20220099431 Chen et al. Mar 2022 A1
20220117558 Nicolae et al. Apr 2022 A1
20220125286 Zalevsky et al. Apr 2022 A1
20220132217 Aher et al. Apr 2022 A1
20220156485 Tzvieli et al. May 2022 A1
20220160296 Rahmani et al. May 2022 A1
20220163444 Zalevsky May 2022 A1
20220189131 Nouri et al. Jun 2022 A1
20220261465 Levitov Aug 2022 A1
20220309837 Boic et al. Sep 2022 A1
20220310109 Donsbach et al. Sep 2022 A1
20220391170 Kim et al. Dec 2022 A1
20230077010 Zhang et al. Mar 2023 A1
20230215437 Maizels et al. Jul 2023 A1
20230230574 Maizels et al. Jul 2023 A1
20230230575 Maizels et al. Jul 2023 A1
20230230594 Maizels et al. Jul 2023 A1
20230267914 Maizels et al. Aug 2023 A1
20230267942 Efros Aug 2023 A1
20230293084 Argyropoulos et al. Sep 2023 A1
20240211563 Khaleghimeybodi et al. Jun 2024 A1
20240212388 Li et al. Jun 2024 A1
Foreign Referenced Citations (11)
Number Date Country
105488524 Apr 2016 CN
1881319 Jan 2008 EP
3745303 Feb 2020 EP
2009013738 Jan 2009 WO
2013188343 Dec 2013 WO
2019017841 Jan 2019 WO
2021040747 Mar 2021 WO
2022034117 Feb 2022 WO
2002077972 Oct 2022 WO
2023012527 Feb 2023 WO
2023012546 Feb 2023 WO
Non-Patent Literature Citations (38)
Entry
Guzelsu et al., “Measurement of skin stretch via light reflection,” Journal of Biomedical Optics, vol. 8, No. 1, pp. 80-86, Jan. 2003.
Zhang et al., “SpeeChin: A Smart Necklace for Silent Speech Recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), vol. 5, No. 4, article No. 192, pp. 1-23, Dec. 2021.
Fleischman, “Smart necklace recognizes English, Mandarin commands,” Cornell Chronicle, pp. 1-3, Feb. 14, 2022.
Kalyuzhner et al., “Remote photonic detection of human senses using secondary speckle patterns,” Nature Portfolio, Scientific Reports, vol. 12, pp. 1-9, year 2022.
International Application # PCT/IB2022/054527 Search Report dated Aug. 30, 2022.
Makin et al., “Machine translation of cortical activity to text with an encoder-decoder framework,” Nature Neuroscience, vol. 23, No. 4, pp. 575-582, year 2020.
Gaddy et al., “Digital voicing of silent speech,” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 5521-5530, year 2020.
Wigdahl, “The Road Ahead for Speech Recognition Technology,” Coruzant Technologies, pp. 1-5, Aug. 2, 2021, as downloaded from https://coruzant.com/ai/the-road-ahead-for-speech-recognition-technology/TWIGDAHLW.
Holst, “Number of digital voice assistants in use worldwide 2019-2024,” Statista, pp. 1-1, Jan. 4, 2021.
Lui, S., “Are You Too Embarrassed to Use Siri, Cortana or Google Voice Commands in Public?,” LifeHacker Australia, pp. 1-6, Jun. 8, 2016, as downloaded from https://www.lifehacker.com.au/2016/06/are-you-embarrassed-to-use-siri-cortana-or-ok-google-in-public/.
Statista Research Department, “Global sales volume for true wireless hearables,” pp. 1-1, Jan. 15, 2021.
Cherrayil, “Augmented reality-based devices to replace smartphones in future,” Techchannel News, pp. 1-2, Sep. 6, 2021, as downloaded from https://techchannel.news/06/09/2021/ar-based-devices-set-to-replace-smartphones-in-future/.
Dulak et al., “Neuroanatomy, Cranial Nerve 7 (Facial),” NCBI Bookshelf, pp. 1-8, Jul. 24, 2023.
Nelson Longenbaker, “Mader's Understanding Human Anatomy & Physiology”, 9th Edition, McGraw Hill Education, pp. 1-513, year 2017.
Learneo, Inc., “Neuromuscular Junctions and Muscle Contractions,” Nursing Hero, Anatomy and Physiology I, Module 8: Muscle Tissue, pp. 1-20, year 2024.
Dainty (ed.), “Laser Speckle and Related Phenomena,” Topics in Applied Physics, vol. 9, pp. 1-295, year 1975.
Gaddy et al., “An Improved Model for Voicing Silent Speech,” Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Short Papers), pp. 175-181, Aug. 2021.
Fu et al., “Ultracompact meta-imagers for arbitrary all-optical convolution,” Light: Science & Applications, vol. 11, issue 62, pp. 1-13, year 2022.
McWhorter, “The world's most musical languages,” The Atlantic, pp. 1-9, Nov. 13, 2015.
Mingxing et al., “Towards optimizing electrode configurations for silent speech recognition based on high-density surface electromyography,” Journal of Neural Engineering, vol. 18, pp. 1-15, year 2021.
Cvetkovska, “26 Beard Statistics and Facts You Probably Didn't Know”, pp. 1-11, Jan. 1, 2021, as downloaded from https://web.archive.org/web/20210730125541/https://moderngentlemen.net/beard-statistics/.
Gillette, “10 facts on the science of beard growth”, pp. 1-5, year 2021, as downloaded from https://gillette.com/en-us/shaving-tips/how-to-shave/beard-growth-science.
Rietzler et al., “The male beard hair and facial skin—challenges for shaving,” Symposium presentation at the 23rd World Congress of Dermatology in Vancouver, Canada, pp. 1-19, Jun. 2015.
Janke et al., “EMG-to-Speech: Direct Generation of Speech From Facial Electromyographic Signals,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, issue 12, pp. 2375-2385, Dec. 1, 2017.
MIT Media Lab, “Fluid Interfaces—Project AlterEgo,” pp. 1-3, Jun. 21, 2021, as downloaded from https://web.archive.org/web/20210621110900/https://www.media.mit.edu/projects/alterego/overview/.
Brigham Young University, “Skeletal muscle: Whole muscle physiology—Motor units,” Anatomy & Physiology, pp. 1-7, year 2021.
Krans et al., “The sliding filament theory of muscle contraction,” Nature Education, vol. 3, issue 9, article No. 66, pp. 1-11, year 2010.
Warden et al., “Launching the Speech Commands Dataset”, pp. 1-3, Aug. 24, 2017, as downloaded from https://research.google/blog/launching-the-speech-commands-dataset/.
O'Neill et al., “SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,” arXiv:2104.02014v2, pp. 1-5, Apr. 6, 2021.
Jou et al., “Towards Continuous Speech Recognition Using Surface Electromyography,” Conference paper, Interspeech 2006—ICSLP—9th International conference on spoken language processing, pp. 1-4, Sep. 2006.
Ko et al., “Audio Augmentation for Speech Recognition,” Conference paper, Sixteenth annual conference of the international speech communication association, pp. 1-4, year 2015.
Nalborczyk et al., “Can we decode phonetic features in inner speech using surface electromyography?,” Plos One, pp. 1-16, May 27, 2020.
International Application # PCT/IB2022/056418 Search Report dated Oct. 31, 2022.
Nicolo et al., “The importance of respiratory rate monitoring: from health-care to sports and exercise,” Sensors, vol. 20, issue 21, article No. 6396, pp. 1-46, Nov. 2020.
Office Action from India Intellectual Property Office in Indian patent application No. 202447014135, mailed Jun. 20, 2024 (7 pages).
Chandrashekhar, V., “The Classification of EMG Signals Using Machine Learning for the Construction of a Silent Speech Interface.”
International Search Report and Written Opinion from International Application No. PCT/IB2022/054527 mailed Aug. 30, 2022 (7 pages).
International Search Report and Written Opinion from International Application No. PCT/IB2022/056418 mailed Oct. 31, 2022 (11 pages).
Related Publications (1)
Number Date Country
20240177713 A1 May 2024 US
Provisional Applications (1)
Number Date Country
63229091 Aug 2021 US
Continuations (2)
Number Date Country
Parent 18181787 Mar 2023 US
Child 18431563 US
Parent PCT/IB2022/056418 Jul 2022 WO
Child 18181787 US
Continuation in Parts (1)
Number Date Country
Parent PCT/IB2022/054527 May 2022 WO
Child PCT/IB2022/056418 US