This invention relates to a medical emergency alarm system. More particularly, this invention is in the field of voice-operated emergency alarm systems.
When an emergency occurs, especially for someone with a preexisting medical condition that unexpectedly worsens, survival can depend on how quickly medical help arrives. People in this group generally live normal lives outside the hospital but carry a mobile medical emergency alarm (the alarm) at all times, so that in an emergency the user can activate the alarm and send out emergency signals calling for help. A medical emergency alarm system usually includes at least one mobile, user-carried medical emergency alarm and one receiver located nearby. In the simplest setup, the system has one alarm and one receiver. When the user activates the alarm during an emergency, the alarm sends signals to the nearby receiver, which is similar in function and size to the base unit of a cordless phone and is in turn connected directly to the telephone or data network.
The receiver then transmits the emergency signals to an emergency monitoring center, where operators are on duty day and night to handle incoming emergency calls. From the received emergency signals, the operator can identify where and from whom the signals are coming and will try to contact the caller, usually by telephone, to investigate the incident further. If the operator cannot reach the caller, the operator will assume that an emergency has occurred and that the caller urgently needs help, and will therefore dispatch an ambulance to a predetermined location, presumably the caller's home.
The mobile medical emergency alarms currently on the market are either sensor-based or push-button based. A sensor-based alarm is equipped with sensors that monitor for specific abnormal conditions. For example, one sensor may monitor for sudden falls: if the user unexpectedly loses balance and falls, the impact activates the alarm, which sends an emergency signal to the monitoring center. Depending on the user's needs, a sensor-based alarm can be customized with different sensors to monitor different variables, such as body temperature, heart rate, or other vital signs. Once a sensor detects an abnormal condition, it invokes the alarm, and the alarm system automatically sends emergency signals to a designated monitoring center, which notifies the police or dispatches an ambulance to help the user. However, multi-purpose sensor-based alarms can be expensive. A push-button based alarm, on the other hand, requires the user to press a designated button on the alarm in an emergency to activate it and send the emergency signal to the monitoring center.
However, some abnormal conditions are not covered by the sensors built into a sensor-based alarm, and the holder of a push-button alarm may, for some reason, be unable to press the emergency button to ask for help. There is therefore a need for a voice-operated alarm system that can summon help when it is required. With such a system, the user can utter keywords to activate the alarm, which then sends emergency signals calling for help.
Recent advances in automatic speech recognition (ASR) and keyword spotting make it feasible to implement ASR and keyword spotting algorithms in a small device that can recognize keywords spoken by its user. A voice-operated alarm system with built-in ASR makes medical alarm systems more flexible and user-friendly. When the voice-operated device detects a predefined keyword or combination of keywords, such as “help, help”, or one or more special sounds from the user, it activates the alarm, which in turn sends emergency signals to a receiver. The receiver then automatically dials an operator at the emergency monitoring center. If needed, the voice-operated alarm can also include sensors and a push button to give users more options.
The invention, a voice-operated alarm system, includes an alarm and a receiver. The alarm comprises a microphone unit, a voice detector, a noise reduction unit, a speech recognizer or keyword spotter that can recognize predefined keywords, and a signal transmitter. The speech recognizer and the keyword spotter can be speaker dependent, speaker independent, or both; a speaker-dependent system requires training, while a speaker-independent system does not. The receiver is located near the alarm, is connected to a telephone or data network, and communicates with the emergency monitoring center. The alarm system can also be implemented as a wireless phone with a keyword spotting/recognition function. In this embodiment, the user can talk with the operators directly, as with a cell phone, but dialing is replaced by uttering keywords, and the receiver may not be necessary.
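The voice detector is listed above as a component of the alarm without further detail. One simple possibility is an energy-based detector that forwards a frame to the rest of the pipeline only when its energy rises clearly above the background. The sketch below assumes 8 kHz audio, non-overlapping 20 ms frames (160 samples), the quietest frame as a crude noise-floor estimate, and a 10 dB margin; the function names and all parameter values are hypothetical, not taken from the invention.

```python
import math

def frame_energies_db(samples, frame_len=160):
    """Mean energy, in dB, of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # floor avoids log(0)
    return energies

def detect_voice(samples, frame_len=160, margin_db=10.0):
    """Flag frames whose energy exceeds the quietest frame by margin_db.

    The quietest frame serves as a crude noise-floor estimate (an assumption).
    """
    energies = frame_energies_db(samples, frame_len)
    noise_floor = min(energies)
    return [e > noise_floor + margin_db for e in energies]
```

Only frames flagged `True` would need to be passed to the noise reduction and keyword spotting stages, which could save battery power in a small worn device.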
The input analog signals are collected by a microphone component or by a microphone array, which includes more than one microphone component. Each microphone component is coupled to an analog-to-digital converter (ADC). The ADCs convert the received analog voice signals into digital signals and forward them to an array signal processing unit, which processes the multiple channels of speech using an array signal processing algorithm and outputs one channel of speech with an improved signal-to-noise ratio (SNR) (Step 44). Many existing array signal processing algorithms, such as delay-and-sum, filter-and-sum, and adaptive algorithms, can be used to improve the SNR of the input signals. The delay-and-sum algorithm measures the delay on each microphone channel, aligns the channel signals, and sums them at every digital sampling point. Because the aligned speech signal is highly correlated across the channels, the summation reinforces it. The noise signals, by contrast, are weakly correlated or uncorrelated across the microphone channels, so adding the channels together cancels or reduces the noise.
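The delay-and-sum steps described above can be sketched as follows. This is a minimal illustration, assuming integer-sample delays estimated by brute-force cross-correlation against channel 0 within a hypothetical `max_lag` window; a practical implementation would likely use fractional delays and FFT-based correlation.

```python
def estimate_delay(ref, sig, max_lag):
    """Return the lag (in samples) at which sig best matches ref, i.e. sig[n] ~ ref[n - lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation of ref and sig at this lag (overlapping samples only).
        score = sum(ref[n - lag] * sig[n]
                    for n in range(len(sig)) if 0 <= n - lag < len(ref))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_and_sum(channels, max_lag=10):
    """Align every channel to channel 0, then average them sample by sample."""
    ref = channels[0]
    out = [0.0] * len(ref)
    for ch in channels:
        lag = estimate_delay(ref, ch, max_lag)
        for i in range(len(out)):
            j = i + lag  # undo this channel's delay
            if 0 <= j < len(ch):
                out[i] += ch[j]
    return [x / len(channels) for x in out]
```

Averaging (rather than plain summing) keeps the speech at its original level while uncorrelated noise power drops by roughly a factor equal to the number of microphones.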
The filter-and-sum algorithm is a generalization of the delay-and-sum algorithm: it has one digital filter in each input channel, plus one summation unit. In our invention, the array signal processor can be a linear or nonlinear device. In the nonlinear case, the filters can be replaced by a neural network or another nonlinear system. The filter parameters can be designed with existing algorithms or trained in a data-driven manner, similar to training a neural network for pattern recognition. In another implementation, the entire array signal processor can be implemented as a multi-input, single-output neural network whose parameters are trained on pre-collected or pre-generated training data.
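A minimal linear filter-and-sum sketch follows, assuming causal FIR filters whose taps are already known (how they are designed or trained is left open, as in the text). It also shows why delay-and-sum is a special case: choosing each filter as a pure delay reproduces the align-and-add operation.

```python
def fir_filter(x, taps):
    """Causal FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def filter_and_sum(channels, filters):
    """Filter each channel with its own FIR filter, then sum the results sample by sample."""
    if len(channels) != len(filters):
        raise ValueError("need exactly one filter per channel")
    filtered = [fir_filter(ch, taps) for ch, taps in zip(channels, filters)]
    return [sum(samples) for samples in zip(*filtered)]
```

For example, if the second channel lags the first by two samples, delaying the first channel with taps `[0.0, 0.0, 1.0]` and passing the second through `[1.0]` aligns them before summation. Replacing these fixed taps with trained ones, or the whole function with a neural network, gives the more general variants described above.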
Moreover, because the microphones in the array are spatially distributed at known locations relative to a common sound source, the array signal processing algorithm can weight the microphone outputs so that an acoustic beam is formed and steered toward the source of the sound, e.g., the speaker's mouth. A signal propagating from the direction of the beam is then reinforced, while sounds arriving from other directions are attenuated, so all the microphone components work together as an array to improve the SNR. The array can locate the sound source and follow its position with an adaptive algorithm. The output of the digital array signal processor is a single channel of digitized speech whose SNR has been improved by an array signal processing algorithm, with or without adaptation.
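Locating the source can be illustrated with a steered-response scan: steer the beam toward each candidate angle and keep the angle whose beamformed output has the most power. This sketch assumes a hypothetical uniform linear array (5 cm spacing), a far-field source, a 16 kHz sampling rate, delays rounded to whole samples, and an exhaustive scan in place of the adaptive tracking algorithm mentioned above.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, room-temperature air

def steering_delays(angle_deg, n_mics, spacing, fs):
    """Integer sample delays of a far-field plane wave across a uniform linear array."""
    s = math.sin(math.radians(angle_deg))
    return [round(m * spacing * s / SPEED_OF_SOUND * fs) for m in range(n_mics)]

def steered_power(channels, delays):
    """Mean power of the delay-and-sum output when steered with the given delays."""
    n = len(channels[0])
    start = max(0, -min(delays))
    stop = n - max(0, max(delays))
    total = 0.0
    for i in range(start, stop):
        y = sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        total += y * y
    return total / max(stop - start, 1)

def locate_source(channels, spacing=0.05, fs=16000, angles=range(-90, 91, 5)):
    """Return the candidate angle (degrees) with the highest steered power."""
    return max(angles, key=lambda a: steered_power(
        channels, steering_delays(a, len(channels), spacing, fs)))
```

Because delays are rounded to whole samples, neighboring angles can map to the same delay set, so the estimate is only as fine as the array geometry and sampling rate allow.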
Referring back to
In both keyword spotting and speech recognition, the input speech signal is first converted into acoustic features in the frequency domain. This step is called feature extraction (Step 48). Although any algorithm can be used in this step, we prefer auditory-based algorithms, which convert the input time-domain signal into frequency-domain feature vectors by simulating functions of the human auditory system. The noise reduction (Step 46) can be implemented independently or combined with this feature extraction step.
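As a stand-in for an auditory-based front end, the sketch below computes log mel filterbank energies for one frame: triangular filters spaced evenly on the mel scale, which approximates the frequency resolution of human hearing. This is a common auditory-motivated feature, not necessarily the algorithm preferred here. The 8 kHz sampling rate, 256-sample frame, and 20 filters are assumptions, and a naive DFT is used only to keep the example self-contained.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-windowed power spectrum (bins 0..N/2) via a naive DFT."""
    n = len(frame)
    w = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
         for i, x in enumerate(frame)]
    return [abs(sum(w[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def log_mel_energies(frame, sample_rate=8000, n_filters=20):
    """Log energies of triangular filters spaced evenly on the mel scale."""
    spec = power_spectrum(frame)
    n_fft = len(frame)
    top = hz_to_mel(sample_rate / 2.0)
    # DFT bin index of each filter edge (n_filters + 2 points from 0 Hz to Nyquist).
    edges = [int((n_fft + 1) * mel_to_hz(top * i / (n_filters + 1)) / sample_rate)
             for i in range(n_filters + 2)]
    feats = []
    for j in range(1, n_filters + 1):
        left, center, right = edges[j - 1], edges[j], edges[j + 1]
        energy = 0.0
        for k in range(left, right + 1):
            if k < center:
                weight = (k - left) / (center - left)    # rising edge
            elif k > center:
                weight = (right - k) / (right - center)  # falling edge
            else:
                weight = 1.0                             # filter peak
            energy += weight * spec[k]
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return feats
```

A 1 kHz tone, for instance, produces its largest feature in the filter centered near 1 kHz. A sequence of such vectors, one per frame, is what the keyword spotter or recognizer would consume.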
The speech features from Step 48 are then forwarded to a keyword spotter or speech recognizer unit 20. Keyword spotting detects keywords in the input speech signal, while a speech recognizer converts the input speech signal into text. When a keyword is spotted or recognized, a control signal can be transmitted from the alarm to the receiver 12 to dial an operator (Step 52).
The keyword spotting algorithm uses two kinds of statistical models: keyword models and garbage models. The keyword models represent the acoustic characteristics of the keywords, while the garbage models represent all other sounds, both speech and noise. During decoding with a search algorithm such as the Viterbi algorithm, the input feature vectors from Step 48 are compared with both the keyword models and the garbage models. The degree of match between a model and the feature vectors is measured during the search by computing likelihood scores or other kinds of scores. If the feature vectors match the keyword models better than the garbage models, a keyword is found; this invokes the alarm, which sends emergency signals to the receiver 12. Otherwise, no keyword is present, and the decoder keeps searching and comparing.
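The keyword-versus-garbage decision can be sketched with a Viterbi search. To keep it short, this sketch assumes one-dimensional features instead of feature vectors, a left-to-right keyword HMM with one Gaussian per state and fixed transition probabilities, a single-state Gaussian garbage model, and a decision on a pre-segmented stretch of frames using the per-frame log-likelihood ratio. All names, model shapes, and the zero threshold are hypothetical; a deployed spotter would run a keyword/filler network continuously over the input.

```python
import math

NEG_INF = float("-inf")

def gauss_logpdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def keyword_viterbi(frames, means, var, p_stay=0.5):
    """Best log score of a left-to-right HMM that starts in the first state and ends in the last."""
    n = len(means)
    if len(frames) < n:
        return NEG_INF  # too few frames to visit every state
    log_stay, log_next = math.log(p_stay), math.log(1.0 - p_stay)
    delta = [NEG_INF] * n  # delta[j]: best path score ending in state j
    delta[0] = gauss_logpdf(frames[0], means[0], var)
    for x in frames[1:]:
        new = [NEG_INF] * n
        for j in range(n):
            stay = delta[j] + log_stay
            move = delta[j - 1] + log_next if j > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[j] = best + gauss_logpdf(x, means[j], var)
        delta = new
    return delta[-1]

def garbage_score(frames, mean, var):
    """Log score of a one-state garbage model with a broad Gaussian."""
    return sum(gauss_logpdf(x, mean, var) for x in frames)

def keyword_spotted(frames, kw_means, kw_var, g_mean, g_var, threshold=0.0):
    """True if the keyword model beats the garbage model by more than threshold per frame."""
    kw = keyword_viterbi(frames, kw_means, kw_var)
    garbage = garbage_score(frames, g_mean, g_var)
    return (kw - garbage) / len(frames) > threshold
```

Raising `threshold` trades missed keywords for fewer false alarms; for an emergency alarm, a missed call for help is usually the costlier error.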
Speech recognition algorithms use phoneme or other sub-word models to represent the characteristics of spoken words. These models are pre-trained on labeled speech data. During decoding, the feature vectors from Step 48 are compared with the pre-trained acoustic models and with pre-trained language models, which represent the constraints of the language's grammar. Specifically, the feature vectors of an uttered keyword are compared with the acoustic models using a search or detection algorithm such as the Viterbi algorithm, and the degree of match between the models and the feature vectors is measured by computing likelihood scores or other kinds of scores. When the feature vectors match the acoustic models of the keyword, the keyword is found. Consequently, the alarm is invoked and sends emergency signals to the receiver 12.
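How the language model constrains recognition can be illustrated by rescoring a list of candidate transcriptions: each hypothesis W is scored as log P(X|W) + λ·log P(W), the standard combination of acoustic and language-model scores, and the recognized text is then checked for the keyword phrase. The bigram probabilities, the unseen-bigram floor (a crude substitute for proper smoothing), the acoustic scores, and the function names are all made up for illustration.

```python
import math

# Hypothetical bigram language model: log P(word | previous word).
BIGRAM_LOGPROB = {
    ("<s>", "help"): math.log(0.2),
    ("help", "help"): math.log(0.5),
    ("help", "</s>"): math.log(0.4),
    ("<s>", "hello"): math.log(0.3),
    ("hello", "hello"): math.log(0.01),
    ("hello", "</s>"): math.log(0.6),
}
UNSEEN_LOGPROB = math.log(1e-4)  # crude floor for bigrams not in the table

def lm_logprob(words):
    """Log probability of a word sequence under the bigram model."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB) for pair in zip(tokens, tokens[1:]))

def best_hypothesis(nbest, lm_weight=1.0):
    """nbest: list of (word list, acoustic log-likelihood). Returns the best word list."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]

def contains_keyword(words, keyword=("help", "help")):
    """True if the keyword phrase occurs as a contiguous run of words."""
    k = len(keyword)
    return any(tuple(words[i:i + k]) == keyword for i in range(len(words) - k + 1))
```

With `nbest = [(["hello", "hello"], -120.0), (["help", "help"], -121.0)]`, the acoustically better "hello hello" wins when `lm_weight=0`, but the grammar constraint makes "help help" the best hypothesis at `lm_weight=1.0`, and `contains_keyword` then triggers the alarm.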
The statistical acoustic models in either the keyword spotting algorithm or the speech recognition algorithm can be speaker dependent or speaker independent. Speaker-dependent models are trained on the user's own utterances of the keywords or other sounds, so the alarm works only for that particular user. Speaker-independent models are trained on many speakers' voices, so they generally match any user's voice and the alarm works for any user without training. A speaker-dependent alarm can be adapted from a speaker-independent alarm by asking the user to utter the keywords several times for training.
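The adaptation method is not specified above. One standard option is maximum a posteriori (MAP) adaptation of each Gaussian mean, which interpolates between the speaker-independent mean and the average of the user's enrollment frames. The sketch assumes the enrollment frames have already been aligned to the Gaussian being adapted (for example, by a Viterbi pass), and the prior weight `tau` is a hypothetical tuning value.

```python
def map_adapt_mean(prior_mean, samples, tau=10.0):
    """MAP estimate of a Gaussian mean vector.

    prior_mean: speaker-independent mean vector
    samples: the user's feature vectors aligned to this Gaussian
    tau: prior weight; larger values trust the speaker-independent model more
    """
    n = len(samples)
    return [(tau * m + sum(s[d] for s in samples)) / (tau + n)
            for d, m in enumerate(prior_mean)]
```

With few enrollment samples the result stays close to the speaker-independent model; as the user repeats the keywords more times, it moves toward the user's own voice.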
The control signals transmitted from the alarm to the receiver can use any frequency band, such as the cordless-phone bands, the Wi-Fi bands, or other wireless bands. The transmitted information can be encoded by any suitable method and for any purpose.
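As one example of encoding the control signal, the sketch below packs a device identifier and a message type into a small binary frame protected by a CRC-32 checksum, so the receiver can discard frames corrupted in transmission. The frame layout (4-byte ID, 1-byte type, 4-byte CRC, big-endian) and the message code are hypothetical; only `struct` and `zlib.crc32` from the Python standard library are used.

```python
import struct
import zlib

MSG_EMERGENCY = 0x01  # hypothetical message-type code
FRAME_LEN = 9         # 4-byte device ID + 1-byte type + 4-byte CRC-32

def encode_alarm_frame(device_id, msg_type=MSG_EMERGENCY):
    """Pack the device ID and message type, then append a CRC-32 of those bytes."""
    payload = struct.pack(">IB", device_id, msg_type)
    return payload + struct.pack(">I", zlib.crc32(payload))

def decode_alarm_frame(frame):
    """Return (device_id, msg_type), or None if the frame is malformed or corrupted."""
    if len(frame) != FRAME_LEN:
        return None
    payload = frame[:5]
    (crc,) = struct.unpack(">I", frame[5:])
    if zlib.crc32(payload) != crc:
        return None
    return struct.unpack(">IB", payload)
```

Including a device identifier lets the monitoring center tell from whom the signal is coming, as described earlier; encryption or error correction could be layered on in the same way.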
The alarm can also be implemented as a wireless phone in which keypad dialing is replaced by uttering keywords. In this implementation, the operator can talk with the user directly, and the receiver can be eliminated.