The present disclosure relates to an electronic enclosure or encasement advantageously configured for an incubator or similar device, where excessive noise may be an issue. In particular, the present disclosure relates to an electronic enclosure including active noise control, and communication.
In U.S. patent application Ser. No. 11/952,250, referenced above and assigned to the assignee of the present application, techniques were disclosed for abating noise, such as snoring, in the vicinity of a human head by utilizing Adaptive Noise Control (ANC). More specifically, by utilizing a multiple-channel feed-forward ANC system using adaptive FIR filters with a 1×2×2 FXLMS algorithm, a noise suppression system may be particularly effective at reducing snoring noises. While noise suppression is desirable for adult humans, special requirements may apply in the cases of babies, infants, and other life forms that may have sensitivity to noise.
Newborn babies, and particularly premature, ill, and low birth weight infants, are often placed in special units, such as neonatal intensive care units (NICUs), where they require specific environments for medical attention. Devices such as incubators have greatly increased the survival of very low birth weight and premature infants. However, high levels of noise in the NICU have been shown to result in numerous adverse health effects, including hearing loss, sleep disturbance and other forms of stress. At the same time, an important relationship during infancy is the attachment or bonding to a caregiver, such as a mother and/or father. This is because this relationship may determine the biological and emotional ‘template’ for future relationships and well-being. It is generally known that healthy attachment to the caregiver through bonding experiences during infancy may provide a foundation for future healthy relationships. However, infants admitted to an NICU may lose such experiences in their earliest life due to limited interaction with their parents, caused by noise and/or limited means of communication. Therefore, it is important to reduce the noise level inside the incubator and increase bonding opportunities for NICU babies and their parents. In addition, there are advantages for newborns inside the incubators in hearing their mothers' voices, which can help relieve stress and improve language development. Communicating with NICU babies can also benefit the new mothers, for example by preventing postpartum depression and improving bonding.
Regarding communication, it would be advantageous to provide “cues” to a caregiver based on an infant's cry, so that the infant may be understood, albeit on a rudimentary level. These cues may be advantageous for interpreting a likely condition of the infant via its vocal communication. The airways of newborn infants are quite different from those of adults. The larynx in newborn infants is positioned close to the base of the skull. The high position of the larynx in the newborn is similar to its position in other animals and allows the newborn human to form a sealed airway from the nose to the lungs. The soft palate and epiglottis provide a “double seal,” and liquids can flow around the relatively small larynx into the esophagus while air moves through the nose, through the larynx and trachea into the lungs. The anatomy of the upper airways in newborn infants is “matched” to a neural control system (newborn infants are obligate nose breathers). They normally will not breathe through their mouths even in instances where their noses may be blocked. The unique configuration of the vocal tract is the reason for the extremely nasalized cry of the infant.
From one perspective, the increasing alertness and decreasing crying as part of the sleep/wakefulness cycle suggests that there may be a balanced exchange between crying and attention. The change from sleep/cry to sleep/alert/cry necessitates the development of control mechanisms to modulate arousal. The infant must increase arousal more gradually, in smaller increments, to maintain states of attention for longer periods. Crying is a heightened state of arousal produced by nervous system excitation triggered by some form of perceived threat, such as hunger, pain, or sickness, or individual differences in thresholds for stimulation. Crying is modulated and developmentally facilitated by control mechanisms to enable the infant to maintain non-crying states.
The cry serves as the primary means of communication for infants. While it is possible for experts (experienced parents and child care specialists) to distinguish infant cries through training and experience, it is difficult for new parents and for inexperienced child care workers to interpret infant cries. Accordingly, techniques are needed to extract audio features from the infant cry so that different communicated states for an infant may be determined. Cry Translator™, a commercially available product known in the art, claims to be able to identify five distinct cries: hunger, sleep, discomfort, stress and boredom. An exemplary description of the product may be found in US Pat. Pub. No. 2008/0284409, titled “Signal Recognition Method With a Low-Cost Microcontroller,” which is incorporated by reference herein. However, such configurations are less robust, provide limited information, are not necessarily suitable for NICU applications, and do not provide integrated noise reduction.
Accordingly, there is a need for infant voice analysis, as well as a need to couple voice analysis with noise reduction. Using an infant's cry as a diagnostic tool may play an important role in determining infant voice communication, and in determining emotional, pathological and even medical conditions, such as SIDS, problems in developmental outcome, colic, and medical problems, such as chromosomal abnormalities, in which early detection is otherwise possible only by invasive procedures. Additionally, related techniques are needed for analyzing medical problems which may be readily identified, but would benefit from an improved ability to define prognosis (e.g., prognosis of long term developmental outcome in cases of prematurity and drug exposure).
Under one exemplary embodiment, an enclosure, such as an incubator and the like, is disclosed comprising a noise cancellation portion, comprising a controller unit, configured to be operatively coupled to one or more error microphones and a reference sensing unit, wherein the controller unit processes signals received from the one or more error microphones and the reference sensing unit to reduce noise in an area within the enclosure using one or more speakers. The enclosure includes a communications portion, comprising a sound analyzer and transmitter, wherein the communications portion is operatively coupled to the noise cancellation portion, said communications portion being configured to receive a voice signal from the enclosure and transform the voice signal to identify characteristics thereof.
In another exemplary embodiment, a method is disclosed for providing noise cancellation and communication within an enclosure, where the method includes the steps of processing signals, received from one or more error microphones and a reference sensing unit, in a controller of a noise cancellation portion to reduce noise in an area within the enclosure using one or more speakers; receiving internal voice signals from the enclosure; transforming the internal voice signals; and identifying characteristics of the voice signals based on the sound analysis.
In a further exemplary embodiment, an enclosure is disclosed comprising a noise cancellation portion, comprising a controller unit, configured to be operatively coupled to one or more error microphones and a reference sensing unit, wherein the controller unit processes signals received from the one or more error microphones and the reference sensing unit to reduce noise in an area within the enclosure using one or more speakers; a communications portion, comprising a sound analyzer and transmitter, wherein the communications portion is operatively coupled to the noise cancellation portion, said communications portion being configured to receive a voice signal from the enclosure and transform the voice signal to identify characteristics thereof; and a voice input apparatus operatively coupled to the noise cancellation portion, wherein the voice input apparatus is configured to receive external voice signals for reproduction on the one or more speakers.
In still further exemplary embodiments, the communications/signal recognition portion described above may be configured to transform the voice signal from a time domain to a frequency domain, wherein the transformation comprises at least one of linear predictive coding (LPC), Mel-frequency cepstral coefficients (MFCC), Bark-frequency cepstral coefficients (BFCC) and short-time zero crossing. The communications portion may be further configured to identify characteristics of the transformed voice signal using at least one of a Gaussian mixture model (GMM), hidden Markov model (HMM), and artificial neural network (ANN). In yet another exemplary embodiment, the enclosure described above may include a voice input operatively coupled to the noise cancellation portion, wherein the voice input is configured to receive external voice signals for reproduction on the one or more speakers, wherein the noise cancellation portion is configured to filter the external voice signals to minimize interference with signals received from the one or more error microphones and the reference sensing unit for reducing noise in the area within the enclosure.
Other advantages will be readily appreciated as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings wherein:
As is known from U.S. patent application Ser. No. 11/952,250, noise reduction may be enabled in an electronic encasement comprising an encasement unit (e.g., pillow) in electrical connection with a controller unit and a reference sensing unit. The encasement unit may comprise at least one error microphone and at least one loudspeaker that are in electrical connection with the controller unit. Under a preferred embodiment, two error microphones may be used, positioned to be close to the ears of a subject (i.e., human). The error microphones may be configured to detect various signals or noises created by the user and relay these signals to the controller unit for processing. For example, the error microphones may be configured to detect speech sounds from the user when the electronic encasement is used as a hands-free communication device. The error microphones may also be configured to detect noises that the user hears, such as snoring or other environmental noises when the electronic encasement is used for ANC. A quiet zone created by ANC is centered at the error microphones. Accordingly, placing the error microphones inside the encasement below the user's ears, generally around a middle third of the encasement, may ensure that the user is close to the center of a quiet zone that has a higher degree of noise reduction.
Additionally, there may be one or more loudspeakers in the encasement, also preferably configured to be relatively close to the user's ears. More or fewer loudspeakers can be used depending on the desired function. Under a preferred embodiment, the loudspeakers are configured to produce various sounds. For example, the loudspeakers can produce speech sound when the electronic encasement acts as a hands-free communication device, and/or can produce anti-noise to abate any undesired noise. In another example, the loudspeakers can produce audio sound for entertainment or masking of residual noise. Preferably, the loudspeakers are small enough so as not to be noticeable. There are advantages to placing the loudspeakers relatively close to ears of a user, as the level of anti-noise generated by the loudspeakers is maximized compared to configurations where loudspeakers are placed in more remote locations. Lower noise levels also tend to reduce power consumption and reduce undesired acoustic feedback from the loudspeakers back to the reference sensing unit. The configurations described above may be equally applicable to enclosures, such as an incubator, as well as encasements. Also, it should be understood by those skilled in the art that use of the term “enclosure” does not necessarily mean that an area around noise cancellation is fully enclosed. Partial enclosures, partitions, walls, rails, dividers etc. are equally contemplated herein.
Turning to
In the embodiment of
Generally speaking, the algorithm(s) may control interactions between the error microphones, the loudspeakers, and reference microphones. Preferably, the algorithm(s) may be one of (a) multiple-channel broadband feed-forward active noise control for reducing noise, (b) adaptive acoustic echo cancellation, (c) signal detection to avoid recording silence periods and sound recognition for non-invasive detection, or (d) integration of active noise control and acoustic echo cancellation. Each of these algorithms is described more fully below. The DSP can also include other functions such as non-invasive monitoring using microphone signals and an alarm to alert or call caregivers for emergency situations.
The reference sensing unit includes at least one reference microphone. Preferably, the reference microphones are wireless for ease of placement, but they can also be wired. The reference microphones are used to detect the particular noise that is desired to be abated and are therefore placed near that sound. For example, if it is desired to abate noises in an enclosure from other rooms that can be heard through a door, the reference microphone may be placed directly on the door. The reference microphone may advantageously be placed near a noise source in order to minimize such noises near an enclosure. As will be described in further detail below, an enclosure equipped with noise-cancellation hardware may be used for a variety of methods in conjunction with the algorithms. For example, the enclosure can be used in a method of abating unwanted noise by detecting an unwanted noise with a reference microphone, analyzing the unwanted noise, producing an anti-noise corresponding to the unwanted noise in the enclosure, and abating the unwanted noise. Again, the reference microphone(s) may be placed wherever the noise to be abated is located. The reference microphones detect the unwanted noise, and the error microphones 20 detect the unwanted noise levels at the enclosure's location. Both the reference microphones and the error microphones send signals to the input channels 32 of the controller unit 14, where the signals are analyzed with an algorithm in the DSP, and signals are then sent from the output channels 40 to the loudspeakers. The loudspeakers then produce an anti-noise (which may be produced by an anti-noise generator) that abates the unwanted noise. With this method, the algorithm of multiple-channel broadband feed-forward active noise control for reducing noise is used to control the enclosure.
The enclosure can also be used in a method of communication by sending and receiving sound waves through the enclosure in connection with a communication interface. The method operates essentially as described above; however, the error microphones are used to detect speech and the loudspeakers may broadcast vocal sounds. With this method, the algorithm of adaptive acoustic echo cancellation for communications may be used to control the enclosure, as described above, and this algorithm can be combined with active noise control as well. The configuration for the enclosure may be used in a method of recording and monitoring disorders, by recording noises produced within the enclosure with microphones encased within a pillow. Again, this method operates essentially as described above; however, the error microphones are used to record sounds in the enclosure to diagnose sleep disorders. With this method, the algorithm of signal detection to avoid recording silence periods and sound recognition for non-invasive detection is used to control the enclosure.
The enclosure can further be used in a method of providing real-time response to emergencies by detecting a noise with a reference microphone in an enclosure, analyzing the noise, and providing real-time response to an emergency indicated by the analyzed noise. The method is performed essentially as described above. Certain noises detected are categorized as potential emergency situations, such as, but not limited to, the cessation of breathing, extremely heavy breathing, choking sounds, and cries for help. Detecting such a noise prompts the performance of real-time response action, such as producing a noise with the loudspeakers, or by notifying caregivers or emergency responders of the emergency. Notification can occur in conjunction with the communications features of the enclosure, i.e. by sending a message over telephone lines, wireless signal or by any other warning signals sent to the caregivers. The enclosure may also be used in a method of playing audio sound by playing audio sound through the loudspeakers of the enclosure. The audio sound can be any, such as soothing music or nature sounds. This method can also be used to abate unwanted noise, as the audio sound masks environmental noises. Also, by locating the loudspeakers inside the enclosure, lower volume can be used to play the audio sound.
Turning to
The 1×2×2 FXLMS algorithm may be summarized as follows:
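$$y_i(n) = w_i^T(n)\,x(n), \quad i = 1, 2 \tag{1}$$

$$w_1(n+1) = w_1(n) + \mu_1\left[e_1(n)\,x(n) * \hat{s}_{11}(n) + e_2(n)\,x(n) * \hat{s}_{21}(n)\right] \tag{2}$$

$$w_2(n+1) = w_2(n) + \mu_2\left[e_1(n)\,x(n) * \hat{s}_{12}(n) + e_2(n)\,x(n) * \hat{s}_{22}(n)\right] \tag{3}$$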
where w1(n) and w2(n) are coefficient vectors and μ1 and μ2 are the step sizes of the adaptive filters W1(z) and W2(z), respectively, * denotes linear convolution, and ŝ11(n), ŝ12(n), ŝ21(n) and ŝ22(n) are the impulse responses of the secondary path estimates Ŝ11(z), Ŝ12(z), Ŝ21(z), and Ŝ22(z), respectively.
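By way of illustration only, the update equations (1)-(3) may be sketched in Python as follows. The function name `fxlms_1x2x2`, the array layout `s[m, k]` (path from loudspeaker k to error microphone m), the filter length, and the sign convention that the anti-noise subtracts from the primary noise at the error microphones are assumptions made for this sketch and are not taken from the referenced application.

```python
import numpy as np

def fxlms_1x2x2(x, d, s, s_hat, taps=16, mu=0.01):
    """Sketch of a 1x2x2 FXLMS loop: 1 reference, 2 loudspeakers, 2 error mics.

    x     : reference signal, shape (N,)
    d     : primary noise at the two error microphones, shape (N, 2)
    s     : simulated true secondary paths, s[m, k] is an FIR response
    s_hat : secondary path estimates, same layout as s
    """
    n_samples = len(x)
    w = np.zeros((2, taps))               # w[k] holds the weights of W_k(z)
    x_buf = np.zeros(taps)                # reference history, newest sample first
    xs_buf = np.zeros(s_hat.shape[2])     # reference history for filtering by s_hat
    y_buf = np.zeros((2, s.shape[2]))     # loudspeaker output history
    fx_buf = np.zeros((2, 2, taps))       # fx_buf[m, k] = x(n) * s_hat[m, k]
    e = np.zeros((n_samples, 2))

    for n in range(n_samples):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = x[n]
        xs_buf = np.roll(xs_buf, 1)
        xs_buf[0] = x[n]

        # Eq. (1): y_i(n) = w_i^T(n) x(n)
        y = w @ x_buf
        y_buf = np.roll(y_buf, 1, axis=1)
        y_buf[:, 0] = y

        # Simulated acoustics (assumed convention): e_m = d_m - sum_k s_mk * y_k
        for m in range(2):
            e[n, m] = d[n, m] - sum(s[m, k] @ y_buf[k] for k in range(2))

        # Filtered reference signals x(n) * s_hat_mk(n)
        fx_buf = np.roll(fx_buf, 1, axis=2)
        for m in range(2):
            for k in range(2):
                fx_buf[m, k, 0] = s_hat[m, k] @ xs_buf

        # Eqs. (2) and (3): each filter uses both error signals
        for k in range(2):
            w[k] += mu * (e[n, 0] * fx_buf[0, k] + e[n, 1] * fx_buf[1, k])
    return e, w
```

In such a sketch, the residual at both error microphones would be expected to shrink as the filters adapt, provided the step size is small enough for the chosen filter length and secondary paths.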
Configurations directed to adaptive acoustic echo cancellation and integration of active noise control with acoustic echo cancellation are disclosed in U.S. patent application Ser. No. 11/952,250, and will not be repeated here for the sake of brevity. However, it should be understood by those skilled in the art that the techniques described therein may be applicable to the present disclosure, depending on the needs of the enclosure designer.
Turning to
The noise abatement of system 300 may be viewed as comprising four modules or units including (1) a noise control acoustic unit, (2) an electronic controller unit, (3) a reference sensors unit, and (4) a communication unit. The noise control acoustic unit includes one or more anti-noise loudspeakers 311, at least partially operated by anti-noise generator 306, and microphones (error microphone 307, and reference microphone 308), operatively coupled to an electronic controller which may be part of unit 306 and/or 301. The controller may include a power supply and amplifiers, a processor with memory, and input/output channels for performing signal processing tasks. The reference sensing unit may comprise wired or wireless microphones (308), which can be placed outside the incubator 310 for abating outside noise 311, or alternately on windows for abating environmental noises, or doors for reducing noise from other rooms, or on other known noise sources. The communication unit may include wireless or wired transmitters and receivers (302, 304) for communication purposes.
A general multi-channel ANC system suitable for the embodiment of
$$x(n) = \left[x_1^T(n)\;\; x_2^T(n)\;\; \ldots\;\; x_J^T(n)\right]^T$$

where xj(n) is the jth-channel reference signal vector of length L. The secondary sources have K channels, or

$$y(n) = \left[y_1(n)\;\; y_2(n)\;\; \ldots\;\; y_K(n)\right]^T,$$

where yk(n) is the signal of the kth output channel at time n. The error signals have M channels, or

$$e(n) = \left[e_1(n)\;\; e_2(n)\;\; \ldots\;\; e_M(n)\right]^T,$$

where em(n) is the error signal of the mth error channel at time n. Both the primary noise d(n) and the cancelling noise d′(n) are vectors with M elements at the locations of the M error sensors.
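The primary path impulse responses (402) can be expressed by a matrix as

$$P(n) = \begin{bmatrix} p_{11}(n) & \cdots & p_{1J}(n) \\ \vdots & \ddots & \vdots \\ p_{M1}(n) & \cdots & p_{MJ}(n) \end{bmatrix}$$

where pmj(n) is the impulse response function from the jth reference sensor to the mth error sensor. The matrix of secondary path impulse response functions (405) may be given by

$$S(n) = \begin{bmatrix} s_{11}(n) & \cdots & s_{1K}(n) \\ \vdots & \ddots & \vdots \\ s_{M1}(n) & \cdots & s_{MK}(n) \end{bmatrix}$$

where smk(n) is the impulse response function from the kth secondary source to the mth error sensor. An estimate of S(n), denoted as Ŝ(n) (401), can be similarly defined.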
Matrix A(n) may comprise the impulse response functions of the feed-forward adaptive finite impulse response (FIR) filters (403), which have J inputs, K outputs, and filter order L,

$$A(n) = \left[A_1^T(n)\;\; A_2^T(n)\;\; \ldots\;\; A_K^T(n)\right]^T,$$

where

$$A_k(n) = \left[A_{k,1}^T(n)\;\; A_{k,2}^T(n)\;\; \ldots\;\; A_{k,J}^T(n)\right]^T, \quad k = 1, 2, \ldots, K$$

is the weight vector of the kth feed-forward FIR adaptive filter with J input signals, defined as

$$A_{k,j}(n) = \left[a_{k,j,1}(n)\;\; a_{k,j,2}(n)\;\; \ldots\;\; a_{k,j,L}(n)\right]^T,$$

which is the feed-forward FIR weight vector from the jth input to the kth output.
The secondary sources may be driven by the summation (406) of the feed-forward and feedback filters outputs. That is
The error signal vector measured by M sensors is
where d(n) is the primary noise vector and y′(n) is the canceling signal vector at the error sensors.
The filter coefficients are iteratively updated to minimize a defined criterion. The sum of the mean square errors may be used as the cost function defined as
The least mean square (LMS) adaptive algorithm (404) uses a steepest descent approach to adjust the coefficients of the feed-forward and feedback adaptive FIR filters in order to minimize the cost function, as follows:

$$A(n+1) = A(n) - \mu_a X'(n)\,e(n)$$
where μa and μb are the step sizes for feedforward and feedback ANC systems, respectively. In another embodiment, different values may be used to improve convergence speed:
that is
The updated adaptive filter's coefficients can be expressed,
and it can be further expanded as
In addition to noise reduction, the embodiment of
Under one embodiment, direct-sequence spread spectrum (DS/SS) techniques may be used to conduct wireless communication. In another embodiment, orthogonal frequency-division multiplexing (OFDM) or ultra-wideband (UWB) techniques may be used. For DS/SS communications, each information symbol may be spread using a length-L spreading code. That is,
d(k)=v(n)c(n,l) (7)
where v(n) is the symbol-rate information bearing voice signal, and c(n, l) is the binary spreading sequence of the nth symbol. In one embodiment, c(n) is used instead of c(n, l) for simplicity. The received chip-rate matched filtered and sampled data sequence can be expressed as the product of the chip-rate sequence d(k) and its spatial signature h,
p(k)=d(k)h (8)
Within a symbol interval, after chip-rate processing, the received data becomes
r=p+w (9)
where the L by 1 vector p contains the signal of interest, and w is the white noise.
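As a rough sketch of equations (7)-(9), the spreading and despreading of a symbol stream may be illustrated in Python as follows. The function names, the use of ±1 symbols and a ±1 spreading code, and the reduction of the spatial signature h to a single scalar gain are simplifying assumptions for illustration only.

```python
import numpy as np

def spread(symbols, code):
    # Eq. (7): every symbol v(n) multiplies the whole length-L code c(l),
    # giving L chips per symbol.
    return np.outer(symbols, code).ravel()

def channel(chips, h, noise_std, rng):
    # Eqs. (8) and (9) with a scalar gain h (assumption) and white noise w.
    return h * chips + noise_std * rng.standard_normal(len(chips))

def despread(received, code):
    # Correlate each block of L received chips with the code and normalize.
    L = len(code)
    return received.reshape(-1, L) @ code / L

rng = np.random.default_rng(0)
code = rng.choice([-1.0, 1.0], size=31)      # hypothetical binary spreading code
symbols = rng.choice([-1.0, 1.0], size=8)    # hypothetical symbol stream
r = channel(spread(symbols, code), h=0.8, noise_std=0.5, rng=rng)
estimates = np.sign(despread(r, code))
```

Because the code correlates with itself to a value of L, despreading averages the noise over L chips while preserving the symbol, which is the processing gain that makes the approach attractive in noisy environments.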
An embodiment for combining/integrating ANC with the aforementioned communications is illustrated in
Using z-domain notation, Ev(z) can be expressed as

$$E_v(z) = D(z) - S(z)\left[Y(z) + V(z)\right], \tag{10}$$

where the actual error signal E(z) may be expressed as

$$E(z) = E_v(z) + \hat{S}(z)V(z). \tag{11}$$

Assuming that a perfect secondary-path model is available, i.e., Ŝ(z)=S(z), we have

$$E(z) = D(z) - S(z)Y(z). \tag{12}$$
This shows that the true error signal is obtained in the integrated ANC system, where the voice signal is removed from the signal ev(n) picked up by the error microphone. Therefore, the audio components won't degrade the performance of the noise control filter A(z). Thus, some of the advantages of the integrated ANC system are that (i) it provides audio comfort signal from the wireless communication devices, (ii) it masks residual noise after noise cancellation, (iii) it eliminates the interference of audio on the performance of ANC system, and (iv) it integrates with the existing ANC's audio hardware such as amplifiers and loudspeakers for saving overall system cost.
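The removal of the voice component from the error microphone signal can be checked numerically with a short sketch. It assumes that the true error is formed by adding the voice signal, filtered through the secondary-path estimate, back to the microphone signal, which is consistent with equations (10) and (12); the function name `true_error` and the FIR representation of the paths are illustrative assumptions.

```python
import numpy as np

def true_error(ev, v, s_hat):
    # Add back the voice as it would appear at the error microphone
    # according to the secondary-path estimate s_hat (assumed FIR).
    return ev + np.convolve(v, s_hat)[:len(ev)]

rng = np.random.default_rng(0)
n = 256
d = rng.standard_normal(n)           # primary noise at the error microphone
y = rng.standard_normal(n)           # anti-noise driving signal
v = rng.standard_normal(n)           # voice/audio signal sent to the loudspeaker
s = np.array([0.0, 0.9, 0.3, 0.1])   # hypothetical secondary path

# Eq. (10): the microphone picks up noise minus everything the loudspeaker plays.
ev = d - np.convolve(y + v, s)[:n]

# With a perfect model (s_hat == s) the voice cancels out, as in Eq. (12).
e = true_error(ev, v, s_hat=s)
```

With an imperfect estimate, a residual voice component proportional to the modeling error would remain in e(n) and could disturb the adaptation.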
A multiple-channel ANC system such as the one illustrated in
In addition to the audio signals being transmitted from the infant's incubator, sound analysis (303) can be performed on the emanating audio signal (e.g., cry, coo, etc.) in order to characterize a voice signal. Although it does not have a conventional language form, a baby cry (and similar voice communication) may be considered a kind of speech signal, the character of which is non-stationary and time varying. Under one embodiment, short-time analysis and a threshold method are used to detect the pair of boundary points (start point and end point) of each cry word. Feature extraction of each baby cry word is important in classification and recognition, and numerous algorithms can be used to extract features, such as linear predictive coding (LPC), Mel-frequency cepstral coefficients (MFCC), Bark-frequency cepstral coefficients (BFCC), and other frequency-domain extraction of stationary features. In this exemplary embodiment, a 10th-order Mel-frequency cepstral coefficient set (MFCC-10) having 10 coefficients is used as a feature pattern for each cry word. It should be understood by those skilled in the art that other numbers of coefficients may be used as well.
Once features are extracted, different statistical methods can be utilized to effect baby cry cause recognition, such as Gaussian Mixture Model (GMM), Hidden Markov Models (HMM), and Artificial Neural Network (ANN). In one embodiment discussed herein, ANN is utilized for baby cry cause recognition. An ANN imitates how human brain neurons work to perform certain tasks, and it can be considered a parallel processing network system with a large number of connections. An ANN can learn a rule from examples and generalize relationships between inputs and outputs, or in other words, find patterns in data. A Learning Vector Quantization (LVQ) model can be used to implement multi-class classification. The objective of using the LVQ ANN model for baby-cry-cause recognition is to develop a plurality (e.g., 3) of feature patterns which represent cluster centroids of each baby-cry-cause: draw attention cry, wet diaper cry, and hungry cry, as an example.
With regard to baby cry classification and recognition techniques, baby cry word boundary point detection may be advantageously employed. A speech signal of comprehensible length is typically a non-stationary signal that cannot be processed by stationary signal processing methods. However, during a limited short-time interval, the speech waveform can be considered stationary. Because of the physical limitation of human vocal cord vibration, in practical applications a 10-30 millisecond (ms) duration interval may be used to complete short-time speech analysis, although other intervals may be used as well. A speech signal may be thought of as comprising a voiced speech component with vocal cord vibration and an unvoiced speech component without vocal cord vibration. A cry word can be defined as the speech waveform duration between a start point and an end point of a voiced speech component. Voiced speech and unvoiced speech have different short-time characteristics, which can be used to detect the boundary points of baby cry words.
Short-time energy (STE) is defined as the average of the square of the sample values in a suitable window, which may be expressed as:
where w(m) is the window coefficient corresponding to the signal sample, and N is the window length. The most obvious difference is that voiced speech has higher short-time energy (STE), whereas unvoiced speech has lower STE. In one embodiment, a Hamming window may be chosen as it minimizes the maximum side lobe in the frequency domain and can be described as:

$$w(m) = 0.54 - 0.46\cos\left(\frac{2\pi m}{N-1}\right), \quad 0 \le m \le N-1$$
As previously mentioned, short-time processing of speech may preferably take place during segments between 10 and 30 ms in length. For a signal with an 8 kHz sampling frequency, a window of 128 samples (16 ms) may be used. STE estimation is useful as a speech detector because there is a noticeable difference in average energy between voiced and unvoiced speech, and between speech and silence. Accordingly, this technique may be paired with short-time zero crossing for a robust detection scheme.
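A minimal Python sketch of short-time energy estimation is given below, assuming the common form in which each Hamming-windowed frame is squared and averaged. The function name `short_time_energy` and the hop size of 64 samples are assumptions made for illustration; only the 128-sample window follows the text.

```python
import numpy as np

def short_time_energy(signal, win_len=128, hop=64):
    # np.hamming(N) implements 0.54 - 0.46*cos(2*pi*m/(N-1)).
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    ste = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + win_len] * window
        ste[i] = np.mean(frame ** 2)   # average of the squared windowed samples
    return ste

# Usage: silence followed by a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(1024) / fs
demo_signal = np.concatenate([np.zeros(1024), np.sin(2 * np.pi * 440 * t)])
demo_ste = short_time_energy(demo_signal)
```

Frames that fall entirely within the silent portion yield zero energy, while frames within the tone yield a clearly larger value, which is the contrast a speech detector would exploit.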
Short-time zero crossing (STZC) may be defined as the rate at which the signal changes sign. It can be mathematically described as:
STZC estimation is useful as a speech detector because there are noticeably fewer zero crossings in voiced speech as compared with unvoiced speech. STZC is advantageous in that it is capable of predicting cry signal start points and endpoints. A significant short-time zero crossing rate effectively describes the envelope of a non-silent signal and, combined with short-time energy, can effectively track instances of potentially voiced signals that are the signals of interest for analysis.
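The zero-crossing measure may be sketched in Python as follows, assuming the common definition in which sign changes between adjacent samples are counted and averaged over each window. The function name `short_time_zero_crossing`, the hop size, and the treatment of a zero sample as positive are assumptions for illustration.

```python
import numpy as np

def short_time_zero_crossing(signal, win_len=128, hop=64):
    signs = np.where(signal >= 0, 1, -1)
    # |diff| is 2 wherever the sign flips, so halving yields 1 per crossing.
    crossings = np.abs(np.diff(signs)) // 2
    n_frames = 1 + (len(crossings) - win_len) // hop
    return np.array([crossings[i * hop:i * hop + win_len].mean()
                     for i in range(n_frames)])

# Usage: a low-frequency tone (voiced-like) versus white noise (unvoiced-like).
fs = 8000
t = np.arange(2048) / fs
zc_tone = short_time_zero_crossing(np.sin(2 * np.pi * 200 * t))
zc_noise = short_time_zero_crossing(np.random.default_rng(0).standard_normal(2048))
```

The tone crosses zero far less often per sample than the noise, mirroring the difference between voiced and unvoiced speech described above.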
Some false positive cries may be detected, as not all signals bounded by the STZC boundary contain cries. Large STZC envelopes with low energy tend to contain cry precursors such as whimpers and breathing events. Likewise, not all signals with non-negligible STE contain cries. Infant coughing events may be bounded by an STZC boundary and contain a noticeable STE. In order to consistently pick up desired cry events, a desired cry may be defined as a voiced segment of sufficiently long duration. Two quantifiable threshold conditions that must be met to constitute a desired voiced segment may be:
Returning to STE processing, as baby cry signals may be downsampled from 44.1 kHz to 7350 Hz, a window length N may be chosen as 128, which translates to a 17.4 ms short-time interval. In order to detect the boundary points of cry words by setting a proper threshold value, the STE is normalized to the range from 0 to 1 by dividing by the maximum STE value over the whole duration. To eliminate unvoiced artifacts of low STE and very short-duration high-energy impulses, two quantifiable thresholds should be set to detect the cry word boundary points. Those two threshold conditions are:
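Purely as an illustration of how two such thresholds could be applied, the following Python sketch marks cry word boundaries on a normalized STE sequence using an energy threshold and a minimum duration. The numeric values `energy_thresh=0.1` and `min_frames=10`, as well as the function name `detect_cry_words`, are hypothetical placeholders and are not values taken from this disclosure.

```python
import numpy as np

def detect_cry_words(ste, energy_thresh=0.1, min_frames=10):
    """Return (start, end) frame indices of segments that stay above the
    normalized-energy threshold for at least min_frames frames."""
    peak = ste.max()
    norm = ste / peak if peak > 0 else ste     # normalize STE into [0, 1]
    active = norm > energy_thresh              # hypothetical energy condition
    words, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                          # candidate start point
        elif not is_active and start is not None:
            if i - start >= min_frames:        # hypothetical duration condition
                words.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_frames:
        words.append((start, len(active)))
    return words

# Usage: one long voiced burst and one short impulse that should be rejected.
demo = np.concatenate([np.zeros(5), np.ones(20), np.zeros(5), np.ones(3), np.zeros(5)])
demo_words = detect_cry_words(demo)
```

In this example the 20-frame burst is kept as a cry word, while the 3-frame impulse is discarded for being too short.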
A short-time segment of speech can be considered stationary. Stationary feature extraction techniques can be divided into cepstral-based algorithms (taking the Fourier transform of the decibel spectrum) and linear-predictor-based algorithms (determining the current speech sample based on a linear combination of prior samples). In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. In practical speech recognition applications, Mel-frequency cepstral coefficients (MFCC) are often considered a well-suited characteristic parameter because they approximate the non-linear low- and high-frequency perception of the human ear.
The mel scale is a perceptual scale of pitches. It is based upon the human perception of the separation on a scale of pitches. The reference point relating the mel scale to standard frequency may be defined by a 1000 Hz tone, 40 dB above the listener's threshold, which is equivalent to a pitch of 1000 mels. What the mel frequency cepstrum provides is a tool that describes the tonal characteristics of a signal that is warped such that it better matches human perceptual hearing of tones (or pitches). The conversion between mel (m) and Hertz (f) can be described as

$$m = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$
The mel frequency cepstrum may be obtained through the following steps. A short-time Fourier transform of the signal is taken in order to obtain the quasi-stationary short-time power spectrum F(f)=F{f(t)}. The frequency portion of the spectrum is then mapped to the mel scale perceptual filter bank with the equation above using 18 triangle band pass filters equally spaced on the mel range of frequency F(m). These triangle band pass filters smooth the magnitude spectrum such that the harmonics are flattened in order to obtain the envelope of the spectrum with harmonics. This indicates that the pitch of a speech signal is generally not present in MFCC. As a result, a recognition system will behave more or less the same when the input utterances are of the same timbre but with different tones/pitch. This also serves to reduce the size of the features involved, making the classification simpler.
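As a small sketch of the mel mapping step, the following Python code converts between Hertz and mel and places 18 filter center frequencies at equal spacing on the mel axis. The widely used constants 2595 and 700 are assumed here, as are the function names and the choice of 0 Hz to half the sampling rate as the band edges.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale convention (assumed): m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_centers(n_filters, fs):
    # n_filters + 2 equally spaced mel points; the interior points are the
    # center frequencies of the triangular band-pass filters.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    return mel_to_hz(mel_points[1:-1])

centers_hz = mel_filter_centers(18, fs=8000)
```

Equal spacing in mel translates to progressively wider spacing in Hertz, so the filters are dense at low frequencies and sparse at high frequencies.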
The log of this filtered spectrum is taken and then the Fourier transform of the log spectrum squared results in the power cepstrum of the signal, or
$$\left|\mathcal{F}\left\{\log\left(|F(m)|^{2}\right)\right\}\right|^{2}.$$
At this point, the discrete cosine transform (DCT) of the power cepstrum is taken to obtain the MFCC, which may be used to measure audio signal similarity. The DCT coefficients are retained as they represent the power amplitudes of the mel frequency cepstrum. To keep the codebook length similar, an nth (e.g., 10th) order MFCC may be obtained. However, in addition to the MFCC, and in order to have a more similar algorithmic basis for comparison in feature classification, the MFLPCC may be used as well. The power cepstrum may possess the same sampling rate as the signal, so the MFLPCC is obtained by performing an LPC algorithm on the power cepstrum in 128-sample frames. The MFLPCC encodes the cepstrum waveform in a more compact fashion that may make it more suitable for a baby cry classification scheme.
An exemplary MFCC feature extract procedure is illustrated in
$$P(k) = |X(k)|^{2}$$
Again, for this example, the number of subband filters is 10, and P(k) are binned onto the mel-scaled frequency axis using 10 overlapping triangular filters. Here binning means that each P(k) is multiplied by the corresponding filter gain and the results are accumulated as the energy in each band. The relationship between frequency and Mel scale can be expressed as follows:
The resulting nonlinear Mel frequency curve is illustrated in
where N is the number of DFT points, and M=10.
where MFCC order M is 10.
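The MFCC procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the mel formula (2595 log10(1 + f/700)), the Hamming window, the DCT-II form, the choice to keep coefficients 0 through 9, and all function names are assumptions introduced here, since the corresponding equations are not reproduced in this text.

```python
import numpy as np

def hz_to_mel(f):
    # Assumed mel formula (a common variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, fs):
    """Overlapping triangular filters equally spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), num_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(num_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, fs, num_filters=10, num_coeffs=10):
    """Return num_coeffs cepstral coefficients for one short-time frame."""
    n_fft = len(frame)
    spectrum = np.fft.rfft(frame * np.hamming(n_fft))
    power = np.abs(spectrum) ** 2                       # P(k) = |X(k)|^2
    energies = mel_filterbank(num_filters, n_fft, fs) @ power  # binning
    log_energies = np.log(energies + 1e-12)             # small offset avoids log(0)
    # DCT-II of the log band energies (assumed form), coefficients 0..num_coeffs-1.
    bands = np.arange(num_filters) + 0.5
    basis = np.cos(np.pi * np.outer(np.arange(num_coeffs), bands) / num_filters)
    return basis @ log_energies
```

For a 128-sample frame sampled at 8 kHz, `mfcc(frame, 8000)` returns a 10-element feature vector of the kind referred to above as MFCC-10.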
In one embodiment, a learning vector quantization (LVQ) neural network model is used. A self-organizing neural network has the ability to assess the input patterns presented to the network, organize itself to learn from the collective set of inputs, and categorize them into groups of similar patterns. In general, self-organized learning involves the frequent modification of the network's synaptic weights in response to a set of input patterns. LVQ is such a self-organizing neural network model that can be used to classify the different baby cry causes. LVQ may be considered a kind of feed-forward ANN, and is advantageously used in areas of pattern recognition or optimization.
Different baby-cry-causes may be assumed to have different feature patterns; as such, the objective of classification is to determine, from example training feature data, a general feature pattern (a kind of MFCC “codebook”) for each specific baby cry cause, such as a “draw attention” cry, a “need to change wet diaper” cry, a “hungry” cry, etc. Subsequently, a baby cry of unknown cause may be recognized by finding the shortest distance between the MFCC-10 feature vector of the unknown input cry word and the “codebook” of each class.
An LVQ algorithm may be used to complete a baby-cry-cause classification, where a plurality of baby-cry-causes may be taken into consideration (e.g., draw attention, diaper change needed, hungry, etc.). Thus, an exemplary LVQ neural network would have a plurality of (e.g., 3) output classes, which would correspond to the main baby-cry-causes:
An exemplary LVQ architecture is shown in
$$X = [x_1 \;\; x_2 \;\; \ldots \;\; x_{10}]^T$$
where all the weights in response to the input vector and output classes can be expressed as:
where $W_1 = [w_{1,1} \; w_{1,2} \; \ldots \; w_{1,10}]^T$ represents the pattern “codebook” of the draw attention cry, $W_2 = [w_{2,1} \; w_{2,2} \; \ldots \; w_{2,10}]^T$ represents the pattern “codebook” of the diaper change needed cry, and $W_3 = [w_{3,1} \; w_{3,2} \; \ldots \; w_{3,10}]^T$ represents the pattern “codebook” of the hungry cry.
The exemplary LVQ neural network model may be trained using the following steps:
and k=1, 2, . . . N,
where N is the number of iterations.
$$\|X(k) - W_j(k)\|^{2}$$
is minimal, and
where $C_{X(k)}$ is the known class index of input X at time k; for example, if input X(k) is the MFCC-10 of a hungry cry word, $C_{X(k)} = 3$. Preferably, only $W_j$ is updated, and the updating rule depends on whether the class index of the input pattern equals the index j obtained in Step 4.
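The training idea described above (find the nearest codebook, then update only that codebook depending on whether its index matches the known class) can be sketched as follows. This is a hedged illustration using the standard LVQ1 rule; the learning-rate schedule, the class-mean initialization, and the names `train_lvq` and `classify` are assumptions introduced here, and classes are indexed from 0 rather than from 1 as in the text.

```python
import numpy as np

def train_lvq(features, labels, num_classes, num_iterations=50,
              learning_rate=0.1, seed=0):
    """Train one codebook vector per class with the LVQ1 rule."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize each codebook W_j as the mean of its class.
    codebooks = np.stack(
        [features[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    for k in range(num_iterations):
        rate = learning_rate * (1.0 - k / num_iterations)  # decaying step size
        for idx in rng.permutation(len(features)):
            x = features[idx]
            # Index j of the codebook with minimal squared distance to x.
            j = int(np.argmin(np.sum((codebooks - x) ** 2, axis=1)))
            if j == labels[idx]:
                codebooks[j] += rate * (x - codebooks[j])  # pull toward x
            else:
                codebooks[j] -= rate * (x - codebooks[j])  # push away from x
    return codebooks

def classify(codebooks, x):
    """Return the class whose codebook is closest to feature vector x."""
    return int(np.argmin(np.sum((codebooks - x) ** 2, axis=1)))
```

With 10-dimensional MFCC-10 vectors and three cry causes, `codebooks` has shape (3, 10), and an unknown cry is assigned to the class returned by `classify`.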
The “draw attention cry words,” “diaper change needed cry words,” and “hungry cry words” MFCC-10 features of 4 different babies are illustrated in
In another embodiment, linear predictive coding (LPC) may be utilized to obtain baby cry characteristics. In certain cases, the waveforms of two similar sounds will also show similar characteristics. If two infant cries have very similar waveforms, it stands to reason that they should possess the same impetus. However, it is impractical to conduct a sample by sample full comparison between cry signals due to the complexity inherent in having audio signals of around 1 second in length at a sampling rate of 8 kHz. In order to improve the solution of the time domain comparison of infant cry signals, linear predictive coding (LPC) is applied.
As mentioned previously, there may be two acoustic sources associated with voiced and unvoiced speech, respectively. Voiced speech is caused by the vibration of the vocal cords in response to airflow from the lungs, and this vibration is periodic in nature, while unvoiced speech is caused by constrictions in the air tract resulting in random airflow. The basis of the source-filter model of speech is that speech can be synthesized by generating an acoustic source and passing it through an all-pole filter. The linear predictive coding (LPC) algorithm produces a vector of coefficients that represent a spectral shaping filter. The input signal to this filter is either a periodic pulse train at the pitch period for voiced sounds, or white noise for unvoiced sounds. This shaping filter may be an all-pole filter represented as:
where {ai} are the linear prediction coefficients and M is the number of poles (the roots of the denominators in the z transform). A present sample of speech may be represented as a linear combination of the past M samples of the speech such that:
where $\hat{x}(n)$ is the predicted value of x(n).
The error between the actual and predicted signal can be defined as
The smaller the error, the better the spectral shaping filter is at synthesizing the appropriate signal. Taking the derivative of the above equation with respect to ai and equating to 0 yields:
Minimization of the error yields a set of linear equations in terms of the error between the actual and predicted signal, expressed above. To obtain the minimum mean square error, an autocorrelation method may be used, in which the minimum is found by applying the principle of orthogonality: for the predictor coefficients that minimize the prediction error, the error must be orthogonal to the past samples.
This can be achieved by forming a Toeplitz autocorrelation matrix R for the LPC parameters and using the Levinson-Durbin recursion to solve the resulting Toeplitz system.
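A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion is given below. It assumes the predictor convention x̂(n) = Σ a_i x(n−i) for i = 1..M; the sign convention and the function names are assumptions introduced here for illustration.

```python
import numpy as np

def autocorrelation(frame, order):
    """Autocorrelation values r(0)..r(order) of one frame."""
    n = len(frame)
    return np.array(
        [np.dot(frame[: n - lag], frame[lag:]) for lag in range(order + 1)]
    )

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for a_1..a_M.

    Returns the predictor coefficients and the final prediction error energy.
    """
    a = np.zeros(order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / error
        previous = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = previous[j] - k * previous[i - j]
        error *= 1.0 - k * k
    return a[1:], error

def lpc(frame, order=10):
    """LPC coefficients of one frame via the autocorrelation method."""
    r = autocorrelation(np.asarray(frame, dtype=float), order)
    return levinson_durbin(r, order)
```

The recursion needs on the order of M² operations rather than the M³ of a general linear solver, which is why it is a natural fit for the Toeplitz structure of R.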
Effectively, the purpose of LPCC is to take a waveform of a large size in unit samples and then compress it into a more manageable form. Because similar waveforms should also result in similar acoustic output, LPC serves as a time domain measure of how close two different waveforms are.
Because of the sampling rate of 8 kHz and the rule of thumb that f/1000+2 LPC coefficients (where f is the sampling rate in Hz, so that 8000/1000+2=10) are the minimum required to decompose a waveform, 10 LPCC or LPC-10 may be used to describe each 128-sample frame, which corresponds to 16 ms and is assumed to be short-time stationary. Instead of computing the difference between windowed segments of 128 samples in length, only comparisons of the LPC-10 values of the segments are needed. Furthermore, during signal preprocessing, a first order low pass filter can be used to brighten the signal such that components due to non-vocal tract speech can be attenuated.
In another embodiment, cepstrum analysis may be used to obtain baby cry characteristics. To obtain the frequency spectrum F(w), a Fourier transform, denoted by F{ }, must be performed on the time domain signal f(t) as F(w)=F{f(t)}. However, it is possible to take the Fourier transform of the log spectrum as if it were a signal as well. The result of this transformation moves one from the frequency spectrum domain to the power cepstrum domain described by
$$\left|\mathcal{F}\left\{\log\left(\left|\mathcal{F}\{f(t)\}\right|^{2}\right)\right\}\right|^{2}.$$
The cepstrum provides information about the rate of change in the different spectrum bands. This attribute can be exploited as a pitch detector. For example, if the sampling rate of a cry signal is 8 kHz and there is a large peak in the cepstrum where the quefrency (the x-axis analog of frequency in the cepstrum domain) is 20 samples, the peak indicates the existence of a pitch of 8000/20=400 Hz. This peak occurs in the cepstrum because the harmonics in the spectrum are periodic, and the period corresponds to the pitch.
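The pitch example above can be sketched as follows. This is an illustrative sketch under assumptions: the synthetic test tone, the Hamming window, the 100 to 1000 Hz search range, and the function names are all introduced here and are not part of the disclosure.

```python
import numpy as np

def harmonic_tone(f0, fs, num_samples):
    """Synthetic voiced-like signal: equal-amplitude harmonics of f0 below fs/2."""
    t = np.arange(num_samples) / fs
    num_harmonics = int((fs / 2 - 1) // f0)
    return sum(np.sin(2 * np.pi * f0 * h * t) for h in range(1, num_harmonics + 1))

def cepstral_pitch(signal, fs, fmin=100.0, fmax=1000.0):
    """Estimate pitch from the largest power-cepstrum peak in a quefrency range."""
    windowed = signal * np.hamming(len(signal))
    power_spectrum = np.abs(np.fft.fft(windowed)) ** 2
    log_spectrum = np.log(power_spectrum + 1e-12)      # offset avoids log(0)
    power_cepstrum = np.abs(np.fft.fft(log_spectrum)) ** 2
    q_lo = int(fs / fmax)                              # shortest period, in samples
    q_hi = int(fs / fmin)                              # longest period, in samples
    peak_q = q_lo + int(np.argmax(power_cepstrum[q_lo:q_hi + 1]))
    return peak_q, fs / peak_q

fs = 8000
quefrency, pitch = cepstral_pitch(harmonic_tone(400.0, fs, 1024), fs)
print(quefrency, pitch)
```

Restricting the search to a plausible pitch range keeps the low-quefrency part of the cepstrum, which reflects the spectral envelope rather than the pitch, out of the peak search.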
Cepstrum pitch determination is particularly effective because the effects of the vocal excitation (pitch) and vocal tract (formants) are additive in the logarithm of the power spectrum and thus clearly separate. This trait makes cepstrum analysis of audio signals more robust than processing normal frequency or time domain samples. Another technique used to improve the accuracy of feature extraction of cepstrum based techniques is liftering. Liftering applies a low order low pass filter to the cepstrum in order to smooth it out and help with the Discrete Cosine Transform (DCT) analysis for feature extraction techniques in ensuing sections. Additionally, linear predictive cepstral coefficients (LPCC) may be used for audio feature extraction. LPCCs may be obtained by applying linear predictive coding on the cepstrum. As mentioned above, the cepstrum is a measure of the rate of change in spectrum bands over windowed segments of individual cries. Applying LPC to the cepstrum yields a vector of values for a 10-tap filter that would synthesize the cepstrum wave form.
Similar to the MFCC, the Bark frequency cepstral coefficients (BFCC) warp the power cepstrum such that it matches human perception of loudness. The methodology of obtaining the BFCC is similar to that of the MFCC except for two differences. The frequencies are converted to the Bark scale according to:
where b denotes Bark frequency and f is frequency in hertz. The mapped Bark frequency is passed through a plurality (e.g., 18) of triangle band pass filters. The center frequencies of these triangular band pass filters correspond to the first 18 of the 24 critical frequency bands of hearing (where the band edges are at 20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000 and 15500 Hz). This is done because frequencies above 4 kHz may be attenuated by the low pass anti-aliasing filter described in signal preprocessing. This also allows for a more direct comparison between the MFLPCC and BFLPCC later on.
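As a small illustration of the band edges listed above, the following sketch looks up which critical band contains a given frequency. Only the edge values come from the text; the function name and the 1-based band numbering are assumptions made for this example.

```python
import numpy as np

# Critical band edges in Hz as listed above: 25 edges delimit 24 bands.
BAND_EDGES_HZ = np.array(
    [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
     2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500],
    dtype=float,
)

def critical_band(freq_hz):
    """Return the 1-based critical band containing freq_hz, or None if outside."""
    if freq_hz < BAND_EDGES_HZ[0] or freq_hz >= BAND_EDGES_HZ[-1]:
        return None
    return int(np.searchsorted(BAND_EDGES_HZ, freq_hz, side="right"))

# The first 18 bands end at the 19th edge, just above the 4 kHz cutoff.
upper_edge_of_band_18 = BAND_EDGES_HZ[18]
```

Here `upper_edge_of_band_18` is 4400 Hz, so the first 18 bands cover the range that survives an anti-aliasing filter for 8 kHz sampling.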
The BFCC is obtained by taking the DCT of the bark frequency cepstrum and the 10 DCT coefficients describe the amplitudes of the cepstrum. The power cepstrum also possesses the same sampling rate as the signal, so the BFLPCC is obtained by performing the LPC algorithm on the power cepstrum in 128 sample frames. The BFLPCC encodes the cepstrum waveform in a more compact fashion that may make it more suitable for a baby-cry classification scheme.
In another exemplary embodiment, Kalman filters may be utilized for baby voice feature extraction. One characteristic of analog generated sources of noise is that no two signals are identical. As similar as two sounds may be, they will inherently vary to some degree in pitch, volume and intonation. Regardless, it can be said that adjoining infant cries are highly similar and most likely have the same meaning. In order to estimate the true cry from the recorded cries, a Kalman filter formulation may be used.
If x(n) is arranged as an AR(p) (auto-regressive process of order p), it may be generated according to
Supposing that x(n) is measured in the presence of additive noise, then
y(n)=x(n)+v(n) (B)
If we let x(n) be the p-dimensional state vector
then (A) and (B) can be expressed in terms of x(n) as
Equations (C) and (D) can be simplified using matrix notation:
x(n)=Ax(n−1)+w(n)
y(n)=c^T x(n)+v(n) (E)
where A is a p×p state transition matrix, w(n)=[w(n), 0, . . . , 0]^T is a vector noise process and c is a unit vector of length p. Even though it is applicable primarily in stationary AR(p) processes, (D) can be generalized to a non-stationary process by letting x(n) be a state vector of dimension p that evolves according to the difference equation
x(n)=A(n−1)x(n−1)+w(n)
where A(n−1) is a time varying p×p state transition matrix and w(n) is a vector of zero-mean white noise processes and let y(n) be a vector of observations that are formed according to
y(n)=C(n)x(n)+v(n)
where y(n) is a vector of length q, C(n) is a time varying q×p matrix and v(n) is a vector of zero mean white noise processes that are statistically independent of w(n).
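The stationary state-space model (E) can be combined with the standard Kalman recursion as sketched below. This is an illustration under assumptions: the AR convention x(n) = Σ a(k) x(n−k) + w(n), the companion form of A, the initial state and covariance, and all function names are introduced here and are not taken from the disclosure.

```python
import numpy as np

def companion_matrix(ar_coeffs):
    """p x p state transition matrix A for an AR(p) process (assumed form)."""
    p = len(ar_coeffs)
    A = np.zeros((p, p))
    A[0, :] = ar_coeffs
    A[1:, :-1] = np.eye(p - 1)
    return A

def simulate_ar(ar_coeffs, num_samples, process_var, measurement_var, seed=0):
    """Generate x(n) and the noisy measurement y(n) = x(n) + v(n)."""
    rng = np.random.default_rng(seed)
    p = len(ar_coeffs)
    x = np.zeros(num_samples + p)
    w = rng.normal(0.0, np.sqrt(process_var), num_samples + p)
    for n in range(p, num_samples + p):
        x[n] = np.dot(ar_coeffs, x[n - p:n][::-1]) + w[n]
    x = x[p:]
    y = x + rng.normal(0.0, np.sqrt(measurement_var), num_samples)
    return x, y

def kalman_filter(y, A, c, process_var, measurement_var):
    """Estimate x(n) from scalar observations y(n) = c^T x(n) + v(n)."""
    p = A.shape[0]
    x_est = np.zeros(p)          # assumed initial state
    P = np.eye(p)                # assumed initial error covariance
    Q = np.zeros((p, p))
    Q[0, 0] = process_var        # w(n) = [w(n), 0, ..., 0]^T
    estimates = np.empty(len(y))
    for n, y_n in enumerate(y):
        # Predict from the previous estimate.
        x_pred = A @ x_est
        P_pred = A @ P @ A.T + Q
        # Update with the new observation.
        gain = P_pred @ c / (c @ P_pred @ c + measurement_var)
        x_est = x_pred + gain * (y_n - c @ x_pred)
        P = P_pred - np.outer(gain, c) @ P_pred
        estimates[n] = x_est[0]
    return estimates
```

For example, with `A = companion_matrix([1.3, -0.4])` and `c = np.array([1.0, 0.0])`, the filter output tracks the underlying AR(2) signal more closely than the raw noisy measurements do.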
It can be appreciated by those skilled in the art that the present disclosure provides innovative systems, apparatuses and methods for electronic devices that integrate active noise control (ANC) techniques for abating environmental noises with a communication system that communicates to and from an infant. Such configurations may be advantageously used for infant incubators, hospital beds, and the like. The wireless communication system can also provide communication between infants and their parents/caregivers/nurses, and between patients/family members/nurses/physicians, and can also provide intelligent digital monitoring that provides non-invasive detection and classification of an infant's audio signals and other audio signals.
In the foregoing Detailed Description, it can be seen that various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment.
This application is a continuation of U.S. patent application Ser. No. 13/837,242, titled “Apparatus, System and Method for Noise Cancellation and Communication For Incubators and Related Devices,” filed on Apr. 23, 2013, which is a continuation-in-part of U.S. patent application Ser. No. 13/673,005, titled “Encasement for Abating Environmental Noise, Hand-Free Communication and Non-Invasive Monitoring and Recording,” filed on Nov. 9, 2012, which is a continuation of U.S. patent application Ser. No. 11/952,250 (now U.S. Pat. No. 8,325,934), titled “Electronic Pillow for Abating Snoring/Environmental Noises, Hands-Free Communications, And Non-Invasive Monitoring And Recording,” filed Dec. 7, 2007. The disclosures set forth in the referenced applications are incorporated herein by reference in their entireties.
Relation | Number | Date | Country
---|---|---|---
Parent | 13837242 | Mar 2013 | US
Child | 14965176 | | US
Parent | 11952250 | Dec 2007 | US
Child | 13673005 | | US

Relation | Number | Date | Country
---|---|---|---
Parent | 13673005 | Nov 2012 | US
Child | 13837242 | | US