The present invention relates to a system and method for a user wearing a headset to be aware of an outer sound environment while listening to music or any other audio source.
Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. Various VAD algorithms are known. Conventional algorithmic solutions used for VAD are known to suffer from the problem of a poor detection score when the input signal is noisy.
VAD plays a role in many speech processing applications including speech recognition, speech compression and noise reduction systems. In
One criterion for detecting speech is to look for voiced parts, as those are periodic and have a mathematically well-defined structure that can be used in an algorithm. Another approach is to use a statistical model for speech, estimate its parameters from acquired data samples and use the classic results of decision theory to arrive at a speech/noise classification for each frame.
It is known that VAD performance decreases as the amount of noise increases. A conventional solution is to have the VAD system preceded by a noise reduction (NR) module. One known limitation of pre-processing a speech signal with noise reduction (NR) is the potential appearance of musical noise which, when added to the input signal, may mislead the VAD module and create false detections.
Another drawback with the use of conventional NR modules is the difficulty, and even the impossibility, of setting internal parameters that allow the system to work correctly for different noise levels and categories. As an example, if one chooses a set of internal parameters to tackle a very noisy environment, then relatively significant distortions will appear in silent and quiet environments.
To overcome the above drawbacks, which not only impact the audio quality but may even harm the VAD module performance, it is desirable to provide an improved mechanism for detecting the noise level of the environment and allowing the dynamic setting of the NR internal parameters.
It is desirable to provide an improved noise-robust VAD method and a system for allowing a user to be aware of an outer sound environment while listening to music or any other audio source.
The present invention relates to a voice aware audio system and a method for a user wearing a headset to be aware of an outer sound environment while listening to music or any other audio source. The present invention relates to a concept of an adjustable sound awareness zone which gives the user the flexibility to avoid hearing far distant voices. The system of the present invention can use features of a headphone as described in US Patent Publication Number 2016/0241947 hereby incorporated by reference into this application. In one embodiment, the headphone includes a microphone array having four input microphones. This provides spatial sound acquisition selectivity and allows the steering of the microphone array towards directions of interest. Using beamforming methods and combining with different technologies like noise reduction systems, fractional delay processing and a voice activity detection (VAD) algorithm of the present invention, a new audio architecture is provided with improved performance in noisy environments.
The present invention includes different signal processing modules including noise reduction and array processing. In particular, a procedure is provided which estimates the noise level, referred to as Noise Sensing (NS). This procedure adapts the parameters of the noise reduction so that output sound quality is optimized. Once voice has been detected, the user can be alerted via a headphone signal without disrupting the music or other audio source that the user was listening to. This is done by mixing the external voice with the headphone lead signal. A mixing mechanism is used which can take into account psychoacoustic properties and allow final mixing without reducing the volume of the music signal while at the same time maximizing intelligibility.
Typical applications of the voice awareness audio system of the present invention can appear within the following scenarios: voice, for example a person shouting, talking or calling, a baby crying, public transport announcements; bells and alarms, for example someone ringing a door bell, a door bell activated for a package delivery, house, car and other alarms; and others, for example a car horn, police and ambulance air-raid sirens, and whistles. The invention will be more fully described by reference to the following drawings.
Reference will now be made in greater detail to a preferred embodiment of the invention, an example of which is illustrated in the accompanying drawings. Wherever possible, the same reference numerals will be used throughout the drawings and the description to refer to the same or like parts.
The voice aware audio system of the present invention allows any user wearing a headphone to be aware of the outer sound environment while listening to music or any other audio source. In one embodiment, the voice aware audio system can be implemented as a headphone which has four input microphones as described for example in US Patent Publication No. 2016/0241947. The user will be prompted by hearing a voice or a set of defined sounds of interest when the signal coming from the headphone microphone is recognized to be a desired signal. When the signal coming from the microphone is not recognized as a voice or any signal of interest, the listener will not be disrupted by the microphone signal and will just hear the lead signal.
A sub-system called Adjustable Sound Awareness Zone (ASAZ) can be used with voice aware audio system 10 as depicted in
Referring to
In one embodiment, headphone 12 has one or more omni-directional microphones 15 located in ear pads 14. Headphone 12 can include four omni-directional microphones 15 as shown in
In an implementation of block 22, Wiener entropy or spectral flatness is used in order to reduce the computational burden of the two successive procedures. The FFT of the input buffer can also be used for noise reduction as described below.
In an implementation of block 24 a pitch estimation algorithm is used. In one embodiment, the pitch estimation algorithm is based on a robust YIN algorithm. The estimation process can be simplified into a detection-only process or the complete algorithm can be used ensuring continuity of the estimated pitch values between successive frames to render the algorithm even more robust against errors.
Successive decisions over subframes within a frame, plus overlap between successive large frames, increase the accuracy of the algorithm, which is referred to as the WEYIN (Wiener Entropy YIN) algorithm.
In one embodiment for VAD, the method can be performed with different combinations of frequency-domain features in block 22 to detect potential voiced (pitched) frame candidates that will be re-analyzed in the time domain in block 24.
The Wiener entropy given as:
can be computed using:
This leads to the following equation:
The Wiener entropy can be computed in different bands Bi, i=1, . . . , L, so that the candidate selection process is done through the computation of the L scalar quantities:
which are sent to the selection process after a threshold decision step:
Bi(k) ≶ ηi, i=1, . . . , L.
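By way of a non-limiting illustration, the frequency-domain candidate selection described above can be sketched as follows. The sketch uses the standard definition of Wiener entropy (spectral flatness) as the ratio of the geometric mean to the arithmetic mean of the power spectrum in each band. The function names, the band edges, the threshold values, the direction of the comparison (a band counts as tonal when its flatness falls below its threshold) and the rule that one tonal band suffices are all assumptions made for illustration and are not taken from the equations of the specification.

```python
import numpy as np

def multiband_wiener_entropy(frame, bands, eps=1e-12):
    """Spectral flatness of `frame` in each (lo_bin, hi_bin) band; hi_bin is exclusive."""
    spectrum = np.fft.rfft(np.asarray(frame, dtype=float))
    # Power spectrum from real and imaginary parts, so no square root is needed.
    power = spectrum.real ** 2 + spectrum.imag ** 2 + eps
    flatness = []
    for lo, hi in bands:
        p = power[lo:hi]
        geometric_mean = np.exp(np.mean(np.log(p)))
        arithmetic_mean = np.mean(p)
        flatness.append(geometric_mean / arithmetic_mean)
    return np.array(flatness)

def is_voiced_candidate(frame, bands, thresholds):
    """Assumed rule: the frame is a candidate if any band is flatter-than-threshold tonal."""
    flatness = multiband_wiener_entropy(frame, bands)
    return bool(np.any(flatness < np.asarray(thresholds)))

# Hypothetical example: a harmonic tone versus white noise at 8 kHz, N = 512.
fs, n = 8000, 512
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 250 * t) + 0.5 * np.sin(2 * np.pi * 500 * t)
noise = np.random.default_rng(0).standard_normal(n)
bands = [(1, 64), (64, 128), (128, 257)]   # illustrative band edges in FFT bins
thresholds = [0.1, 0.1, 0.1]               # illustrative thresholds
tone_is_candidate = is_voiced_candidate(tone, bands, thresholds)
noise_is_candidate = is_voiced_candidate(noise, bands, thresholds)
```

A strongly periodic frame concentrates its energy in a few bins, which drives the flatness toward zero, whereas a noise-like frame yields a flatness well above the illustrative thresholds.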
Once the frame has been designated as a candidate for speech presence, the time-domain inspection begins in block 24. The YIN algorithm can be used over K subframes of length M such that:
N=KM,
where:
N=2^L,
is the frame length used in the spectrum-domain and chosen to be a power of 2, in order to be able to use the FFT.
The YIN algorithm is turned from a pitch estimation algorithm into a pitch detection one. For that purpose, a frequency band [Fpmin, Fpmax] is defined, corresponding to the minimum and maximum expected pitch frequency values, which leads to the interval of time lags [τmin, τmax]:
τmin=⌊Fs/Fpmax⌋ and τmax=⌈Fs/Fpmin⌉,
where Fs is the sampling frequency, which can be a fraction of the original sampling frequency used for the processing in the frequency domain, and ⌊ ⌋ and ⌈ ⌉ are respectively the floor and ceiling rounding operators. As an example, if [Fpmin, Fpmax]=[70, 400] Hz and Fs=8 kHz, then [τmin, τmax]=[20, 115].
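The lag interval above can be checked with a short sketch. The function name is hypothetical; the arithmetic follows the floor and ceiling expressions given above.

```python
import math

def pitch_lag_bounds(fs_hz, f_min_hz, f_max_hz):
    """Return (tau_min, tau_max) in samples for the expected pitch band."""
    tau_min = math.floor(fs_hz / f_max_hz)   # shortest period: highest pitch
    tau_max = math.ceil(fs_hz / f_min_hz)    # longest period: lowest pitch
    return tau_min, tau_max

# Example from the text: [70, 400] Hz at Fs = 8 kHz.
bounds = pitch_lag_bounds(8000, 70, 400)     # (20, 115)
```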
The following matrix of time-delay lags is defined:
where the rounding operator in the matrix rounds to the nearest integer and (0:m)=(0 1 2 . . . m−1 m). Reconsidering the example above:
With this choice, computations of the YIN difference function will be done according to the lag values of the first and second rows of the matrix Δ. The first column of this matrix gives the relative indices from which the difference function computation departs.
Over the present frame, a set of difference function values is defined taken over successive intervals of length H. They are organized in a matrix with number of rows and columns defined as:
The YIN difference matrix dd is defined by its generic element as:
Consider then:
and the quantity:
The algorithm proceeds by computing:
and looks for the minimum:
rr(i)=min(Dn(τmin:τmax)),
which is compared to a threshold:
rr(i) ≶ φ.
If this minimum is smaller than the threshold, a decision of speech presence, βi=1, is taken for subframe i.
Once decisions have been made on the K successive subframes of the present frame, speech presence over the complete frame is decided by a majority vote:
where Q may be chosen to be, but is not restricted to, K/2.
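The time-domain stage can be sketched as follows, under stated assumptions. The sketch uses the standard YIN cumulative-mean-normalized difference function as a stand-in for the quantities whose equations appear above, takes the minimum over [τmin, τmax], compares it to a threshold φ, and applies a majority vote over the K subframes. The function names, the default φ=0.15, the strict comparison against Q, and the treatment of a silent subframe as unvoiced are assumptions for illustration.

```python
import numpy as np

def yin_minimum(sub, tau_min, tau_max):
    """Smallest cumulative-mean-normalized difference for lags tau_min..tau_max."""
    sub = np.asarray(sub, dtype=float)
    w = len(sub) - tau_max                    # number of samples compared at each lag
    if w <= 0:
        raise ValueError("subframe must be longer than tau_max")
    taus = np.arange(1, tau_max + 1)
    d = np.array([np.sum((sub[:w] - sub[t:t + w]) ** 2) for t in taus])
    cum = np.cumsum(d)
    cmnd = np.ones_like(d)                    # stays 1 where the signal is silent
    np.divide(d * taus, cum, out=cmnd, where=cum > 0)
    return float(cmnd[tau_min - 1:tau_max].min())

def frame_speech_decision(frame, k, tau_min, tau_max, phi=0.15, q=None):
    """Majority vote over K subframe decisions; returns (decision, betas)."""
    frame = np.asarray(frame, dtype=float)
    m = len(frame) // k                       # subframe length M, with N = K * M
    q = k / 2 if q is None else q             # Q = K/2 unless specified otherwise
    betas = [int(yin_minimum(frame[i * m:(i + 1) * m], tau_min, tau_max) < phi)
             for i in range(k)]
    return sum(betas) > q, betas

# Hypothetical example: N = 1024, K = 4, Fs = 8 kHz, lags [20, 115].
n = np.arange(1024)
periodic = np.sin(2 * np.pi * 100 * n / 8000)             # 100 Hz, period 80 samples
random_frame = np.random.default_rng(1).standard_normal(1024)
periodic_decision, periodic_betas = frame_speech_decision(periodic, 4, 20, 115)
noise_decision, noise_betas = frame_speech_decision(random_frame, 4, 20, 115)
```

Because only the minimum over the lag interval is compared to φ, no pitch value needs to be refined or tracked, which is what turns the estimator into a detector.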
In one embodiment a Wiener entropy simplification can be used in block 22.
In order to avoid the square root vectorial operation |Xf(l, k)|=√(ℜ²{Xf(l, k)}+ℑ²{Xf(l, k)}), which can be costly, it is chosen to use:
where:
Sf(l,k)=ℜ²{Xf(l,k)}+ℑ²{Xf(l,k)}=|Xf(l,k)|².
In one embodiment, a YIN simplification can be used in block 24.
For the time-domain part, the following YIN version can be used:
In this last equation, the squared difference function is replaced by the absolute value in order to reduce the number of operations.
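As an illustrative sketch of this simplification, the following compares a squared-difference function with its absolute-value counterpart over the lag interval. The function name and the windowing convention (each lag compares the same number of samples) are assumptions; the sketch only illustrates that, for a periodic input, both variants reach their minimum at the same lag while the absolute-value variant avoids one multiplication per sample.

```python
import numpy as np

def difference_function(sub, tau_min, tau_max, use_abs=True):
    """Return (lags, d) where d[i] is the difference measure at lags[i]."""
    sub = np.asarray(sub, dtype=float)
    w = len(sub) - tau_max                    # samples compared at every lag
    lags = np.arange(tau_min, tau_max + 1)
    if use_abs:
        d = [np.sum(np.abs(sub[:w] - sub[t:t + w])) for t in lags]
    else:
        d = [np.sum((sub[:w] - sub[t:t + w]) ** 2) for t in lags]
    return lags, np.array(d)

# Hypothetical subframe: 100 Hz tone at 8 kHz (period of 80 samples), M = 256.
sub = np.sin(2 * np.pi * 100 * np.arange(256) / 8000)
lags_abs, d_abs = difference_function(sub, 20, 115, use_abs=True)
lags_sq, d_sq = difference_function(sub, 20, 115, use_abs=False)
best_lag_abs = int(lags_abs[np.argmin(d_abs)])
best_lag_sq = int(lags_sq[np.argmin(d_sq)])
```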
There exists an overlap of J samples between two successive frames (decision of speech presence is valid for the J first samples only).
If rk(i+1) is the k-th row of the matrix ddi+1 at time i+1, then:
where rm(i+1) is the m-th row of the matrix ddi+1 and ddi(2:nRows, :) is the matrix extracted from dd associated with the present frame i, from row 2 to row nRows.
From the previous equation, it is easily deduced that:
Therefore, there is no need to compute all the elements of the matrix dd before computing the sum of its rows. Instead, the vector Dd(i) is updated by computing rnRows(i) and nnRows(i).
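The recursive update can be illustrated with a minimal sketch, assuming that the matrix at time i+1 consists of rows 2 to nRows of the matrix at time i followed by one newly computed row. Under that assumption, the sum of rows is updated by subtracting the row that leaves and adding the row that enters. The class name and its interface are hypothetical.

```python
from collections import deque
import numpy as np

class RunningRowSum:
    """Keeps the column-wise sum of the last n_rows rows without re-summing them."""

    def __init__(self, n_rows, n_cols):
        self.rows = deque([np.zeros(n_cols) for _ in range(n_rows)], maxlen=n_rows)
        self.total = np.zeros(n_cols)

    def push(self, new_row):
        new_row = np.asarray(new_row, dtype=float)
        oldest = self.rows[0]                 # row 1 of the current matrix
        self.total = self.total - oldest + new_row
        self.rows.append(new_row)             # maxlen discards the oldest row
        return self.total

# Hypothetical usage: 4 rows of 3 lag values each, updated over 10 steps.
rng = np.random.default_rng(0)
history = [rng.standard_normal(3) for _ in range(10)]
acc = RunningRowSum(n_rows=4, n_cols=3)
for row in history:
    running_total = acc.push(row)
```

Each update costs one new row plus two vector additions, instead of recomputing and summing all nRows rows.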
Referring to
In this method or algorithm, decVad denotes the input decision coming from the speech decision module 40 shown in
In one embodiment, noise sensing (NS) architecture 50 optimizes the noise reduction (NR) audio quality output for all possible noise levels while preventing, as much as possible, the appearance of musical noise. Output 51 of noise sensing (NS) can be used in adaptive noise reduction (NR) module 70 as depicted in
Output from distance computation module 76 is used in hangover decision module 77. In order to control the frequency of switching between noise-level states, three noise-level states are defined as noise, intermediary and no noise. These states are determined in hangover decision module 77 such that voice aware audio system 10 does not switch states on sudden or impulsive noises. Adaptive noise reduction module 78 processes the signal from hangover decision module 77 to reduce noise. Both raw signal G1 80 and processed signal G2 82 are mixed in mixer 84 to provide clean signal 85, which is transmitted to voice activity detection (VAD) architecture system 30, with the adaptive convex linear combination:
y=G1x1+(1−G1)x2,
where x1 is the raw microphone input, x2 is the NR module output and y is the input of the VAD module.
G1 depends on the root mean square (RMS) value ξ which can be computed either in a time or frequency domain.
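A minimal sketch of this convex combination follows. The specification states that G1 depends on the RMS value ξ but does not give the mapping, so the linear ramp below, its two RMS breakpoints, and its direction (more raw signal in quiet conditions, more NR output in noisy conditions) are assumptions chosen only for illustration.

```python
import numpy as np

def rms(x):
    """Root mean square value of a signal block, computed in the time domain."""
    return float(np.sqrt(np.mean(np.square(x))))

def raw_gain(xi, quiet_rms=0.01, noisy_rms=0.1):
    """Hypothetical mapping from RMS to G1: 1 when quiet, 0 when noisy."""
    return float(np.clip((noisy_rms - xi) / (noisy_rms - quiet_rms), 0.0, 1.0))

def vad_input(x1, x2):
    """y = G1*x1 + (1 - G1)*x2, with x1 the raw input and x2 the NR output."""
    g1 = raw_gain(rms(x1))
    return g1 * x1 + (1.0 - g1) * x2

# Hypothetical blocks: a quiet raw input and a loud raw input.
nr_output = np.linspace(-0.5, 0.5, 256)
quiet_raw = 0.001 * np.ones(256)
loud_raw = np.ones(256)
y_quiet = vad_input(quiet_raw, nr_output)
y_loud = vad_input(loud_raw, nr_output)
```

Because G1 stays within [0, 1], the output is always a convex combination of the two signals and never exceeds the larger of the two in magnitude.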
NR algorithms and their corresponding internal setting parameters can be adjusted with the objective of limiting musical noise and audio artefacts to a minimum while maximizing the reduction of ambient noise.
In one embodiment, voice aware audio system 10 can include headphone 12 having a microphone array and, for example, a four-channel procedure. An advantage of a multiple-channel procedure is that it exploits spatial information to increase efficiency. Because a speaker is localized in space, the propagation of the speaker's voice to the microphone array follows a coherent path, in contrast to diffuse noise. Typically, the voice picked up on one microphone is a delayed replica of what is recorded on a second microphone.
The microphone array can act as a spatial filter to attenuate sounds coming from non-desired directions while enhancing sounds coming from the selected one(s). The use of a microphone array can help to improve sound quality and/or increase VAD noise robustness and detection accuracy.
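The delayed-replica property and the spatial filtering described above can be sketched with a simple delay-and-sum arrangement. This is an illustrative stand-in and not the beamforming method of the specification: delays are estimated by cross-correlation against the first channel, only integer sample delays are handled, and circular shifts are used for brevity. The function names and the `max_lag` parameter are hypothetical.

```python
import numpy as np

def estimate_delay(ref, other, max_lag):
    """Integer number of samples by which `other` trails `ref`."""
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.dot(ref, np.roll(other, -lag)) for lag in lags]
    return int(lags[int(np.argmax(scores))])

def delay_and_sum(channels, max_lag=16):
    """Align every channel to the first one, then average the aligned channels."""
    ref = channels[0]
    aligned = [ref]
    for ch in channels[1:]:
        aligned.append(np.roll(ch, -estimate_delay(ref, ch, max_lag)))
    return np.mean(aligned, axis=0)

# Hypothetical four-microphone capture of one coherent source.
source = np.random.default_rng(0).standard_normal(1024)
mics = [source, np.roll(source, 3), np.roll(source, 7), np.roll(source, 2)]
measured_delay = estimate_delay(mics[0], mics[2], max_lag=16)
beam = delay_and_sum(mics)
```

Signals arriving from the steered direction add coherently after alignment, while diffuse noise, which is uncorrelated across microphones, is averaged down.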
Once voice is detected in one direction at one of microphones 15 in microphone array 100, localization module 102 localizes a speaker direction of arrival. Beamforming module 104 steers the microphone detecting the voice towards the determined direction and consequently, attenuates noise coming from other directions. Beamforming module 104 provides an enhanced voice signal delivered to speakers 17 of headphone 12 as shown in
In an alternate embodiment, noise comes from all directions. For example, noise can occur in all directions in a train, plane, boat and the like, where noise is mainly due to the engine and has no precise direction of arrival because of the cabin sound reverberation. Conversely, a speaker of interest is always located at a single point in space. Reverberation is rarely a problem because of the proximity of the speaker, for example a few meters at most.
It is preferable that voice awareness audio system 10 ensures high intelligibility. As the user is interrupted by an external voice, it is desirable to keep the music level constant and add the external voice while ensuring the user clearly hears the voice message. This advantage can be achieved by controlling both voice false alarm detections and listening conditions. Voice false alarms can be determined by voice activity detection architecture system 30. In one embodiment, the present invention provides mixing external speech detected by voice activity detection architecture system 30 with music coming from headphone 12 as shown in
It is desirable to ensure the speaker voice delivered by headphone 12 is well understood by the user. In one embodiment, the music is muted, or at least its sound level is reduced, while speech is detected and transmitted. Mixing strategies for improving the voice intelligibility can include adaptive spectral equalization; spatial dissociation; and studio-inspired ad-hoc processing, which can be applied separately or together.
Listening to a speech signal mixed with music drastically decreases its intelligibility, especially when the music already contains a vocal signal. There is evidence from many sources that increasing the signal-to-noise ratio (SNR) at the speech fundamental frequency increases speech understanding. By extension, the higher the SNR for all the harmonics, the better.
In the present invention spectral and temporal information for both voice coming from voice activity detection (VAD) architecture system 30 and music played by the user in headphone 12 are available. In one embodiment, energy of both signals can be compared, especially in the fundamental frequency and associated harmonics bands, and the signals from voice activity detection (VAD) architecture system 30 are increased if they are relatively low when compared to music.
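The energy comparison described above can be sketched as follows, under stated assumptions. The sketch compares voice and music energy in narrow bands around the fundamental frequency and its first harmonics and returns a boost, in dB, to apply to the voice in each band. The function name, the band half-width, the target voice-to-music ratio and the maximum boost are hypothetical parameters, and the fundamental frequency is assumed to be supplied by the pitch stage.

```python
import numpy as np

def harmonic_band_gains(voice, music, fs, f0, n_harmonics=4,
                        half_width_hz=40.0, target_ratio_db=0.0, max_boost_db=12.0):
    """Per-harmonic voice boost (dB) needed to reach the target voice-to-music ratio."""
    freqs = np.fft.rfftfreq(len(voice), d=1.0 / fs)
    voice_power = np.abs(np.fft.rfft(voice)) ** 2
    music_power = np.abs(np.fft.rfft(music)) ** 2
    gains_db = []
    for h in range(1, n_harmonics + 1):
        band = np.abs(freqs - h * f0) <= half_width_hz
        ratio_db = 10.0 * np.log10((voice_power[band].sum() + 1e-12)
                                   / (music_power[band].sum() + 1e-12))
        # Boost only when the voice is weaker than the target; never attenuate.
        gains_db.append(float(np.clip(target_ratio_db - ratio_db, 0.0, max_boost_db)))
    return gains_db

# Hypothetical example: the voice is 20 dB below the music at a 200 Hz fundamental.
fs, n = 8000, 800
t = np.arange(n) / fs
voice = 0.1 * np.sin(2 * np.pi * 200 * t)
music = 1.0 * np.sin(2 * np.pi * 200 * t)
gains = harmonic_band_gains(voice, music, fs, f0=200.0)
```

Only the voice is modified, so the music level stays constant, consistent with the constraint stated in the specification.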
This allows user 401 to separate sound signals in space. As shown in
In block 501, compression reduces inter-phoneme intensity differences, so that temporal masking is reduced and speech loudness is increased. The summation of the compressed and original voice signals ensures the voice still sounds natural. Block 502 brings more harmonics. It is known, for example, that the fundamental frequency (F0), as well as F1 and F2 harmonic information, is critically important for vowel identification and consonant perception. Block 503 aims at cleaning the voice signal by removing low frequency noise and boosting frequency bands of interest, for example: low cut −18 dB/octave up to 70 Hz, −3 dB around 250 Hz, −2 dB around 500 Hz, +2.5 dB around 3.3 kHz and +7 dB around 10 kHz.
The noise-robust VAD method or algorithm of the present invention uses a select-then-check strategy. The first step is done in the frequency domain with a relatively large input buffer, which reduces the impact of noise. Voiced speech signal presence is detected via a multiband Wiener entropy feature, and it is shown how computational complexity can be reduced without harming the properties of the classic Wiener entropy.
The second part of the algorithm is done in the time domain with a simplified version of the YIN algorithm in which pitch estimation has been replaced by simple pitch detection. In order to further reduce the computational complexity, an absolute value difference is used instead of the classical squared difference. This algorithm runs over successive subframes along the total input frame.
The present invention provides a derivation of an adjustable sound awareness zone system. Using the amplitude of the input signal and some features that help to distinguish between the user and distant external voices, the system allows the user to define a spherical area around the user's head within which normal voices can be taken into account by the VAD algorithm. If a person is talking with a normal voice volume outside of this sphere, then the system will reject it.
The present invention provides derivation of a noise sensing system.
The noise reduction method or algorithm, as well as the other main modules such as the VAD and the array processing algorithms, may suffer from the fact that their internal settings cannot easily handle all the possible noise levels, from quiet situations to very noisy ones. To improve the performance of the system, a noise sensing mechanism of the present invention is derived, and it is shown how its integration in the system of the present invention significantly improves the performance of the noise reduction and the VAD algorithms. Indeed, the noise sensing allows a reconfigurable algorithmic architecture with self-adjustable internal parameters including the following interactively related modules: VAD; Noise reduction; Voice localization and Beamforming using a microphone array system; and Computational complexity reduction of different algorithms.
The present invention shows how the computational complexity burden can be significantly reduced. This either reduces the power consumption or gives more room for further processing. The present invention provides a derivation of audio mixing schemes, which is done under the constraints of keeping the music volume constant while increasing the voice intelligibility.
Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components, including hardware processors. Embodiments of the present invention may be implemented in connection with a special purpose or general purpose processor device that includes hardware and/or software components, or special purpose or general purpose computers that are adapted to have processing capabilities.
Embodiments may also include physical computer-readable media and/or intangible computer-readable media for carrying or having computer-executable instructions, data structures, and/or data signals stored thereon. Such physical computer-readable media and/or intangible computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such physical computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, other semiconductor storage media, or any other physical medium which can be used to store desired data in the form of computer-executable instructions, data structures and/or data signals, and which can be accessed by a general purpose or special purpose computer. Within a general purpose or special purpose computer, intangible computer-readable media can include electromagnetic means for conveying a data signal from one part of the computer to another, such as through circuitry residing in the computer.
When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, hardwired devices for sending and receiving computer-executable instructions, data structures, and/or data signals (e.g., wires, cables, optical fibers, electronic circuitry, chemical, and the like) should properly be viewed as physical computer-readable mediums while wireless carriers or wireless mediums for sending and/or receiving computer-executable instructions, data structures, and/or data signals (e.g., radio communications, satellite communications, infrared communications, and the like) should properly be viewed as intangible computer-readable mediums. Combinations of the above should also be included within the scope of computer-readable media.
Computer-executable instructions include, for example, instructions, data, and/or data signals which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Although not required, aspects of the invention have been described herein in the general context of computer-executable instructions, such as program modules, being executed by computers, in network environments and/or non-network environments. Generally, program modules include routines, programs, objects, components, and content structures that perform particular tasks or implement particular abstract content types. Computer-executable instructions, associated content structures, and program modules represent examples of program code for executing aspects of the methods disclosed herein.
Embodiments may also include computer program products for use in the systems of the present invention, the computer program product having a physical computer-readable medium having computer readable program code stored thereon, the computer readable program code comprising computer executable instructions that, when executed by a processor, cause the system to perform the methods of the present invention.
It is to be understood that the above-described embodiments are illustrative of only a few of the many possible specific embodiments, which can represent applications of the principles of the invention. Numerous and varied other arrangements can be readily devised in accordance with these principles by those skilled in the art without departing from the spirit and scope of the invention.
Number | Name | Date | Kind |
---|---|---|---|
5257420 | Byrne, Jr. | Nov 1993 | A |
6888950 | Siskin et al. | May 2005 | B2 |
7970159 | Kleinschmidt et al. | Jun 2011 | B2 |
RE43872 | Trip et al. | Dec 2012 | E |
8340058 | Vedurmudi | Dec 2012 | B2 |
20040156012 | Jannard et al. | Aug 2004 | A1 |
20060205349 | Passier | Sep 2006 | A1 |
20070042762 | Guccione | Feb 2007 | A1 |
20070160249 | LeGette et al. | Jul 2007 | A1 |
20080090524 | Lee et al. | Apr 2008 | A1 |
20080096531 | McQuaide et al. | Apr 2008 | A1 |
20080157991 | Raghunrath et al. | Jul 2008 | A1 |
20080175403 | Tan | Jul 2008 | A1 |
20080177972 | Tan | Jul 2008 | A1 |
20080181419 | Goldstein et al. | Jul 2008 | A1 |
20080201138 | Visser et al. | Aug 2008 | A1 |
20080212791 | Asada | Sep 2008 | A1 |
20090097672 | Bull et al. | Apr 2009 | A1 |
20090109940 | Vedurmudi | Apr 2009 | A1 |
20090186668 | Rahman et al. | Jul 2009 | A1 |
20090208923 | Gelfand | Aug 2009 | A1 |
20090209304 | Ngia | Aug 2009 | A1 |
20090257615 | Bayer, Jr. | Oct 2009 | A1 |
20100040240 | Bonanno | Feb 2010 | A1 |
20100048134 | McCarthy | Feb 2010 | A1 |
20100166243 | Siskin | Jul 2010 | A1 |
20100279608 | Shi-En | Nov 2010 | A1 |
20100296668 | Lee et al. | Nov 2010 | A1 |
20100299639 | Ramsay | Nov 2010 | A1 |
20100308999 | Chornenky | Dec 2010 | A1 |
20110288860 | Schevciw | Nov 2011 | A1 |
20120082335 | Duisters | Apr 2012 | A1 |
20120120270 | Li et al. | May 2012 | A1 |
20120237053 | Alam et al. | Sep 2012 | A1 |
20130038458 | Toivola et al. | Feb 2013 | A1 |
20130108071 | Huang et al. | May 2013 | A1 |
20130124204 | Wong | May 2013 | A1 |
20130148818 | Yamkovoy | Jun 2013 | A1 |
20130208923 | Suvanto | Aug 2013 | A1 |
20130279705 | Wong | Oct 2013 | A1 |
20130279715 | Tan | Oct 2013 | A1 |
20130316642 | Newham | Nov 2013 | A1 |
20130322424 | Fraser | Dec 2013 | A1 |
20130339859 | Hardi | Dec 2013 | A1 |
20140126735 | Gauger, Jr. | May 2014 | A1 |
20140133669 | Klinghult | May 2014 | A1 |
20140143343 | Edholm | May 2014 | A1 |
20140185828 | Helbling | Jul 2014 | A1 |
20140198778 | Fraser | Jul 2014 | A1 |
20140269425 | Fisher et al. | Sep 2014 | A1 |
20140270228 | Oishi | Sep 2014 | A1 |
20150117659 | Kirsch | Apr 2015 | A1 |
20150249898 | Horbach | Sep 2015 | A1 |
20150287422 | Short | Oct 2015 | A1 |
20150294662 | Ibrahim | Oct 2015 | A1 |
20160125869 | Kulavik | May 2016 | A1 |
20160150575 | Andersen et al. | May 2016 | A1 |
20160165336 | Di Censo et al. | Jun 2016 | A1 |
20160241947 | Degraye et al. | Aug 2016 | A1 |
20170142511 | Dennis | May 2017 | A1 |
Number | Date | Country |
---|---|---|
2528177 | Dec 2002 | CN |
101142797 | Mar 2008 | CN |
101640552 | Feb 2010 | CN |
102893331 | Jan 2013 | CN |
103414982 | Nov 2013 | CN |
103686516 | Mar 2014 | CN |
104053253 | Sep 2014 | CN |
2003-023479 | Jan 2003 | JP |
2009-135960 | Jun 2009 | JP |
2012-039624 | Feb 2012 | JP |
2012-524917 | Oct 2012 | JP |
10-0782083 | Dec 2007 | KR |
10-2009-0103953 | Oct 2009 | KR |
200937196 | Sep 2009 | TW |
2008130328 | Oct 2008 | WO |
2015134333 | Sep 2015 | WO |
2016209295 | Dec 2016 | WO |
Entry |
---|
European Patent Application No. 15873755.1, Partial Search Report dated Aug. 8, 2018. |
European Patent Application No. 15873755.1, Search Report dated Jan. 2, 2019. |
International Application No. PCT/US2015/000164, International Search Report dated Apr. 22, 2016. |
International Application No. PCT/US2015/000164, Written Opinion dated Apr. 22, 2016. |
Japanese Patent Application No. 2017-552787, Search Report dated Feb. 3, 2020, with English translation, 11 pages. |
Canadian Patent Application No. 2,971,147, Search Report dated Apr. 6, 2020, 3 pages. |
Sahidullah et al, “Comparison of Speech Activity Detection Techniques for Speaker Recognition”, Oct. 1, 2012 (retrieved from https://arxiv.org/pdf/1210.0297.pdf, May 21, 2019) (7 pages). |
Number | Date | Country | |
---|---|---|---|
20190251955 A1 | Aug 2019 | US |
Number | Date | Country | |
---|---|---|---|
62595627 | Dec 2017 | US |