Hearing devices may determine whether voice (or speech) is included in an audio signal, e.g. by applying a voice activity detector. However, voice often originates from wanted and unwanted sources at the same time, making it difficult to distinguish between the wanted and unwanted voice signals and to attenuate the unwanted voice signal. Accordingly, it is desirable to be able to attenuate voice from unwanted sources while enhancing voice from wanted sources.
The present application relates to hearing devices, e.g. hearing aids or headsets, in particular to noise reduction in hearing devices. The disclosure specifically relates to applications wherein a good (high quality) estimate of the voice of the user wearing the hearing device (or hearing devices) is needed, e.g. for transmission to another device, e.g. to a far-end communication partner or listener, and/or to a voice interface, e.g. for voice-control of the hearing device (or other devices or systems).
In an aspect of the present application, a hearing device is disclosed. The hearing device may be adapted for being located at or in an ear of a user, or for being fully or partially implanted in the head of a user.
The hearing device may comprise an input unit for providing at least one electric input signal representing sound in an environment of the user. The environment may refer to the space surrounding the user, whether the user is standing still or moving around, which contains audio (e.g. sound) that arrives at the location of the user. For example, an environment may refer to a closed room in which the user is located, or to the open space surrounding the user if the user is located outdoors, e.g. outside a building.
The electric input signal may comprise a target speech signal from a target sound source and additional signal components, termed noise signal components, from one or more other sound sources. The target sound source may refer to one or more sound sources, such as one or more persons (e.g. the user of the hearing device and/or other persons) or electronic devices (e.g. a TV, a radio, etc.), which generate and/or emit speech signals that the user wants to hear. The one or more other sound sources may, for example, refer to one or more persons, electronic devices, or other sources (e.g. instruments, animals, etc.), which generate and/or emit additional signal components, termed noise signal components, that are considered unwanted by the user and, preferably, should be attenuated.
The hearing device may comprise a noise reduction system for providing an estimate of said target speech signal.
The noise signal components may be at least partially attenuated.
The hearing device may comprise an own voice detector for repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech originating from the voice of the user.
The hearing device may be further configured to provide that said noise signal components are identified during time segments wherein the own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user, or originates from the voice of the user with a probability above an own voice presence probability (OVPP) threshold value.
Thereby, noise signal components, which may also comprise voice from unwanted sound sources, may be detected during time intervals where the own voice detector estimates that the user is speaking, e.g. instead of or in addition to during time intervals of no voice activity (as is customary in the art). Thus, the noise signal components for being attenuated may be updated also while the user is speaking. For example, if a person is speaking during the same time segment as the user is speaking, sound from the person may be identified and labelled as noise, which should be attenuated.
Further, identifying noise signal components by use of own voice detection eliminates the need for additional detectors (e.g. a camera) dedicated to identifying, e.g. by image analysis, whether a specific person is the source of unwanted noise because he/she is speaking in the same time segment as the user.
Accordingly, improved noise reduction may be achieved.
The input unit may comprise a microphone. The input unit may comprise at least two microphones. The input unit may comprise three or more microphones.
Each microphone may provide an electric input signal. The electric input signal may comprise said target speech signal and said noise signal components.
The hearing device may comprise a voice activity detector for repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech.
Thereby, speech included in the at least one electric input signal may be enhanced.
The hearing device may comprise one or more beamformers, e.g. arranged in a beamformer filter. For example, the beamformer filter may comprise two or more beamformers.
The input unit may be configured to provide at least two electric input signals connected to the one or more beamformers. The one or more beamformers may be configured to provide at least one beamformed signal.
The one or more beamformers may comprise one or more own voice cancelling beamformers configured to attenuate signal components originating from the user's mouth, while signal components from (e.g. all) other directions are left unchanged or attenuated less.
The one or more beamformers may comprise one or more target beamformers for enhancing the voice of the target sound source (relative to sounds from other directions than a direction to the target sound source).
The target signal may be assumed to be the user's own voice.
The one or more beamformers may comprise an own voice beamformer configured to maintain signal components from the user's mouth while attenuating signal components from (e.g. all) other directions. The own voice beamformer may be determined in advance of operation of the hearing device (e.g. during a fitting procedure), and corresponding filter weights may e.g. be stored in a memory of the hearing device. Acoustic transfer functions from the user's mouth to each microphone of the hearing device (or devices) may e.g. be determined in advance of operation of the hearing device, either using a model (e.g. a head and torso model, e.g. HATS, Head and Torso Simulator 4128C from Brüel & Kjær Sound & Vibration Measurement A/S), or by measuring on one or more persons, e.g. including the user. Absolute or relative acoustic transfer functions may be represented by a look vector d = (d1, ..., dM), where each element represents an (absolute or relative) transfer function from the mouth to a specific one of the M microphones. One of the microphones may be defined as a reference microphone, and relative transfer functions may be defined from the reference microphone to the rest of the microphones of the hearing device (or hearing system). Own voice filter weights WOV may be determined in advance of or during operation of the hearing device. The own voice filter weights are a function of the look vector dOV(k), a noise covariance matrix estimate Ĉv(k,n) and an inter-microphone covariance matrix Cx(k,n) for the noisy microphone signals, where k and n are frequency and time indices, respectively. The calculation of the filter weights for a given type of beamformer (e.g. an MVDR beamformer) is customary in the art and is e.g. exemplified in the detailed description of embodiments of the present disclosure.
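By way of illustration, the MVDR weight calculation may be sketched as follows in Python/numpy. This is a minimal sketch for a single frequency bin, assuming a known (relative) look vector and noise covariance estimate; all names and numbers are illustrative, not part of the disclosure.

```python
import numpy as np

def mvdr_weights(d, Cv):
    """MVDR weights w = Cv^{-1} d / (d^H Cv^{-1} d) for one frequency bin k
    (time index n omitted for brevity)."""
    Cv_inv_d = np.linalg.solve(Cv, d)        # Cv^{-1} d without an explicit inverse
    return Cv_inv_d / (d.conj() @ Cv_inv_d)  # distortionless towards the look vector

# Illustrative M=2 example with hypothetical numbers:
d_ov = np.array([1.0, 0.8 * np.exp(-1j * 0.3)])  # relative transfer functions
Cv = np.array([[1.0, 0.2 + 0.1j],
               [0.2 - 0.1j, 1.2]])               # Hermitian noise CPSD matrix
w_ov = mvdr_weights(d_ov, Cv)
# Own-voice estimate for the bin: S_hat = w_ov.conj() @ x, with x the microphone vector
```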
The beamformer may comprise a minimum variance distortionless response (MVDR) beamformer.
The beamformer may comprise a multichannel Wiener filter (MWF) beamformer.
The beamformer may comprise an MVDR beamformer and an MWF beamformer.
The beamformer may comprise an MVDR filter followed by a single-channel post filter.
For example, the beamformer may comprise an MVDR beamformer and a single-channel Wiener post filter.
An advantage of using an MVDR filter is that it does not distort target components.
An advantage of using an MWF filter is that it maximizes broadband signal-to-noise ratio (SNR).
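A single-channel Wiener post filter may, e.g., be realized as a per-bin gain applied to the MVDR output. The following is a minimal sketch under the common a-priori-SNR formulation; the PSD inputs, the gain floor and all names are assumptions for illustration only.

```python
def wiener_postfilter_gain(S_psd, V_psd, g_min=0.1):
    """Wiener gain g = xi / (1 + xi), xi = S_psd / V_psd, for one
    time-frequency bin; g_min limits the maximum attenuation (a common
    constraint in hearing devices to avoid audible artifacts)."""
    xi = max(S_psd, 1e-12) / max(V_psd, 1e-12)  # estimated SNR at the MVDR output
    return max(xi / (1.0 + xi), g_min)

# Per bin: S_hat_mwf = wiener_postfilter_gain(S_psd, V_psd) * (w.conj() @ x)
```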
The noise signal components may be represented by a noise covariance matrix estimate.
The noise covariance matrix may be based on cross power spectral densities (CPSDs) of the noise signal components.
Thereby, a compact (mathematically tractable) description of a noise field is provided.
The hearing device may comprise a beamformer filter comprising a number of beamformers.
The noise covariance matrix may be updated when said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user.
The noise covariance matrix may be updated when said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user with a probability above said OVPP-threshold value.
Thereby, voice from a (competing) speaker (unwanted speech) not being of (current) interest to the user and/or disturbing the speech of the user may be attenuated.
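The gated, recursive update of the noise covariance matrix may be sketched as follows (one frequency bin). The thresholds and the smoothing factor lam are illustrative assumptions; in practice, while the user speaks, the observation would typically be taken after an own voice cancelling beamformer, so that the user's own voice does not enter the noise estimate.

```python
import numpy as np

def update_noise_cov(Cv, x, ovpp, spp, ovpp_thr=0.6, spp_thr=0.5, lam=0.95):
    """Exponentially smoothed update of the noise CPSD matrix for one bin.

    Cv   : (M, M) current noise covariance estimate
    x    : (M,)  noise observation for this bin
    ovpp : own voice presence probability from the own voice detector
    spp  : speech presence probability from the general VAD
    """
    if spp < spp_thr or ovpp > ovpp_thr:
        # Update in speech pauses AND while the user speaks: any source
        # active during the user's own speech (e.g. a competing speaker)
        # is treated as noise.
        Cv = lam * Cv + (1.0 - lam) * np.outer(x, x.conj())
    return Cv
```

With lam close to 1, past noise observations are forgotten only slowly, which corresponds to the recursive update and gradually decreasing attenuation described further below.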
The noise signal components may additionally be identified during time segments wherein said voice activity detector indicates an absence of speech in the at least one electric input signal, or a signal derived therefrom.
The noise signal components may be identified during time segments wherein said voice activity detector indicates no speech, or a presence of speech with a probability below a speech presence probability (SPP) threshold value.
The hearing device may be configured to estimate said noise signal components using a Maximum Likelihood estimator.
Thereby, the noise covariance matrix estimate that best “explains” (has maximum likelihood) the observed microphone signals is provided.
The target speech signal from the target sound source may comprise (or constitute) an own voice speech signal from the hearing device user.
The target sound source may comprise (or constitute) an external speaker in the environment of the hearing device user.
The hearing device may comprise a voice interface for voice-control of the hearing device or other devices or systems.
The input to the voice interface may e.g. be based on an estimate of the user's own voice provided by an own voice beamformer configured to maintain signal components from the user's mouth while attenuating signal components from (e.g. all) other directions. The hearing device may comprise a wake-word detector based on the estimate of the user's voice. The hearing device may be configured to activate the voice interface on detection of a wake-word (e.g. with a probability above a wake-word detection threshold, e.g. larger than 60%).
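The wake-word gating may, e.g., be sketched as below; the 60% threshold follows the example above, while the device object and its method are hypothetical names used for illustration only.

```python
WAKE_WORD_THR = 0.6  # e.g. 'larger than 60%', as exemplified above

def on_wake_word_probability(p_wake, device):
    # Activate the voice interface only when the wake-word detector,
    # fed with the own-voice beamformer output, is sufficiently confident.
    if p_wake > WAKE_WORD_THR:
        device.activate_voice_interface()  # hypothetical device API
```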
The voice-interface may be incorporated in the part of the hearing device that is arranged at, behind or in the ear of the user. The hearing device may comprise one or more ‘auxiliary devices’, which communicate with the hearing device(s) and affect and/or benefit from the function of the hearing device(s). Auxiliary devices may be e.g. remote controls, audio gateway devices, mobile phones (e.g. smartphones), or music players. In such a case, the one or more auxiliary devices may comprise the voice-interface.
By providing a hearing device that comprises a voice interface, seamless control of the functioning of the hearing device is provided.
The hearing device may be constituted by or comprise a hearing aid, a headset, an active ear protection device or a combination thereof.
The hearing device may comprise a headset. The hearing device may comprise a hearing aid. The hearing device may e.g. comprise antenna and transceiver circuitry configured to establish a communication link to another device or system. The hearing device may e.g. be used to implement handsfree telephony.
The hearing device may further comprise a timer configured to determine a time segment of overlap between the own voice speech signal and a further speech signal.
A further speech signal may refer to a speech signal generated by a person or by a device configured to generate speech, e.g. a radio or a TV.
The timer may be associated with the own voice detector. In case the target speech signal comprises speech from the hearing device user, the timer may be initiated when a further speech signal is detected at time segments where the own voice detector detects a speech signal from the user. The timer may be ended when the own voice detector does not detect a speech signal from the user. Accordingly, an unwanted speech signal may be identified and be attenuated.
The hearing device may be configured to determine whether said time segment exceeds a time limit, and if so to label the further speech signal as part of the noise signal component.
For example, the time limit may be at least ½ second, at least 1 second, at least 2 seconds.
The further speech signal may be speech from a competing speaker, and may as such be considered to be noise to the target speech signal. Accordingly, the further speech signal may be labelled as being part of the noise signal components so that the further speech signal may be attenuated.
The hearing device may be configured to label the further speech signal as being part of the noise signal components for a predetermined time segment. Hereafter, the further speech signal may no longer be labelled as being part of the noise signal components. For example, a voice signal from a person may be attenuated when the person is not part of a conversation with the hearing device user, but may not be attenuated at a later time, when the person is engaging in a conversation with the hearing device user.
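The timer and labelling logic of the preceding paragraphs may be sketched as follows. The overlap limit and label duration are illustrative values (cf. the examples of at least ½, 1 or 2 seconds above); the class and its names are assumptions, not part of the disclosure.

```python
class CompetingSpeakerLabeller:
    """Label a further speech signal as noise when it overlaps the user's
    own voice for longer than overlap_limit_s; drop the label again after
    label_duration_s, so a speaker who later joins a conversation with
    the user is no longer attenuated."""

    def __init__(self, overlap_limit_s=1.0, label_duration_s=10.0):
        self.overlap_limit_s = overlap_limit_s
        self.label_duration_s = label_duration_s
        self.overlap_s = 0.0
        self.labelled_until = -1.0

    def step(self, t, dt, own_voice, further_speech):
        if own_voice and further_speech:
            self.overlap_s += dt     # timer runs while both are speaking
        elif not own_voice:
            self.overlap_s = 0.0     # timer ends when the user stops speaking
        if self.overlap_s > self.overlap_limit_s:
            self.labelled_until = t + self.label_duration_s
        return t < self.labelled_until  # True: treat the further speech as noise
```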
The noise reduction system may be updated recursively. The noise signal components may be identified recursively. Accordingly, a recursive update of the noise covariance matrix may be provided. For example, a voice signal from a sound source, which at one time has been identified and labelled as being part of the noise signal components, may over time be attenuated to a continuously decreasing degree. At some point, the sound source may be exempted from attenuation unless it is once again identified and labelled as being part of the noise signal components.
The hearing device may be adapted to provide a frequency dependent gain and/or a level dependent compression and/or a transposition (with or without frequency compression) of one or more frequency ranges to one or more other frequency ranges, e.g. to compensate for a hearing impairment of a user. The hearing device may comprise a signal processor for enhancing the input signals and providing a processed output signal.
The hearing device may comprise an output unit for providing a stimulus perceived by the user as an acoustic signal based on a processed electric signal. The output unit may comprise a number of electrodes of a cochlear implant (for a CI type hearing device) or a vibrator of a bone conducting hearing device. The output unit may comprise an output transducer. The output transducer may comprise a receiver (loudspeaker) for providing the stimulus as an acoustic signal to the user (e.g. in an acoustic (air conduction based) hearing device). The output transducer may comprise a vibrator for providing the stimulus as mechanical vibration of a skull bone to the user (e.g. in a bone-attached or bone-anchored hearing device). The output unit may comprise a wireless transmitter for transmitting wireless signals comprising or representing sound to another device.
The hearing device comprises an input unit for providing one or more electric input signals representing sound. The input unit may comprise an input transducer, e.g. a microphone, for converting an input sound to an electric input signal. The input unit may comprise a wireless receiver for receiving a wireless signal comprising or representing sound and for providing an electric input signal representing said sound.
The wireless receiver and/or transmitter (e.g. a transceiver) may e.g. be configured to receive and/or transmit an electromagnetic signal in the radio frequency range (3 kHz to 300 GHz). The wireless receiver and/or transmitter may e.g. be configured to receive and/or transmit an electromagnetic signal in a frequency range of light (e.g. infrared light 300 GHz to 430 THz, or visible light, e.g. 430 THz to 770 THz).
The hearing device may comprise antenna and transceiver circuitry (e.g. a wireless receiver) for wirelessly receiving and/or transmitting a signal from/to another device, e.g. from/to an entertainment device (e.g. a TV-set), a communication device (e.g. a smartphone), a wireless microphone, a PC, or another hearing device. The signal may represent or comprise an audio signal and/or a control signal and/or an information signal. The hearing device may comprise appropriate modulation/demodulation circuitry for modulating/demodulating the transmitted/received signal. The signal may represent an audio signal and/or a control signal, e.g. for setting an operational parameter (e.g. volume) and/or a processing parameter of the hearing device and/or a voice control command, etc. In general, a wireless link established by antenna and transceiver circuitry of the hearing device can be of any type. The wireless link may be established between two devices, e.g. between an entertainment device (e.g. a TV) or a communication device (e.g. a smartphone) and the hearing device, or between two hearing devices, e.g. via a third, intermediate device (e.g. a processing device, such as a remote control device, a smartphone, etc.). The wireless link may be a link based on near-field communication, e.g. an inductive link based on an inductive coupling between antenna coils of transmitter and receiver parts. The wireless link may be based on far-field, electromagnetic radiation. The communication via the wireless link may be arranged according to a specific modulation scheme, e.g. an analogue modulation scheme, such as FM (frequency modulation) or AM (amplitude modulation) or PM (phase modulation), or a digital modulation scheme, such as ASK (amplitude shift keying), e.g. On-Off keying, FSK (frequency shift keying), PSK (phase shift keying), e.g. MSK (minimum shift keying), or QAM (quadrature amplitude modulation), etc.
The communication between the hearing device and the other device may be in the base band (audio frequency range, e.g. between 0 and 20 kHz). Communication between the hearing device and the other device may be based on some sort of modulation at frequencies above 100 kHz. Preferably, frequencies used to establish a communication link between the hearing device and the other device are below 70 GHz, e.g. located in a range from 50 MHz to 70 GHz, e.g. above 300 MHz, e.g. in an ISM range above 300 MHz, e.g. in the 900 MHz range or in the 2.4 GHz range or in the 5.8 GHz range or in the 60 GHz range (ISM=Industrial, Scientific and Medical, such standardized ranges being e.g. defined by the International Telecommunication Union, ITU). The wireless link may be based on a standardized or proprietary technology. The wireless link may be based on Bluetooth technology (e.g. Bluetooth Low-Energy technology).
The hearing device may have a maximum outer dimension of the order of 0.08 m (e.g. a head set). The hearing device may have a maximum outer dimension of the order of 0.04 m (e.g. a hearing instrument).
The hearing device may comprise a directional microphone system adapted to spatially filter sounds from the environment, and thereby enhance a target acoustic source among a multitude of acoustic sources in the local environment of the user wearing the hearing device. The directional system may be adapted to detect (such as adaptively detect) from which direction a particular part of the microphone signal originates. This can be achieved in various different ways as e.g. described in the prior art. In hearing devices, a microphone array beamformer is often used for spatially attenuating background noise sources. Many beamformer variants can be found in literature. The minimum variance distortionless response (MVDR) beamformer is widely used in microphone array signal processing. Ideally, the MVDR beamformer keeps the signals from the target direction (also referred to as the look direction) unchanged, while attenuating sound signals from other directions maximally. The generalized sidelobe canceller (GSC) structure is an equivalent representation of the MVDR beamformer offering computational and numerical advantages over a direct implementation in its original form.
The hearing device may be or form part of a portable (i.e. configured to be wearable) device, e.g. a device comprising a local energy source, e.g. a battery, e.g. a rechargeable battery. The hearing device may e.g. be a low weight, easily wearable, device, e.g. having a total weight less than 100 g, e.g. less than 20 g, e.g. less than 10 g.
The hearing device may comprise a forward or signal path between an input unit (e.g. an input transducer, such as a microphone or a microphone system, and/or a direct electric input (e.g. a wireless receiver)) and an output unit, e.g. an output transducer. The signal processor may be located in the forward path. The signal processor may be adapted to provide a frequency dependent gain according to a user's particular needs. The hearing device may comprise an analysis path comprising functional components for analyzing the input signal (e.g. determining a level, a modulation, a type of signal, an acoustic feedback estimate, etc.). Some or all signal processing of the analysis path and/or the signal path may be conducted in the frequency domain. Some or all signal processing of the analysis path and/or the signal path may be conducted in the time domain.
An analogue electric signal representing an acoustic signal may be converted to a digital audio signal in an analogue-to-digital (AD) conversion process, where the analogue signal is sampled with a predefined sampling frequency or rate fs, fs being e.g. in the range from 8 kHz to 48 kHz (adapted to the particular needs of the application), to provide digital samples xn (or x[n]) at discrete points in time tn (or n), each audio sample representing the value of the acoustic signal at tn by a predefined number Nb of bits, Nb being e.g. in the range from 1 to 48 bits, e.g. 24 bits. Each audio sample is hence quantized using Nb bits (resulting in 2^Nb different possible values of the audio sample). A digital sample x has a length in time of 1/fs, e.g. 50 μs for fs=20 kHz. A number of audio samples may be arranged in a time frame. A time frame may comprise 64 or 128 audio data samples. Other frame lengths may be used depending on the practical application.
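As a worked example of the numbers above (fs, Nb and the frame length are just the example values from the preceding paragraph):

```python
fs = 20_000     # sampling rate [Hz]
Nb = 24         # bits per audio sample
frame_len = 64  # audio samples per time frame

sample_period_us = 1e6 / fs               # 50.0 us per sample at 20 kHz
frame_duration_ms = 1e3 * frame_len / fs  # 3.2 ms per 64-sample frame
levels = 2 ** Nb                          # 16_777_216 possible sample values
```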
The hearing device may comprise an analogue-to-digital (AD) converter to digitize an analogue input (e.g. from an input transducer, such as a microphone) with a predefined sampling rate, e.g. 20 kHz. The hearing devices may comprise a digital-to-analogue (DA) converter to convert a digital signal to an analogue output signal, e.g. for being presented to a user via an output transducer.
The hearing device, e.g. the input unit, and/or the antenna and transceiver circuitry may comprise a TF-conversion unit for providing a time-frequency representation of an input signal. The time-frequency representation may comprise an array or map of corresponding complex or real values of the signal in question in a particular time and frequency range. The TF conversion unit may comprise a filter bank for filtering a (time varying) input signal and providing a number of (time varying) output signals each comprising a distinct frequency range of the input signal. The TF conversion unit may comprise a Fourier transformation unit for converting a time variant input signal to a (time variant) signal in the (time-)frequency domain. The frequency range considered by the hearing device from a minimum frequency fmin to a maximum frequency fmax may comprise a part of the typical human audible frequency range from 20 Hz to 20 kHz, e.g. a part of the range from 20 Hz to 12 kHz. Typically, a sample rate fs is larger than or equal to twice the maximum frequency fmax, fs≥2fmax. A signal of the forward and/or analysis path of the hearing device may be split into a number NI of frequency bands (e.g. of uniform width), where NI is e.g. larger than 5, such as larger than 10, such as larger than 50, such as larger than 100, such as larger than 500, at least some of which are processed individually. The hearing device may be adapted to process a signal of the forward and/or analysis path in a number NP of different frequency channels (NP≤NI). The frequency channels may be uniform or non-uniform in width (e.g. increasing in width with frequency), overlapping or non-overlapping.
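A time-frequency representation X(k,n), as used throughout (e.g. the signals S1(k,n), ..., SM(k,n)), may, e.g., be obtained with a short-time Fourier transform. A minimal sketch using scipy, with illustrative parameters:

```python
import numpy as np
from scipy.signal import stft

fs = 20_000              # sampling rate [Hz]
x = np.random.randn(fs)  # 1 s of placeholder microphone signal

# Analysis filter bank via the STFT: X[k, n] is the complex value of the
# signal in frequency bin k and time frame n.
f, t, X = stft(x, fs=fs, nperseg=128, noverlap=64)
print(X.shape)           # (65 frequency bins, number of time frames)
```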
The hearing device may be configured to operate in different modes, e.g. a normal mode and one or more specific modes, e.g. selectable by a user, or automatically selectable. A mode of operation may be optimized to a specific acoustic situation or environment. A mode of operation may include a low-power mode, where functionality of the hearing device is reduced (e.g. to save power), e.g. to disable wireless communication, and/or to disable specific features of the hearing device. A mode of operation may be a voice control mode, where a voice interface is activated, e.g. via a specific wake-word (or words), e.g. ‘Hey Oticon’. A mode of operation may be a communication mode, where the hearing device is configured to pick up the user's voice and transmit it to another device (and possibly to receive audio from another device, e.g. to enable handsfree telephony).
The hearing device may comprise a number of detectors configured to provide status signals relating to a current physical environment of the hearing device (e.g. the current acoustic environment), and/or to a current state of the user wearing the hearing device, and/or to a current state or mode of operation of the hearing device. Alternatively or additionally, one or more detectors may form part of an external device in communication (e.g. wirelessly) with the hearing device. An external device may e.g. comprise another hearing device, a remote control, an audio delivery device, a telephone (e.g. a smartphone), an external sensor, etc.
One or more of the number of detectors may operate on the full band signal (time domain). One or more of the number of detectors may operate on band split signals ((time-) frequency domain), e.g. in a limited number of frequency bands.
The number of detectors may comprise a level detector for estimating a current level of a signal of the forward path. A predefined criterion may comprise whether the current level of a signal of the forward path is above or below a given (L-)threshold value. The level detector may operate on the full band signal (time domain). The level detector may operate on band split signals ((time-) frequency domain).
The hearing device may comprise a voice detector (VD) for estimating whether or not (or with what probability) an input signal comprises a voice signal (at a given point in time). A voice signal is in the present context taken to include a speech signal from a human being. It may also include other forms of utterances generated by the human speech system (e.g. singing). The voice detector unit may be adapted to classify a current acoustic environment of the user as a VOICE or NO-VOICE environment. This has the advantage that time segments of the electric microphone signal comprising human utterances (e.g. speech) in the user's environment can be identified, and thus separated from time segments only (or mainly) comprising other sound sources (e.g. artificially generated noise). The voice detector may be adapted to detect as a VOICE also the user's own voice. Alternatively, the voice detector is adapted to exclude a user's own voice from the detection of a VOICE.
The hearing device may comprise an own voice detector for estimating whether or not (or with what probability) a given input sound (e.g. a voice, e.g. speech) originates from the voice of the user of the system. A microphone system of the hearing device may be adapted to be able to differentiate between a user's own voice and another person's voice and possibly from NON-voice sounds.
The number of detectors may comprise a movement detector, e.g. an acceleration sensor. The movement detector may be configured to detect movement of the user's facial muscles and/or bones, e.g. due to speech or chewing (e.g. jaw movement) and to provide a detector signal indicative thereof.
The hearing device may comprise a classification unit configured to classify the current situation based on input signals from (at least some of) the detectors, and possibly other inputs as well. In the present context, 'a current situation' may be defined by one or more of the current physical environment (e.g. the current acoustic environment), the current state of the user wearing the hearing device, and the current state or mode of operation of the hearing device.
The classification unit may be based on or comprise a neural network, e.g. a trained neural network.
The hearing device may further comprise other relevant functionality for the application in question, e.g. compression, feedback control, etc.
The hearing device may comprise a listening device, e.g. a hearing aid, e.g. a hearing instrument, e.g. a hearing instrument adapted for being located at the ear or fully or partially in the ear canal of a user, e.g. a headset, an earphone, an ear protection device or a combination thereof. A hearing system may comprise a speakerphone (comprising a number of input transducers and a number of output transducers, e.g. for use in an audio conference situation), e.g. comprising a beamformer filtering unit, e.g. providing multiple beamforming capabilities.
In an aspect of the present application, a binaural hearing system comprising a first hearing device and an auxiliary device is disclosed. The binaural hearing system may be configured to allow an exchange of data between the first hearing device and the auxiliary device.
In an aspect of the present application, a binaural hearing system comprising a first and a second hearing device is disclosed. The binaural hearing system may be configured to allow an exchange of data between the first and the second hearing devices, e.g. via an intermediate auxiliary device.
In an aspect, use of a hearing device as described above, in the ‘detailed description of embodiments’ and in the claims, is moreover provided. Use may be provided in a system comprising one or more hearing aids (e.g. hearing instruments), headsets, ear phones, active ear protection systems, etc., e.g. in handsfree telephone systems, teleconferencing systems (e.g. including a speakerphone), public address systems, karaoke systems, classroom amplification systems, etc.
In an aspect, a method of operating a hearing device is provided.
The hearing device may be adapted for being located at or in an ear of a user, or for being fully or partially implanted in the head of a user.
The method may comprise providing at least one electric input signal representing sound in an environment of the user.
The electric input signal may comprise a target speech signal from a target sound source and additional signal components, termed noise signal components, from one or more other sound sources.
The method may comprise providing an estimate of said target speech signal.
The noise signal components may be at least partially attenuated.
The method may comprise repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech originating from the voice of the user.
The method may further comprise identifying said noise signal components during time segments wherein an own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user, or originates from the voice of the user with a probability above an own voice presence probability (OVPP) threshold value.
It is intended that some or all of the structural features of the device described above, in the ‘detailed description of embodiments’ or in the claims can be combined with embodiments of the method, when appropriately substituted by a corresponding process and vice versa. Embodiments of the method have the same advantages as the corresponding devices.
In an aspect, a tangible computer-readable medium (a data carrier) storing a computer program comprising program code means (instructions) for causing a data processing system (a computer) to perform (carry out) at least some (such as a majority or all) of the (steps of the) method described above, in the ‘detailed description of embodiments’ and in the claims, when said computer program is executed on the data processing system is furthermore provided by the present application.
By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Other storage media include storage in DNA (e.g. in synthesized DNA strands). Combinations of the above should also be included within the scope of computer-readable media. In addition to being stored on a tangible medium, the computer program can also be transmitted via a transmission medium such as a wired or wireless link or a network, e.g. the Internet, and loaded into a data processing system for being executed at a location different from that of the tangible medium.
The method step of providing an estimate of said target speech signal, wherein said noise signal components are at least partially attenuated may be implemented in software.
The method step of repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech originating from the voice of the user may be implemented in software.
The method step of identifying said noise signal components during time segments wherein said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user, or originates from the voice of the user with a probability above an own voice presence probability (OVPP) threshold value may be implemented in software.
A computer program (product) comprising instructions which, when the program is executed by a computer, cause the computer to carry out (steps of) the method described above, in the ‘detailed description of embodiments’ and in the claims is furthermore provided by the present application.
In an aspect, a data processing system comprising a processor and program code means for causing the processor to perform at least some (such as a majority or all) of the steps of the method described above, in the ‘detailed description of embodiments’ and in the claims is furthermore provided by the present application.
In a further aspect, a hearing system comprising a hearing device as described above, in the ‘detailed description of embodiments’, and in the claims, AND an auxiliary device is moreover provided.
The hearing system may be adapted to establish a communication link between the hearing device and the auxiliary device to provide that information (e.g. control and status signals, possibly audio signals) can be exchanged or forwarded from one to the other.
The auxiliary device may comprise a remote control, a smartphone, or other portable or wearable electronic device, such as a smartwatch or the like.
The auxiliary device may constitute or comprise a remote control for controlling functionality and operation of the hearing device(s). The function of a remote control may be implemented in a smartphone, the smartphone possibly running an APP allowing the user to control the functionality of the audio processing device via the smartphone (the hearing device(s) comprising an appropriate wireless interface to the smartphone, e.g. based on Bluetooth or some other standardized or proprietary scheme).
The auxiliary device may be or comprise an audio gateway device adapted for receiving a multitude of audio signals (e.g. from an entertainment device, e.g. a TV or a music player, a telephone apparatus, e.g. a mobile telephone or a computer, e.g. a PC) and adapted for selecting and/or combining an appropriate one of the received audio signals (or combination of signals) for transmission to the hearing device.
The auxiliary device may be constituted by or comprise another hearing device. The hearing system may comprise two hearing devices adapted to implement a binaural hearing system, e.g. a binaural hearing aid system.
In a further aspect, a non-transitory application, termed an APP, is furthermore provided by the present disclosure. The APP comprises executable instructions configured to be executed on an auxiliary device to implement a user interface for a hearing device or a hearing system described above in the ‘detailed description of embodiments’, and in the claims. The APP may be configured to run on a cellular phone, e.g. a smartphone, or on another portable device allowing communication with said hearing device or said hearing system.
In the present context, a ‘hearing device’ refers to a device, such as a hearing aid, e.g. a hearing instrument, or an active ear-protection device, or other audio processing device, which is adapted to improve, augment and/or protect the hearing capability of a user by receiving acoustic signals from the user's surroundings, generating corresponding audio signals, possibly modifying the audio signals and providing the possibly modified audio signals as audible signals to at least one of the user's ears. A ‘hearing device’ further refers to a device such as an earphone or a headset adapted to receive audio signals electronically, possibly modifying the audio signals and providing the possibly modified audio signals as audible signals to at least one of the user's ears. Such audible signals may e.g. be provided in the form of acoustic signals radiated into the user's outer ears, acoustic signals transferred as mechanical vibrations to the user's inner ears through the bone structure of the user's head and/or through parts of the middle ear as well as electric signals transferred directly or indirectly to the cochlear nerve of the user.
The hearing device may be configured to be worn in any known way, e.g. as a unit arranged behind the ear with a tube leading radiated acoustic signals into the ear canal or with an output transducer, e.g. a loudspeaker, arranged close to or in the ear canal, as a unit entirely or partly arranged in the pinna and/or in the ear canal, as a unit, e.g. a vibrator, attached to a fixture implanted into the skull bone, as an attachable, or entirely or partly implanted, unit, etc. The hearing device may comprise a single unit or several units communicating electronically with each other. The loudspeaker may be arranged in a housing together with other components of the hearing device, or may be an external unit in itself (possibly in combination with a flexible guiding element, e.g. a dome-like element). The hearing device may be implemented in one single unit (housing) or in a number of units individually connected to each other.
More generally, a hearing device comprises an input transducer for receiving an acoustic signal from a user's surroundings and providing a corresponding input audio signal and/or a receiver for electronically (i.e. wired or wirelessly) receiving an input audio signal, a (typically configurable) signal processing circuit (e.g. a signal processor, e.g. comprising a configurable (programmable) processor, e.g. a digital signal processor) for processing the input audio signal and an output unit for providing an audible signal to the user in dependence on the processed audio signal. The signal processor may be adapted to process the input signal in the time domain or in a number of frequency bands. In some hearing devices, an amplifier and/or compressor may constitute the signal processing circuit. The signal processing circuit typically comprises one or more (integrated or separate) memory elements for executing programs and/or for storing parameters used (or potentially used) in the processing and/or for storing information relevant for the function of the hearing device and/or for storing information (e.g. processed information, e.g. provided by the signal processing circuit), e.g. for use in connection with an interface to a user and/or an interface to a programming device. In some hearing devices, the output unit may comprise an output transducer, such as e.g. a loudspeaker for providing an air-borne acoustic signal or a vibrator for providing a structure-borne or liquid-borne acoustic signal. In some hearing devices, the output unit may comprise one or more output electrodes for providing electric signals (e.g. a multi-electrode array for electrically stimulating the cochlear nerve). The hearing device may comprise a speakerphone (comprising a number of input transducers and a number of output transducers, e.g. for use in an audio conference situation).
In some hearing devices, the vibrator may be adapted to provide a structure-borne acoustic signal transcutaneously or percutaneously to the skull bone. In some hearing devices, the vibrator may be implanted in the middle ear and/or in the inner ear. In some hearing devices, the vibrator may be adapted to provide a structure-borne acoustic signal to a middle-ear bone and/or to the cochlea. In some hearing devices, the vibrator may be adapted to provide a liquid-borne acoustic signal to the cochlear liquid, e.g. through the oval window. In some hearing devices, the output electrodes may be implanted in the cochlea or on the inside of the skull bone and may be adapted to provide the electric signals to the hair cells of the cochlea, to one or more hearing nerves, to the auditory brainstem, to the auditory midbrain, to the auditory cortex and/or to other parts of the cerebral cortex.
A hearing device, e.g. a hearing aid, may be adapted to a particular user's needs, e.g. a hearing impairment. A configurable signal processing circuit of the hearing device may be adapted to apply a frequency and level dependent compressive amplification of an input signal. A customized frequency and level dependent gain (amplification or compression) may be determined in a fitting process by a fitting system based on a user's hearing data, e.g. an audiogram, using a fitting rationale (e.g. adapted to speech). The frequency and level dependent gain may e.g. be embodied in processing parameters, e.g. uploaded to the hearing device via an interface to a programming device (fitting system), and used by a processing algorithm executed by the configurable signal processing circuit of the hearing device.
A ‘hearing system’ refers to a system comprising one or two hearing devices, and a ‘binaural hearing system’ refers to a system comprising two hearing devices and being adapted to cooperatively provide audible signals to both of the user's ears. Hearing systems or binaural hearing systems may further comprise one or more ‘auxiliary devices’, which communicate with the hearing device(s) and affect and/or benefit from the function of the hearing device(s). Auxiliary devices may be e.g. remote controls, audio gateway devices, mobile phones (e.g. smartphones), or music players. Hearing devices, hearing systems or binaural hearing systems may e.g. be used for compensating for a hearing-impaired person's loss of hearing capability, augmenting or protecting a normal-hearing person's hearing capability and/or conveying electronic audio signals to a person. Hearing devices or hearing systems may e.g. form part of or interact with public-address systems, active ear protection systems, handsfree telephone systems, car audio systems, entertainment (e.g. karaoke) systems, teleconferencing systems, classroom amplification systems, etc.
Embodiments of the disclosure may e.g. be useful in applications wherein a good (high quality) estimate of the voice of the user wearing the hearing device (or hearing devices) is needed.
The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity, and they just show details to improve the understanding of the claims, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may each be combined with any or all features of the other aspects. These and other aspects, features and/or technical effects will be apparent from and elucidated with reference to the illustrations described hereinafter in which:
Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). Depending upon particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, computer program, or any combination thereof.
The electronic hardware may include micro-electronic-mechanical systems (MEMS), integrated circuits (e.g. application specific), microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, printed circuit boards (PCB) (e.g. flexible PCBs), and other suitable hardware configured to perform the various functionality described throughout this disclosure, e.g. sensors, e.g. for sensing and/or registering physical properties of the environment, the device, the user, etc. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
The present application relates to the field of hearing devices, e.g. hearing aids.
Speech enhancement and noise reduction are often needed in real-world audio applications where noise from the acoustic environment masks a desired speech signal, often resulting in reduced speech intelligibility. Examples of audio applications where noise reduction can be beneficial are hands-free wireless communication devices, e.g. headsets, automatic speech recognition systems, and hearing aids (HA). In particular, in applications such as headset communication devices, where a (‘far-end’) human listener needs to understand the noisy own voice picked up by the headset microphones, noise can greatly reduce sound quality and speech intelligibility, making conversations more difficult.
‘Headset applications’ may in the present context include normal headset applications for use in communication with a ‘far end speaker’ e.g. via a network (such as office or call-centre applications) but also hearing aid applications where the hearing aid is in a specific ‘communication or telephone mode’ adapted to pick up a user's voice and transmit it to another device (e.g. a far-end-communication partner), while possibly receiving audio from the other device (e.g. from the far-end-communication partner).
Noise reduction algorithms implemented in multi-microphone devices may comprise a set of linear filters, e.g. spatial filters and temporal filters, that are used to shape the sound picked up by the microphones. Spatial filters are able to alter the sound by enhancing or attenuating sound as a function of direction, while temporal filters alter the frequency response of the noisy signal to enhance or attenuate specific frequencies. To find the optimal filter coefficients, it is usually necessary to know the noise characteristics of the acoustic environment. Unfortunately, these noise characteristics are often unknown and need to be estimated online. Characteristics that are often necessary as inputs to multichannel noise reduction algorithms are e.g. the cross power spectral densities (CPSDs) of the noise. The noise CPSDs are for example needed for the minimum variance distortionless response (MVDR) and multichannel Wiener filter (MWF) beamformers, which are common beamformers implemented in multi-microphone noise reduction systems.
To estimate the noise statistics, researchers have developed a wide variety of estimators, e.g. [1-5]. In [1,4], a maximum likelihood (ML) estimator of the noise CPSD matrix during speech presence is proposed, assuming that the noise CPSD matrix remains identical up to a scalar multiplier. This estimator performs well when the underlying structure of the noise CPSD matrix does not change over time, e.g. for car cabin noise and isotropic noise fields, but may fail otherwise. In many realistic acoustic environments, the underlying structure of the noise CPSD matrix cannot be assumed fixed, for example when a prominent non-stationary interfering noise source is present in the acoustic scene. In particular, when the interference is a competing speaker, many noise reduction systems fail at efficiently suppressing the competing speaker, as it is harder to determine whether the own voice or the competing speaker is the desired speech.
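The scalar-multiplier idea behind [1,4] can be illustrated as follows (a simplified sketch, not the exact estimator proposed there): if the noise in one frequency bin is modelled as x_i ~ CN(0, c·C0) with a fixed structure C0 and unknown scale c, the ML estimate of c from N observations is a simple averaged quadratic form.

```python
import numpy as np

def ml_noise_scale(X, C0):
    """ML estimate of c in the model Cv = c * C0.

    X  : (N, M) complex noise observations for one frequency bin,
         e.g. collected in frames classified as noise(-dominated)
    C0 : (M, M) assumed fixed structure of the noise CPSD matrix

    For x_i ~ CN(0, c*C0), maximizing the likelihood over c yields
    c_hat = (1 / (N*M)) * sum_i x_i^H C0^{-1} x_i.
    """
    N, M = X.shape
    Y = np.linalg.solve(C0, X.T)                        # columns: C0^{-1} x_i
    quad = np.real(np.einsum('im,mi->i', X.conj(), Y))  # x_i^H C0^{-1} x_i
    return float(quad.sum() / (N * M))
```

When the true structure of the noise CPSD matrix changes over time (e.g. a competing speaker appearing), the fixed-C0 assumption breaks down, which motivates the own-voice-gated updates described in this disclosure.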
In the first scenario illustrated, a hearing device user 1 is shown together with a target sound source 2 and surrounding noise signal components 3.
The hearing device user 1 may wear a hearing device comprising a first microphone 4 and a second microphone 5 on a left ear of the user 1, and a third microphone 6 and a fourth microphone 7 on the right ear of the user 1.
The target sound source 2 may be located near the hearing device user 1 and may be configured to generate and emit a target speech signal into the environment of the user 1. The target source 2 may as such be a person, a radio, a television, etc. configured to generate a target speech signal. The target speech signal may be directed towards the user 1 or may be directed away from the user 1.
The noise signal components 3 are shown to surround both the hearing device user 1 and the target sound source 2 and therefore affect the target source signal received at the hearing device user 1. The noise signal components may comprise localized noise sources (e.g. a machine, a fan, etc.), and/or distributed (diffuse, isotropic) noise sound sources.
The first microphone 4, the second microphone 5, the third microphone 6 and the fourth microphone 7 may (each) provide an electric input signal comprising the target speech signal and the noise signal components 3.
In the corresponding timing diagram, the voice activity of the user 1 and of the target source 2 is illustrated over time instances t0 to t8, together with the resulting detections of a voice activity detector (VAD) and an own voice VAD.
The own voice VAD may detect that the user 1 is speaking in the time segment between t1 and t2 and in the time segment between t5 and t6. The VAD, on the other hand, will detect that speech (from both the user 1 and the target source 2) is being generated in the entire time segment from t1 to t8. However, depending on the resolution of the VAD used, there may be a small break in detected voice activity in the segments t2 to t3, t4 to t5, and t6 to t7.
In a classical approach (upper part of the diagram), in which the VAD is used to detect the presence of speech, the noise reduction system of the hearing device would only be updated at times where no speech is generated (from either the user 1 or the target source 2), i.e. from t0 to t1 and from t8 onwards, as the VAD is not able to distinguish between speech from the user 1 and speech from the target source 2.
With use of an own voice VAD (lower part of the diagram), the noise reduction system of the hearing device may be configured to be updated not only when no speech is detected, but also when speech from the user 1 is detected by the own voice VAD, i.e. from t0 to t2, from t5 to t6, and from t8 onwards.
Accordingly, noise signal components may be identified during time segments (time intervals) where said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user 1, or originates from the voice of the user 1 with a probability above an own voice presence probability (OVPP) threshold value, e.g. 60%, or 70%.
Combining the own voice VAD and the VAD in the hearing device, the noise reduction system may be configured to both detect when the user 1 is speaking and when the target source 2 is speaking. Thereby, the noise reduction system may be updated during time segments where no speech signal is generated and where the user 1 is speaking, but may be prevented from updating at time segments where only the target sound source 2 is generating a target speech signal (speaking).
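The combined gating may be summarized in a single frame-wise rule; a minimal sketch follows (boolean detector outputs are a simplification of the probability-based variants described above):

```python
def allow_noise_update(vad_speech, ovd_own_voice):
    """Decide whether the noise estimate may be updated in this frame.

    Classical rule: update only when no speech at all is detected.
    Extended rule: additionally update while the user's own voice is
    detected, so that a source talking over the user is captured as noise.
    """
    return (not vad_speech) or ovd_own_voice
```

Applied to the timing example above, this yields updates from t0 to t2, from t5 to t6, and from t8 onwards, but not while only the target source 2 is speaking.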
In the next scenario illustrated, the hearing device user 1 is shown together with a competing speaker 8 and surrounding noise signal components 3.
As was the case in the previous scenario, the hearing device user 1 may wear a hearing device comprising the first microphone 4 and the second microphone 5 on the left ear of the user 1, and the third microphone 6 and the fourth microphone 7 on the right ear of the user 1.
The competing speaker 8 may be located near the hearing device user 1 and may be configured to generate and emit a competing speech signal (i.e. an unwanted speech signal) into the environment of the user 1. The competing speaker 8 may as such be a person, a radio, a television, etc. configured to generate a competing speech signal. The competing speech signal may be directed towards the user 1 or may be directed away from the user 1.
The noise signal components 3 are shown to surround both the hearing device user 1 and the competing speaker 8 and therefore affect the estimation of the own voice of the user 1, i.e. the wanted speech signal (e.g. in case the hearing device comprises or implements a headset), received at the hearing device microphones 4,5,6,7.
In the corresponding timing diagram, the voice activity of the user 1 and of the competing speaker 8 is illustrated over time instances t0 to t8.
The own voice VAD (lower part of the diagram) may detect that the user 1 is speaking in the time segment between t1 and t2 and in the time segment between t5 and t6, whereas the VAD will detect that speech (from both the user 1 and the competing speaker 8) is being generated in the entire time segment from t1 to t8.
In a classical approach (upper part of the diagram), in which the VAD is used to detect the presence of speech, the noise reduction system of the hearing device would only be updated at times where no speech is generated (from either the user 1 or the competing speaker 8), i.e. from t0 to t1 and from t8 onwards.
With use of an own voice VAD (lower part of the diagram), the noise reduction system of the hearing device may be configured to be updated not only when no speech is detected, but also when speech from the user 1 is detected by the own voice VAD, i.e. from t0 to t2, from t5 to t6, and from t8 onwards.
Accordingly, noise signal components (including from the competing speaker 8) may be identified during time segments where said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user 1, or originates from the voice of the user 1 with a probability above an own voice presence probability (OVPP) threshold value.
Combining the own voice VAD and the VAD in the hearing device, the noise reduction system may be configured to both detect when the user 1 is speaking and when the competing speaker 8 is speaking alone. Thereby, the noise reduction system may be updated during time intervals where no speech signal is generated and where the user 1 is speaking, but may be prevented from updating at time intervals where the competing speaker 8 is generating a speech signal.
In the third scenario illustrated, the hearing device user 1 is shown together with both a target sound source 2 and a competing speaker 8, surrounded by noise signal components 3.
As was the case in the previous scenarios, the hearing device user 1 may wear a hearing device comprising the first microphone 4 and the second microphone 5 on the left ear of the user 1, and the third microphone 6 and the fourth microphone 7 on the right ear of the user 1.
The target sound source 2 and the competing speaker 8 may be located near the hearing device user 1 and may be configured to generate and emit speech signals into the environment of the user 1. The target speech signal and/or the competing speaker speech signal may be directed towards the user 1 or may be directed away from the user 1.
The noise signal components 3 are shown to surround both the hearing device user 1, the competing speaker 8, and the target sound source 2 and may therefore affect the target source signal received at the hearing device user 1.
The first microphone 4, the second microphone 5, the third microphone 6 and the fourth microphone 7 may provide an electric input signal comprising the target speech signal, the competing speaker signal, and the noise signal components 3.
In the corresponding timing diagram, the voice activity of the user 1, the target source 2, and the competing speaker 8 is illustrated over time instances t0 to t8.
The own voice VAD will detect that the user 1 is speaking in the time interval between t1 and t2 and in the time interval between t5 and t6. The VAD on the other hand will detect that speech (from both the user 1, the competing speaker 8, and the target source 2) is being generated in the entire time interval from t1 to t8.
In a classical approach in which the VAD may be used to detect the presence of speech, the noise reduction system of the hearing device would only be updated at times where no speech is generated (from the user 1, the competing speaker 8, or the target source 2), as the VAD is not able to distinguish between speech from the user 1, the competing speaker 8, and the target source 2. Accordingly, the noise reduction system will only be updated at times where the VAD does not detect speech, i.e. from t0 to t1 and from t8 onwards.
With use of an own voice VAD, the noise reduction system of the hearing device may be configured to be updated not only when no speech is detected, but also when speech from the user 1 is detected by the own voice VAD, i.e. from t0 to t2, from t5 to t6, and from t8 onwards.
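The update intervals in this example can be verified with a small numerical sketch; the frame-wise detector flags below are illustrative stand-ins for the detector outputs, one flag per interval [t_i, t_(i+1)).

```python
import numpy as np

# Illustrative frame-wise detector outputs: one flag per interval [t_i, t_{i+1}).
# Speech from some source is active throughout [t1, t8); the user's own
# voice is active in [t1, t2) and [t5, t6), as in the example above.
intervals = np.arange(9)
vad = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0], dtype=bool)  # any speech detected
ovd = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=bool)  # own voice detected

classical_update = ~vad        # classical: update only in speech pauses
proposed_update = ~vad | ovd   # proposed: also update while the user speaks

print("classical:", intervals[classical_update])  # [0 8] -> t0-t1 and t8 onwards
print("proposed: ", intervals[proposed_update])   # [0 1 5 8] -> t0-t2, t5-t6, t8 onwards
```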
Accordingly, noise signal components may be identified during time segments where said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user 1, or originates from the voice of the user 1 with a probability above an own voice presence probability (OVPP) threshold value.
Combining the own voice VAD and the VAD in the hearing device, the noise reduction system may be configured to detect both when the user 1 is speaking and when the target source 2 and the competing speaker 8 are speaking. Thereby, the noise reduction system may be updated during time intervals where no speech signal is generated and during time intervals where the user 1 is speaking, but may be prevented from updating at time intervals where the target sound source 2 is generating a target speech signal.
The noise reduction system (NRS) is configured to provide an estimate Ŝ(k,n) of a target speech signal (e.g. the hearing aid user's own voice, and/or the voice of a target speaker in the environment of the user), wherein noise signal components are at least partially attenuated. The noise reduction system (NRS) comprises a number of beamformers. The noise reduction system (NRS) comprises a beamformer (BF), e.g. an MVDR beamformer or an MWF (multichannel Wiener filter) beamformer, connected to the input unit (IU) and configured to receive the electric input signals (S1(k,n), . . . , SM(k,n)) in a time-frequency representation. The beamformer (BF) is configured to provide at least one beamformed (spatially filtered) signal, e.g. the estimate Ŝ(k,n) of a target speech signal.
Directionality by beamforming is an efficient way to attenuate unwanted noise, as a direction-dependent gain can cancel noise from one direction while preserving the sound of interest impinging from another direction, thereby potentially improving the intelligibility of a target speech signal (thereby providing spatial filtering). Typically, beamformers in hearing devices, e.g. hearing aids, have beampatterns which are continuously adapted in order to minimize noise components while sound impinging from a target direction is left unaltered. Typically, the acoustic properties of the noise signal change over time. Hence, the noise reduction system is implemented as an adaptive system, which adapts the directional beampattern in order to minimize the noise while the target sound (direction) is left unaltered.
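As an illustration of such spatial filtering, a per-bin MVDR weight computation may be sketched as follows; this is a minimal sketch, assuming a noise CPSD estimate C_v and a target steering (RTF) vector d are available, with illustrative names.

```python
import numpy as np

def mvdr_weights(C_v, d):
    """MVDR beamformer for one frequency bin: minimize the output noise
    power subject to a distortionless response towards the target.

    C_v : (M, M) estimated noise CPSD matrix
    d   : (M,)   target steering / relative transfer function vector
    """
    Cinv_d = np.linalg.solve(C_v, d)      # C_v^{-1} d without an explicit inverse
    return Cinv_d / (d.conj() @ Cinv_d)   # distortionless normalization

# As the noise estimate C_v is updated over time, the weights (and hence
# the beampattern) adapt; y = w.conj() @ x applies them to a frame x.
```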
The noise reduction system (NRS) of the illustrated embodiment comprises a first noise reduction system (NRS1) and a second noise reduction system (NRS2).
The first noise reduction system (NRS1) is configured to provide an estimate of the user's own voice ŜOV. The first noise reduction system (NRS1) may comprise an own voice maintaining beamformer and an own voice cancelling beamformer. The signal provided by the own voice cancelling beamformer comprises the noise sources when the user speaks.
The second noise reduction system (NRS2) is configured to provide an estimate of a target sound source (e.g. a voice ŜENV of a speaker in the environment of the user). The second noise reduction system (NRS2) may comprise an environment target source maintaining beamformer and an environment target source cancelling beamformer, and/or an own voice cancelling beamformer. The signal provided by the target cancelling beamformer comprises the noise sources when the target speaker speaks, and the signal provided by the own voice cancelling beamformer comprises the noise sources when the user speaks.
In the present application, a maximum likelihood estimator of the noise CPSD matrix that overcomes the limitation of the methods presented in [1,4] (e.g. when a prominent interference is present in the acoustic environment) is disclosed. It is proposed to extend the noise CPSD matrix model. In the following, the signal model of the noisy observations in the acoustic scene is presented. Based on the signal model, the proposed ML estimator of the interference-plus-noise CPSD matrix is derived, and the proposed method is exemplified by application to own voice retrieval.
The acoustic scene consists of a user equipped with hearing aids or a headset with access to at least M>2 microphones. The microphones pick up the sound from the environment and the noisy signal is sampled into a discrete sequence $x_m(t) \in \mathbb{R}$, $t \in \mathbb{N}_0$, for all $m = 1, \ldots, M$ microphones. The noisy signal at the m'th microphone is modelled as

$$x_m(t) = s_o(t) * d_{o,m}(t) + v_c(t) * d_m(t, \theta_c) + v_{e,m}(t), \qquad (1)$$
where $*$ denotes convolution, $s_o(t)$ is the own-voice signal, $v_c(t)$ is the interfering (competing) speech signal, $v_{e,m}(t)$ is the additive noise at the m'th microphone, $d_{o,m}(t)$ is the relative impulse response between the m'th microphone and the own-voice source, and $d_m(t, \theta_c)$ is the relative impulse response between the m'th microphone and the interference arriving from direction $\theta_c \in \Theta$, where we without loss of generality assume that $\Theta$ is a discrete set of directions $\Theta = \{-180°, \ldots, 180°\}$ with $I$ elements. An illustration of the acoustic scene is shown in the accompanying figure.
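A synthetic instance of the signal model (1) can be generated as follows; white-noise stand-ins replace real speech and measured relative impulse responses, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T_len = 4, 16000                    # microphones; samples (1 s at 16 kHz)
s_o = rng.standard_normal(T_len)       # stand-in for the own-voice signal s_o(t)
v_c = rng.standard_normal(T_len)       # stand-in for the competing speech v_c(t)
d_o = rng.standard_normal((M, 8))      # stand-in relative impulse responses d_{o,m}(t)
d_c = rng.standard_normal((M, 8))      # stand-in responses d_m(t, theta_c)

# Eq. (1): each microphone observes the own voice and the interference,
# each convolved with its relative impulse response, plus additive noise.
x = np.stack([
    np.convolve(s_o, d_o[m], mode="same")
    + np.convolve(v_c, d_c[m], mode="same")
    + 0.1 * rng.standard_normal(T_len)         # v_{e,m}(t)
    for m in range(M)
])
```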
We apply the short-time Fourier transform (STFT) to $x_m(t)$ to transform the noisy signal into the time-frequency (TF) domain with frame length $T$, decimation factor $D$, and analysis window $w_A(t)$, such that

$$x_m(k, n) = \sum_{t=0}^{T-1} x_m(t + nD)\, w_A(t)\, e^{-j 2\pi k t / T} \qquad (2)$$

is the TF domain representation of the noisy signal, where $j = \sqrt{-1}$, $k$ is the frequency bin index, and $n$ is the frame index.
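This transform may be sketched directly from the definition; the square-root Hann window is an illustrative choice, not mandated by the text.

```python
import numpy as np

def stft(x, T=256, D=128, w_A=None):
    """Short-time Fourier transform with frame length T, decimation
    (hop) D, and analysis window w_A, following eq. (2)."""
    if w_A is None:
        w_A = np.sqrt(np.hanning(T))            # illustrative analysis window
    n_frames = (len(x) - T) // D + 1
    frames = np.stack([x[n * D:n * D + T] * w_A for n in range(n_frames)])
    # One frame per row; rfft returns the one-sided spectrum with
    # K = T // 2 + 1 frequency bins per frame.
    return np.fft.rfft(frames, axis=-1)
```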
The signal model for the noisy observation in the TF domain then becomes

$$x_m(k, n) = s_o(k, n)\, d_{o,m}(k, n) + v_c(k, n)\, d_m(k, n, \theta_c) + v_{e,m}(k, n), \qquad (3)$$
and for convenience, we vectorize the noisy observation such that $\mathbf{x}(k, n) = [x_1(k, n), \ldots, x_M(k, n)]^T$ and

$$\mathbf{x}(k, n) = s_o(k, n)\, \mathbf{d}_o(k, n) + v_c(k, n)\, \mathbf{d}(k, n, \theta_c) + \mathbf{v}_e(k, n), \qquad (4)$$

with $\mathbf{d}_o(k, n)$, $\mathbf{d}(k, n, \theta_c)$, and $\mathbf{v}_e(k, n)$ defined accordingly.
We further assume that the relative transfer function (RTF) vectors (i.e. $\mathbf{d}_o(k, n)$ and $\mathbf{d}(k, n, \theta_c)$) remain identical over time, so we may define $\mathbf{d}_o(k) \triangleq \mathbf{d}_o(k, n)$ and $\mathbf{d}(k, \theta_c) \triangleq \mathbf{d}(k, n, \theta_c)$. In practice, it is often the case that $s_o(k, n)$, $v_c(k, n)$, and $\mathbf{v}_e(k, n)$ are uncorrelated random processes, meaning that the CPSD matrix of the noisy observations, i.e. $C_x(k, n) = E\{\mathbf{x}(k, n)\, \mathbf{x}^H(k, n)\}$, is given as

$$C_x(k, n) = \lambda_s(k, n)\, \mathbf{d}_o(k)\, \mathbf{d}_o^H(k) + \lambda_c(k, n)\, \mathbf{d}(k, \theta_c)\, \mathbf{d}^H(k, \theta_c) + \lambda_e(k, n)\, \Gamma_e(k, n), \qquad (5)$$

where $\lambda_s(k, n)$, $\lambda_c(k, n)$, and $\lambda_e(k, n)$ are the power spectral densities (PSDs) of the own-voice, interference, and noise, respectively. $\Gamma_e(k, n)$ is the normalized noise CPSD matrix with 1 at the reference microphone index; we assume that $\Gamma_e(k, n)$ is a known matrix, which can for approximately isotropic noise fields e.g. be modelled with entries

$$[\Gamma_e(k)]_{m,m'} = \operatorname{sinc}\!\left(\frac{2\pi f_k\, \ell_{m,m'}}{c}\right), \qquad (6)$$

where $\ell_{m,m'}$ is the distance between microphones $m$ and $m'$, $f_k$ is the center frequency of the k'th frequency bin, $c$ is the speed of sound, and $\operatorname{sinc}(x) = \sin(x)/x$.
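The isotropic model in eq. (6) (as reconstructed here, a standard spherically isotropic coherence model) can be computed from the microphone geometry; the positions and the speed of sound are illustrative inputs.

```python
import numpy as np

def isotropic_coherence(mic_pos, f_k, c=343.0):
    """Normalized noise CPSD matrix of a spherically isotropic noise
    field: entry (m, m') is sinc(2*pi*f_k*dist(m, m')/c), so the
    diagonal (including the reference microphone entry) is 1.

    mic_pos : (M, 3) microphone positions in metres
    f_k     : centre frequency of the k'th bin in Hz
    """
    diff = mic_pos[:, None, :] - mic_pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # np.sinc(y) = sin(pi*y)/(pi*y), so passing y = 2*f_k*dist/c yields
    # sin(2*pi*f_k*dist/c) / (2*pi*f_k*dist/c).
    return np.sinc(2.0 * f_k * dist / c)
```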
We assume that the own voice RTF vector $\mathbf{d}_o(k)$ is known, as it can be measured in advance before deployment. The parameters that remain to be estimated are $\lambda_c(k, n)$, $\lambda_e(k, n)$, and $\theta_c$, and the proposed ML estimators of these parameters are presented in the following.
To estimate the interference-plus-noise PSDs $\lambda_c(k, n)$ and $\lambda_e(k, n)$ and the interference direction $\theta_c$, we first apply an own voice cancelling beamformer to obtain an interference-plus-noise-only signal (i.e. a signal in which the own voice is removed, while e.g. the signal from a competing speaker and the background noise remain). The own voice cancelling beamformer is implemented using an own voice blocking matrix $B_o(k)$. A common approach to find the own voice blocking matrix is to first find the orthogonal projection matrix of $\mathbf{d}_o(k)$ and then select the first $M-1$ column vectors of the projection matrix. More explicitly, let $I_{M \times M}$ be an $M \times M$ identity matrix; then $I_{M \times M-1}$ denotes the first $M-1$ column vectors of $I_{M \times M}$. The own voice blocking matrix is then given as

$$B_o(k) = \left(I_{M \times M} - \frac{\mathbf{d}_o(k)\, \mathbf{d}_o^H(k)}{\mathbf{d}_o^H(k)\, \mathbf{d}_o(k)}\right) I_{M \times M-1}, \qquad (7)$$

where $B_o(k) \in \mathbb{C}^{M \times (M-1)}$. The own voice blocked signal, $\mathbf{z}(k, n)$, can be expressed as

$$\mathbf{z}(k, n) = B_o^H(k)\, \mathbf{x}(k, n) = v_c(k, n)\, \tilde{\mathbf{d}}(k, \theta_c) + \tilde{\mathbf{v}}_e(k, n), \qquad (8)$$

where $\tilde{\mathbf{d}}(k, \theta_c) = B_o^H(k)\, \mathbf{d}(k, \theta_c)$ and $\tilde{\mathbf{v}}_e(k, n) = B_o^H(k)\, \mathbf{v}_e(k, n)$.
The own voice blocked CPSD matrix is then

$$C_z(k, n) = \lambda_c(k, n)\, \tilde{\mathbf{d}}(k, \theta_c)\, \tilde{\mathbf{d}}^H(k, \theta_c) + \lambda_e(k, n)\, \tilde{\Gamma}_e(k, n), \qquad (9)$$

with $\tilde{\Gamma}_e(k, n) = B_o^H(k)\, \Gamma_e(k, n)\, B_o(k)$.
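A minimal sketch of the blocking-matrix construction of eq. (7) follows; the function applies to any steering vector, so the same routine can be reused for the own voice-plus-interference blocking matrix introduced below.

```python
import numpy as np

def blocking_matrix(d):
    """Blocking matrix for a steering vector d (eq. (7)): project onto
    the orthogonal complement of d and keep the first M-1 columns."""
    M = len(d)
    P = np.eye(M) - np.outer(d, d.conj()) / (d.conj() @ d)
    return P[:, :M - 1]                        # B in C^{M x (M-1)}

# The blocked signal z = B.conj().T @ x contains (numerically) no
# component in the direction d, since B.conj().T @ d is zero.
```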
Before presenting the ML estimators of $\lambda_c(k, n)$, $\lambda_e(k, n)$, and $\theta_c$, we introduce the own voice-plus-interference blocking matrix $\tilde{B}(\theta_i)$.
This step is necessary as the ML estimator of the noise PSD, $\lambda_e(k, n)$, further requires that the interference is removed from the own voice blocked signal $\mathbf{z}(k, n)$. Forming the own voice-plus-interference blocking matrix follows a similar procedure as forming the own voice blocking matrix. The own voice-plus-interference blocking matrix can be found as

$$\tilde{B}(\theta_i) = \left(I_{(M-1) \times (M-1)} - \frac{\tilde{\mathbf{d}}(k, \theta_i)\, \tilde{\mathbf{d}}^H(k, \theta_i)}{\tilde{\mathbf{d}}^H(k, \theta_i)\, \tilde{\mathbf{d}}(k, \theta_i)}\right) I_{(M-1) \times (M-2)}, \qquad (10)$$
where $\tilde{B}(\theta_i) \in \mathbb{C}^{(M-1) \times (M-2)}$. The own voice-plus-interference blocking matrix $\tilde{B}(\theta_i)$ is a function of direction, as the direction of the interference is generally unknown. The own voice-plus-interference blocked signal is then

$$\tilde{\mathbf{z}}(k, n, \theta_i) = \tilde{B}^H(\theta_i)\, \mathbf{z}(k, n), \qquad (11)$$
and the blocked own voice-plus-interference CPSD matrix is

$$C_{\tilde{z}}(k, n, \theta_i) = \tilde{B}^H(\theta_i)\, C_z(k, n)\, \tilde{B}(\theta_i). \qquad (12)$$
It is common to assume that the own-voice, interference, and noise are temporally uncorrelated [6]. Under this assumption, the own voice blocked signal is distributed according to a circularly symmetric complex Gaussian distribution, i.e. $\mathbf{z}(k, n) \sim \mathcal{CN}(\mathbf{0}, C_z(k, n))$, meaning that the log-likelihood function for $N$ observations of $\mathbf{z}(k, n)$, with $Z(k, n) = [\mathbf{z}(k, n-N+1), \ldots, \mathbf{z}(k, n)] \in \mathbb{C}^{(M-1) \times N}$, is given as

$$\ln f\left(Z(k, n); \lambda_c, \lambda_e, \theta_c\right) = -N \left(\ln \det\left(\pi\, C_z(k, n)\right) + \operatorname{tr}\!\left(C_z^{-1}(k, n)\, \hat{C}_z(k, n)\right)\right), \qquad (13)$$
where $\operatorname{tr}(\cdot)$ denotes the trace operator and

$$\hat{C}_z(k, n) = \frac{1}{N} \sum_{l=0}^{N-1} \mathbf{z}(k, n-l)\, \mathbf{z}^H(k, n-l)$$

is the sample estimate of the own voice blocked CPSD matrix. ML estimators of the interference-plus-noise PSDs $\lambda_c(k, n)$ and $\lambda_e(k, n)$ have been derived in [1,4]. The ML estimator of $\lambda_e(k, n)$ is given as

$$\hat{\lambda}_e(k, n, \theta_i) = \frac{1}{M-2} \operatorname{tr}\!\left(\left(\tilde{B}^H(\theta_i)\, \tilde{\Gamma}_e(k, n)\, \tilde{B}(\theta_i)\right)^{-1} \hat{C}_{\tilde{z}}(k, n, \theta_i)\right), \qquad (14)$$
with

$$\hat{C}_{\tilde{z}}(k, n, \theta_i) = \tilde{B}^H(\theta_i)\, \hat{C}_z(k, n)\, \tilde{B}(\theta_i)$$

being the sample covariance of the own voice-plus-interference blocked signal, and the ML estimator of the interference PSD is then given as [7]
$$\hat{\lambda}_c(k, n, \theta_i) = \tilde{\mathbf{w}}^H(\theta_i)\left(\hat{C}_z(k, n) - \hat{\lambda}_e(k, n, \theta_i)\, \tilde{\Gamma}_e(k, n)\right)\tilde{\mathbf{w}}(\theta_i), \qquad (15)$$
where $\tilde{\mathbf{w}}(\theta_i)$ is the MVDR beamformer constructed in the own voice blocked domain, i.e.

$$\tilde{\mathbf{w}}(\theta_i) = \frac{\tilde{\Gamma}_e^{-1}(k, n)\, \tilde{\mathbf{d}}(k, \theta_i)}{\tilde{\mathbf{d}}^H(k, \theta_i)\, \tilde{\Gamma}_e^{-1}(k, n)\, \tilde{\mathbf{d}}(k, \theta_i)}. \qquad (16)$$
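A sketch of these estimators, following eqs. (14)-(16) as reconstructed above, is given below; all argument names are illustrative, and the routine operates on one frequency bin and one candidate direction at a time.

```python
import numpy as np

def ml_psd_estimates(C_z_hat, d_t, Gamma_t, B_t):
    """ML estimates of the noise and interference PSDs for one candidate
    direction, following eqs. (14)-(16).

    C_z_hat : (M-1, M-1) sample CPSD of the own voice blocked signal
    d_t     : (M-1,)     blocked interference RTF for the candidate direction
    Gamma_t : (M-1, M-1) blocked normalized noise CPSD matrix
    B_t     : (M-1, M-2) own voice-plus-interference blocking matrix
    """
    M1 = C_z_hat.shape[0]                     # equals M - 1
    # Eq. (14): noise PSD from the doubly blocked signal.
    G = B_t.conj().T @ Gamma_t @ B_t
    C_zz = B_t.conj().T @ C_z_hat @ B_t
    lam_e = np.real(np.trace(np.linalg.solve(G, C_zz))) / (M1 - 1)
    # Eq. (16): MVDR beamformer in the blocked domain.
    Gi_d = np.linalg.solve(Gamma_t, d_t)
    w = Gi_d / (d_t.conj() @ Gi_d)
    # Eq. (15): interference PSD at the MVDR output, noise floor removed.
    lam_c = np.real(w.conj() @ (C_z_hat - lam_e * Gamma_t) @ w)
    return lam_e, lam_c
```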
Inserting the ML estimates $\hat{\lambda}_e(k, n, \theta_i)$ and $\hat{\lambda}_c(k, n, \theta_i)$ into the likelihood function, we obtain the concentrated likelihood function

$$\ln f\left(Z(k, n); \hat{\lambda}_c(k, n, \theta_i), \hat{\lambda}_e(k, n, \theta_i), \theta_i\right), \qquad (17)$$

which depends on the candidate interference direction $\theta_i$ only.
Under the assumptions that only one single interference is present in the acoustic environment and that the noisy observations across frequency bins are uncorrelated, a wideband concentrated log-likelihood function can be derived as the sum of the narrowband concentrated log-likelihoods,

$$L(\theta_i) = \sum_{k=1}^{K} \ln f\left(Z(k, n); \hat{\lambda}_c(k, n, \theta_i), \hat{\lambda}_e(k, n, \theta_i), \theta_i\right), \qquad (18)$$

where $K$ is the total number of frequency bins of the one-sided spectrum. To obtain the ML estimate of the interference direction, we maximize this function,

$$\hat{\theta}_c = \arg\max_{\theta_i \in \Theta} L(\theta_i). \qquad (19)$$
As $\theta_i$ belongs to a discrete set of directions, the ML estimate of $\theta_c$ is obtained through an exhaustive search over $\theta_i$. Finally, to obtain an estimate of the interference-plus-noise CPSD matrix, we insert the ML estimates into the interference-plus-noise CPSD model, i.e.

$$\hat{C}_v(k, n) = \hat{\lambda}_c(k, n, \hat{\theta}_c)\, \mathbf{d}(k, \hat{\theta}_c)\, \mathbf{d}^H(k, \hat{\theta}_c) + \hat{\lambda}_e(k, n, \hat{\theta}_c)\, \Gamma_e(k, n). \qquad (20)$$
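The exhaustive search of eqs. (18)-(19) reduces to a sum and an argmax over a precomputed likelihood table; the sketch below assumes such a table is available and uses illustrative names.

```python
import numpy as np

def estimate_interference_direction(log_lik, thetas):
    """Exhaustive ML search over a discrete set of candidate directions.

    log_lik : (K, I) concentrated log-likelihood values, one per
              frequency bin k and candidate direction theta_i
    thetas  : (I,)   candidate directions in degrees
    """
    wideband = log_lik.sum(axis=0)       # eq. (18): sum over the K bins
    return thetas[np.argmax(wideband)]   # eq. (19): pick the maximizer
```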
For own voice retrieval, we implement the MWF beamformer. It is well known that the MWF can be decomposed into an MVDR beamformer and a single-channel post Wiener filter [10]. The MVDR beamformer is given as

$$\mathbf{w}_{\mathrm{MVDR}}(k, n) = \frac{\hat{C}_v^{-1}(k, n)\, \mathbf{d}_o(k)}{\mathbf{d}_o^H(k)\, \hat{C}_v^{-1}(k, n)\, \mathbf{d}_o(k)}, \qquad (21)$$
and the single-channel post Wiener filter is

$$g(k, n) = \frac{\hat{\lambda}_s(k, n)}{\hat{\lambda}_s(k, n) + \left(\mathbf{d}_o^H(k)\, \hat{C}_v^{-1}(k, n)\, \mathbf{d}_o(k)\right)^{-1}}, \qquad (22)$$

where $\hat{\lambda}_s(k, n)$ is an estimate of the own-voice PSD and the second term in the denominator is the residual noise power at the MVDR output.
The MWF beamformer coefficients are then found as

$$\mathbf{w}_{\mathrm{MWF}}(k, n) = g(k, n)\, \mathbf{w}_{\mathrm{MVDR}}(k, n). \qquad (23)$$
Finally, the own voice signal can be estimated as a linear combination of the noisy observations using the beamformer weights, i.e.

$$y(k, n) = \mathbf{w}_{\mathrm{MWF}}^H(k, n)\, \mathbf{x}(k, n). \qquad (24)$$
The enhanced TF domain signal, $y(k, n)$, is then transformed back into the time domain using the inverse STFT, such that $y(t)$ is the retrieved own voice time domain signal.
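Collecting eqs. (21)-(24), the per-bin own voice retrieval can be sketched as follows; the own-voice PSD estimate lam_s is assumed to be provided by a separate estimator, which the text does not specify, and the post filter follows the reconstruction of eq. (22) above.

```python
import numpy as np

def retrieve_own_voice_bin(x, C_v, d_o, lam_s):
    """Own voice estimate for one TF bin via the MVDR + single-channel
    Wiener decomposition of the MWF (eqs. (21)-(24)).

    x     : (M,)   noisy multi-microphone observation for this bin
    C_v   : (M, M) estimated interference-plus-noise CPSD matrix, eq. (20)
    d_o   : (M,)   own voice RTF vector
    lam_s : assumed own-voice PSD estimate (not specified in the text)
    """
    Ci_d = np.linalg.solve(C_v, d_o)
    w_mvdr = Ci_d / (d_o.conj() @ Ci_d)              # eq. (21)
    noise_out = 1.0 / np.real(d_o.conj() @ Ci_d)     # residual noise power
    g = lam_s / (lam_s + noise_out)                  # eq. (22)
    w_mwf = g * w_mvdr                               # eq. (23)
    return w_mwf.conj() @ x                          # eq. (24)
```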
It is intended that the structural features of the devices described above, in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.
As used, the singular forms "a," "an," and "the" are intended to include the plural forms as well (i.e. to have the meaning "at least one"), unless expressly stated otherwise. It will be further understood that the terms "includes," "comprises," "including," and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, but an intervening element may also be present, unless expressly stated otherwise. Furthermore, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" or "an aspect" or features included as "may" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects.
The claims are not intended to be limited to the aspects shown herein but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.
Accordingly, the scope should be judged in terms of the claims that follow.
This application is a Continuation of copending application Ser. No. 17/982,687, filed on Nov. 8, 2022, which is a Continuation of copending application Ser. No. 17/017,092, filed on Sep. 10, 2020 (now U.S. Pat. No. 11,533,554, issued Dec. 20, 2022), which claims priority under 35 U.S.C. § 119(a) to Application No. 19196675.3, filed in Europe on Sep. 11, 2019, all of which are hereby expressly incorporated by reference into the present application.
Related U.S. application data: the present application, Ser. No. 18/507,930, is a continuation of application Ser. No. 17/982,687, which is in turn a continuation of application Ser. No. 17/017,092.