The present invention relates to an audio-signal processor, and a method performed by that processor, for filtering an audio signal-of-interest from an input audio signal comprising a mixture of the signal-of-interest and background noise.
Audio-signal processing, particularly the processing of human speech or voice, plays a central role in now-ubiquitous active voice-control and voice-recognition systems. These areas represent a rapidly growing market sector, with an increasing proportion of searches now made by voice.
Audio-signal processing of speech requires devices to have an ability to extract clear speech from ongoing background noise. A sector receiving focus in this respect is next-generation speech processors for hearing assistive devices. Worldwide, disabling deafness affects 1.1 billion people, eighty-seven percent of whom live in the developing world (World Health Organisation (WHO), 2012). Hearing aids are ranked in the top 20 of the WHO Priority Assistive Products List. The Assistive Products List supports the UN Convention on the Rights of Persons with Disabilities to ensure the goal of global access to affordable assistive technology (WHO Priority Assistive Products List, 2016). A major factor contributing to elevated healthcare cost is the non-use of current hearing assistive devices by more than 60% of the hearing-impaired population issued with devices.
The main reason for the non-use of hearing assistive devices is cited to be poor speech enhancement in the presence of background noise (Taylor and Paisley, 2000). The long-term consequences of non-use of issued hearing assistive devices have been shown to be associated with cognitive decline and dementia (Wayne and Johnsrude, 2015). Therefore, improving speech enhancement in background noise is relevant for the design of signal/speech processors used in a range of both non-healthcare (e.g. phones, audio, voice/speech-activated systems/devices, enhanced listening, sound systems, speech recognition, speech-to-text systems, sonics) and healthcare devices (e.g. hearing aids, cochlear implants).
Speech recognition in ongoing background noise remains a challenge for audio-signal/speech processors, which often exhibit sub-optimal performance, especially when focussing on a single speaker's voice amongst a background of similar speakers (Kuo et al., 2010). Insight into how signal-processing strategies could be improved can be gained from physiological processes that operate in a normal hearing system to improve speech perception in noise.
It is known in the art that prior art signal processors perform sub-optimally in multi-speaker environments (Ault et al., 2018), whereas the human auditory brain performs remarkably well in such environments to enhance speech in ongoing background noise, an ability known as the “cocktail party effect” (Cherry, 1953). For this reason, there is a need for improved audio-signal processors for filtering an audio signal-of-interest (e.g. a speech signal) from an input audio signal comprising a mixture of the signal-of-interest and background noise.
Over the past 20 years, advances in physiological and psychophysical methods have played an important role in understanding the biological systems underlying speech perception in noise. In particular, they have enabled the mapping of some of the mammalian neural pathways involved in signal (including speech) enhancement in noise (Warren and Liberman, 1989; Winslow and Sachs, 1988). It is now known that descending (efferent) neural fibres from higher-levels of the auditory system can modify auditory processing at lower-levels of the auditory system to enhance speech understanding in noisy backgrounds (Kawase et al., 1993). One such neural pathway, originating in the auditory brainstem (known as Brainstem-mediated, or “BrM” neural feedback) extending from the Superior Olivary Complex by way of the Medial OlivoCochlear (MOC) reflex, has been shown to modify the inner ear's response to sound (Liberman et al., 1996; Murugasu and Russell, 1997). A major benefit attributed to this neural feedback in humans is the improvement in detecting the signal of interest (e.g., speech) in noisy environments (Giraud et al., 1997), as evidenced by human neural-lesion studies (Giraud et al., 1997; Zeng and Liu, 2006). Other descending neural pathways from the auditory cortex (e.g. Cortical-mediated, or “CrM” neural feedback) involve attentional neural pathways which can further modify lower-level sound processing to enhance speech understanding in noise (Gao et al., 2017; Lauzon, 2017).
Human cortical brain activity, associated with attentional oscillations, is known to influence speech perception in noise (Gao et al., 2017) by affecting lower-level neural feedback to the ear's response to sound (Lauzon, 2017). However, so far these effects have not been successfully incorporated in signal processors because the appropriate correspondence between oscillatory changes in auditory attention and features of the incoming stimulus had not been identified. This has recently been resolved by studies in the fields of vision and audition, using electroencephalography (EEG) and psychophysics (Yu et al., 2017; Ho et al., 2017). In vision, attentional oscillations have been shown to affect detection performance by up to 10% (Ho et al., 2017). In current auditory models, effects of human cortical attentional oscillations on perception are modelled as deterministic (a fixed decrement in performance), or random (internal noise) (Hedrick, 2016). As with vision, auditory discriminability and criterion demonstrate strong cortical oscillations in different frequency ranges of cortical activity: ˜6 Hz for sensitivity and ˜8 Hz for criterion (Yu et al., 2017; Ho et al., 2017), with both affecting signal detection/classification. Incorporating oscillatory phase data into the decision device of an auditory signal processor to enhance speech is expected to improve speech detection by a similar degree and has, to date, not been considered in speech processor design.
However, knowledge of descending neural pathways (e.g. BrM and CrM neural feedback) has also been obtained predominantly from physiological studies on small non-human mammals. A known problem in the prior art has always been to design appropriate methodologies to measure the effect of this neural feedback on human hearing. Since the 2000s, psychophysical (e.g. Strickland and Krishnan, 2005; Strickland, 2008; Jennings and Strickland, 2010, 2012; Yasin et al., 2014) and otoacoustic emission measures (Backus and Guinan, 2006) have been used to infer the effect of BrM neural feedback on auditory processing in humans. Some of this human-derived data has been used for computational modelling of the auditory system, albeit using a restricted human dataset.
A few computational models of the human auditory system (that also underlie some signal-processing strategies) have used aspects of BrM neural feedback to improve speech in background noise (Ghitza 1988; Ferry and Meddis, 2007), or have modelled how such bio-inspired feedback improves tonal sound discrimination in noise (Chintapalli et al., 2012; Smalt et al., 2014). However, these models are based on small non-human mammalian datasets and simulate the physiological and neural processes in such mammals, such that they are not optimised for human applications.
Previous auditory computational models (Ferry and Meddis, 2007; Brown et al., 2010; Clark et al., 2012; Ghitza, 1988; Messing et al., 2009; Lopez-Poveda, 2017) do implement aspects of simulated BrM neural feedback for signal-processor design, but are limited in their effectiveness in enhancing speech in noise. This is due to their use of BrM neural feedback parameters often modelled using small mammalian datasets.
Known prior art devices, for instance hearing assistive devices, have incorporated surface electrodes (US 2013101128 A), or have been partly implanted (US 2014098981 A), to record bio-signals from the skin surface (electroencephalography; EEG), in some cases combined with feature extraction and with a feedback signal re-routed from an output action rather than based on BrM neural-inspired feedback (US 2019370650 A); however, none of these has incorporated the other components in the combinations described herein for an audio-signal processor.
It is with these problems in mind that the inventors have devised the present invention to overcome the shortcomings of the prior art.
Accordingly, the present invention aims to solve the above problems by providing, according to a first aspect, an audio-signal processor for filtering an audio signal-of-interest from an input audio signal comprising a mixture of the signal-of-interest and background noise, the processor comprising a frontend unit, the frontend unit comprising a filterbank comprised of an array of bandpass filters, a sound level estimator, and a memory with one or more input-output, I/O, functions stored on said memory; wherein the frontend unit
is configured to receive: an unfiltered input audio signal; and one or more human-derived neural-inspired feedback signals, NIFS.
Filterbanks comprise an array (or “bank”) of overlapping bandpass filters, which constitute the one or more bandpass filters. The audio-signal processor includes a ‘front-end unit’ with such a filterbank. The filterbank response is “tuned” to the neural feedback (i.e., the NIFS) based on human data. To accomplish this, the across-filter tuning of the feedback is based on previously published (Yasin et al., 2014; Drga et al., 2016) and unpublished human psychophysical and physiological data. For example, aspects of the NIFS may be based on published and unpublished human data depicting the change of the full I/O function in response to neural feedback activated by unmodulated and modulated sounds.
In preferred embodiments, the NIFS may be parameters derived from brain recordings (e.g. direct ongoing, pre-recorded, or generic human-derived datasets) and/or measurements derived by psychophysical/physiological assessments (e.g. direct ongoing, pre-recorded, or generic human-derived datasets). The NIFS may be processed by a higher-level auditory information processing module in conjunction with information received from a sound feature onset detector module and a signal-to-noise ratio estimator module at the output of the frontend unit.
The processor's ‘front-end unit’ includes the filterbank. The filterbank comprises an array of overlapping bandpass filters covering (for instance in a hearing aid) the range of frequencies important for speech perception and music. A gain and input-output (I/O) level function for a given filter and/or range of filters is set as follows depending on the input.
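By way of illustration only, such an array of overlapping bandpass filters may be sketched as follows (Python with scipy; the channel count, frequency range, filter order, and overlap factor are hypothetical choices and do not limit the invention):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def build_filterbank(num_channels=16, f_low=100.0, f_high=8000.0, fs=16000.0):
    """Build an array ("bank") of overlapping second-order-section bandpass filters.

    Centre frequencies are spaced logarithmically over the speech range; each
    filter's edges are widened slightly so that adjacent bands overlap.
    """
    centres = np.geomspace(f_low, f_high, num_channels)
    sos_bank = []
    for fc in centres:
        lo = fc / 1.3                      # widen edges so neighbouring bands overlap
        hi = min(fc * 1.3, 0.45 * fs)      # keep below the Nyquist limit
        sos_bank.append(butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos"))
    return centres, sos_bank

def apply_filterbank(x, sos_bank):
    """Return one band-limited output per channel for the input signal x."""
    return np.stack([sosfilt(sos, x) for sos in sos_bank])
```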
I/O functions can be represented graphically, showing how an output signal level (e.g. of a hearing aid) varies at various input signal levels. In this way, as is also known in the art, I/O functions can be used to determine the gain (in decibels), G (i.e. gain (G)=output (O)−input (I)) with respect to a given input (I) signal level. The I/O function can also be used to determine a change in gain (ΔG) with respect to a given input (I) signal level. Sometimes, the change in gain (ΔG) is referred to as “compression”. The I/O function may be derived from published and unpublished human data sets using both modulated noise (of varying types) and unmodulated noise (e.g., Yasin et al., 2020). The filterbank outputs (such as the sound level estimates) are used to modify the I/O function stored (e.g. on a memory) and determine an “enhanced” I/O function. As the I/O function is modified it becomes, as the name suggests, more honed or improved (“enhanced”) for purpose over time.
In one embodiment, the one or more I/O functions stored on the memory may be human-derived I/O functions. In this example, the frontend unit is configured to modify the human-derived I/O functions in response to the received sound level estimates and the NIFS and determine enhanced I/O functions.
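For illustration, gain and the local slope of an I/O function can be read directly from a tabulated I/O function as sketched below (the I/O values shown are hypothetical placeholders, not the human-derived datasets referred to above):

```python
import numpy as np

# Hypothetical tabulated I/O function: output level (dB) at each input level (dB).
input_db  = np.array([ 0, 20, 40, 60, 80, 100], dtype=float)
output_db = np.array([40, 58, 70, 78, 84,  90], dtype=float)

def gain_at(level_db):
    """Gain G = output (O) - input (I) at a given input level (linear interpolation)."""
    return np.interp(level_db, input_db, output_db) - level_db

def io_slope_at(level_db, delta=1.0):
    """Local slope dO/dI of the I/O function; a slope below 1.0 indicates that the
    gain is changing with input level, i.e. compressive behaviour."""
    o_lo = np.interp(level_db - delta / 2.0, input_db, output_db)
    o_hi = np.interp(level_db + delta / 2.0, input_db, output_db)
    return (o_hi - o_lo) / delta

print(gain_at(40.0))     # 30.0 dB of gain at a 40 dB input in this hypothetical table
print(io_slope_at(50.0)) # 0.4, i.e. compressive in the mid-level region
```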
In other preferred embodiments, elements of processed information are used in conjunction with information derived from the output of a feature extraction module in order to feed into a machine learning unit. In some embodiments, the machine learning unit has an internal decision device that interacts with the higher-level auditory information processing module in order to further optimise the NIFS parameters and optimise speech enhancement in background noise in the resultant speech-enhanced filtered output audio signal.
Enhanced I/O functions may specify how the functions are affected by sound level estimates as well as by neural feedback (e.g. BrM and CrM feedback and feedback from other higher levels of the auditory system) depending on the input level and temporal parameters of the sound input. In this way, incoming speech and background noise mixture is processed by an “enhanced” filterbank where a number of filter attributes can be adapted by neural-inspired feedback within the processor.
The processor of the present application advantageously incorporates human-derived neural-inspired feedback signals (NIFS) into an audio-signal processor. NIFS refer to aspects of neural feedback that are uniquely used by the human brain during the human brain's biological signal-processing of sound. The NIFS may, in some uses of the audio-signal processor, refer to parameters derived from direct ongoing recordings of brain activity (e.g., such as being received from EEG), or use pre-recorded or generic human-derived datasets relating to brain activity recordings from humans. In other cases the NIFS may be derived from previously published (Yasin et al., 2014; Drga et al., 2016) and unpublished human psychophysical and physiological data (generic human-derived datasets), or have been derived by psychophysical/physiological assessments conducted on the user (direct ongoing or pre-recorded).
The audio-signal processor of the present application may be used to perform the function of filtering an audio signal-of-interest (e.g. a speech signal) from an input audio signal comprising a mixture of the signal-of-interest and background noise. For this reason, the claimed processor can be used in various hearing assistive devices, such as hearing aids or cochlear implants, for example. In other words, the processor of the present invention uses a biomimicry of the human's auditory system in order to emulate the human brain's improved ability for audio-signal filtering, and therefore provides an improved audio-signal processor over known prior art audio-signal processors and/or signal-processing strategies.
By using NIFS, the claimed audio-signal processor of the present invention may be thought of as a “Neural-Inspired Intelligent Audio Signal” or “NIIAS” processor, where the input data is processed using parameters that are biologically inspired (bio-inspired) from humans.
As noted above, these parameters could be derived from direct ongoing recordings of brain activity (e.g. such as being received from EEG), from pre-recorded or generic human-derived datasets relating to brain activity recordings from humans, from previously published (Yasin et al., 2014; Drga et al., 2016) and unpublished human psychophysical and physiological data, or from psychophysical/physiological assessments conducted on the user (direct ongoing or pre-recorded). The claimed audio-signal processor may be thought of as a Neural-Inspired Intelligent Audio Signal processor because these parameters are improved or optimised for the user by way of the machine learning unit.
The processor of the present application provides improved speech-in-noise performance when compared to other audio-signal processors by using a strategy based on human-derived neural feedback mechanisms that operate in real time to improve speech in noisy backgrounds.
Optionally, the claimed processor can be integrated into a variety of speech recognition systems and speech-to-text systems. In this way, the claimed processor may also be referred to as a “NIIAS Processor Speech Recognition” or a “NIIASP-SR”. Example applications of the claimed processor may be for use in systems where clear extraction of speech against varying background noise is required. Examples of such applications include but are not limited to automated speech-recognition software and/or transcription software such as Dragon Naturally Speaking, mobile phone signal processors and networks, such as Microsoft™ speech-recognition, Amazon's Alexa™, Google Assistant™, Siri™ etc.
Optionally, the claimed processor may be used as a component for cochlear implants. In this way, the claimed processor may be referred to as a “NIIAS Processor Brain Interface Cochlear Implant” or a “NIIASP-BICI”. In this example application, the claimed processor may be integrated within the external speech-processor unit of a cochlear implant (CI) with surface electrodes. The surface electrodes may be used to provide an electrode input for the claimed processor. The surface electrodes may be located within the ear canal of a user in order to record ongoing brain activity and customise operation to the user. The claimed processor and combined electrode input may be used to modulate current flow to the electrodes surgically implanted within the cochlea of the inner ear. Potentially, a device utilising the claimed processor would be purchased by the private health sector and the NHS.
Optionally, the claimed processor may also be used within the wider field of robotics systems. In this way, the claimed processor may be referred to as a “NIIAS Processor Robotics” or a “NIIASP-RB”. In this example application, the claimed processor model can also be incorporated into more advanced intelligent-systems designs that can use the resulting improved speech recognition as a front-end for language acquisition and learning, and for higher-level cognitive processing of meaning and emotion.
Optionally, the claimed processor may also be used as an attentional focussing device. In this way, the claimed processor may also be referred to as a “NIIAS Processor Attention” or a “NIIASP-ATT”. In this example application, the claimed processor, in an in-the-ear model with surface electrodes, can also be combined with additional visual pupillometry responses to capture both audio and visual attentional modulation. Attentional changes captured by visual processing can be used to influence the audio event detection, and vice-versa. Such a device, utilising the claimed processor, can be used by individuals to enhance attentional focus (this could include populations with attention-deficit disorders, or areas of work in which enhanced or sustained attention is required), and aspects of such a system could also be used by individuals with impaired hearing.
Further optional features of the invention will now be set out. These are applicable singly or in any combination with any aspect of the invention.
In use, the incoming mixture of speech and background noise is processed by the frontend unit, including the filterbank, and a number of parameters can be modified in response to the received NIFS within the processor.
Optionally, the one or more modified parameters may include: i) a modified gain value and ii) a modified compression value for a given input audio signal, and wherein the frontend unit may be further configured to: apply the modified gain value and the modified compression value to the unfiltered input audio signal by way of modifying the input or parameters of a given filter or range of filters of the filterbank to determine a filtered output audio signal.
A few prior art models have used very limited data relating to neural-inspired feedback, in particular BrM feedback, from humans (e.g. a single time constant). However, all prior art models have used that information in a limited way. For example, in prior art models the effects of the neural-inspired feedback are not tuned across auditory filters; such models have used a limited range of time constants and do not apply the neural-inspired feedback to modify an I/O function or I/O functions, the front-end gain, and the compression within and across auditory filters during the time-course of sound stimulation.
Optionally, the claimed processor may be used within a hearing aid device. In this way, a hearing aid device using the claimed processor may also be referred to as a “NIIAS Processor Hearing Aid” or a “NIIASP-HA”. For example, the processor can be housed in an external casing existing outside of the ear, or in an in-the-ear device, such as within the concha or ear canal. Alternatively, or additionally, the hearing aid device may operate as a behind-the-ear device, accessible and usable by a substantial proportion of the hearing-impaired user market and purchased via the private health sector, independent hearing-device dispensers, or the NHS. In this way, the architecture of the claimed processor can also be used to design cost-effective hearing aids (e.g. by using 3-D printed casings) coupled to mobile phones (to conduct some of the audio-processing) for the hearing-impaired (i.e. referred to as a “NIIASP-HA-Mobile”). In this embodiment, most of the complex audio-processing can be conducted by a smartphone connected by a wireless connection (e.g. Bluetooth™) to the behind-the-ear hearing aid.
Optionally, the audio-signal processor may further comprise: a Higher-Level Auditory Information (HLAI) processing module, comprising an internal memory. The HLAI processing module may be configured to receive human-derived brain-processing information (e.g. such as parameters derived either directly from brain recordings, e.g. ongoing brain recordings via recordings from surface electrodes, or indirectly via pre-recorded or generic human-derived datasets relating to brain activity recordings from humans and/or measurements derived by psychophysical/physiological assessments [e.g. direct ongoing, pre-recorded, or generic human-derived datasets such as from previously published (Yasin et al., 2014; Drga et al., 2016) and unpublished human psychophysical and physiological data]), and store it on its internal memory and, using said brain-processing information, the HLAI may be further configured to simulate aspects of the following, which constitute aspects of the NIFS: i) brainstem-mediated, BrM, neural feedback information; and ii) cortical-mediated, CrM, neural feedback information (including information relating to attention).
Prior-art auditory models have used aspects of BrM neural-inspired feedback based on information derived from small non-human mammalian datasets, rather than derived from humans.
Optionally, the HLAI processing module may be further configured to derive the human-derived NIFS using said simulated and/or direct BrM and/or CrM neural feedback information and relay said NIFS to the frontend unit.
Optionally, the HLAI processing module may be configured to receive the brain-processing information by direct means or by indirect means from higher-levels of auditory processing areas of a human brain, and wherein the brain-processing information may be derived from any one of, or a combination of, the following: psychophysical data, physiological data, electrophysiological data, or electroencephalographic (EEG) data.
Optionally, the human-derived brain-processing information may further comprise a range of time constants which define an exponential build-up and a decay of gain with time derived from human-derived measurements; and wherein the HLAI processing module may be further configured to modify the human-derived NIFS using said time constants in response to the received brain-processing information and relay said NIFS to the frontend unit. The range of time constants can be measured psychophysically from humans. For instance, the inventors have developed a method by which such time constants can be measured from humans and have measured time constants ranging from 110 to 140 ms in humans. In simulating speech recognition effects using an ASR system, the inventors have shown a beneficial effect of a range of time constants having any value between 50 and 2000 ms (Yasin et al., 2020).
In some embodiments of the audio-signal processor, the range of time constants defining the build-up and decay of gain τon and τoff respectively, may extend to any value below 100 ms. For example, the time constants may lie within a range that is a contiguous subset of values lying from 0 (or more) to 100 ms (or less). For example, the range of time constants may be any value between 5 to 95 ms, for example any value between 10 to 90 ms. The range of time constants may be any value between 15 to 85 ms, such as any value between 20 to 80 ms, for example any value between 25 to 75 ms. The range of time constants may be any value between 30 to 70 ms, such as any value between 35 to 65 ms, for example any value between 40 to 60 ms. The range of time constants may be any value between 45 and 55 ms, for example being values such as 46 ms, 47 ms, 48 ms, 49 ms, 50 ms, 51 ms, 52 ms, 53 ms, and/or 54 ms.
In other embodiments, the range of time constants could be any value between 50 to 2000 ms. For example, the time constants may lie within a range that is a contiguous subset of values lying from 50 (or more) to 2000 ms (or less). In other embodiments of the audio-signal processor, the range of time constants may be any value between 90 to 1900 ms, such as any value between 100 to 1800 ms, for example 110 to 1700 ms. The range of time constants may be any value between 120 to 1600 ms, such as any value between 130 to 1500 ms, for example 140 to 1400 ms. The range of time constants may be any value between 150 to 1300 ms, such as any value between 160 to 1200 ms, for example 170 to 1100 ms. The range of time constants may be any value between 180 to 1000 ms, such as any value between 190 to 900 ms, for example 200 to 800 ms. The range of time constants may be any value between 210 to 700 ms, such as any value between 220 to 600 ms, for example 230 to 500 ms. The range of time constants may be any value between 240 to 400 ms, such as any value between 250 to 300 ms.
The enhanced I/O functions (for a range of signal levels and temporal relations), from which gain is estimated, may describe the change in output with respect to the input (and therefore the gain at any given input), the change in gain with input (which defines the compression), and the build-up and decay of gain. The build-up and decay of gain may be specified by time constants τon and τoff respectively, which are also derived from human auditory perception studies involving BrM neural feedback effects (published (Yasin et al., 2014) and unpublished data) and which define the build-up and decay of the filter gain.
Optionally, the range of time constants may comprise human-derived onset time build-up constants, τon, applied to the I/O function(s) stored in the frontend unit to modify the I/O function(s) and derive the enhanced I/O function(s) stored in the front end to modify the rate of increase of the gain value, the effects of which are subsequently applied to the filter or filters of the filterbank.
In some embodiments, the onset time build-up constants τon may be derived from human data relating to both steady-state and modulated noise. In other embodiments, the onset time build-up constants τon may be any time-value constant which is not necessarily human-derived.
In this way, τon can be considered to be a “build-up of gain” time constant.
Optionally, the range of time constants may comprise human-derived offset time decay constants, τoff, applied to the I/O functions stored in the frontend unit to modify the I/O functions and derive the enhanced I/O functions stored in the front end to modify the rate of decrease of the gain value, the effects of which are subsequently applied to the filter or filters of the filterbank. In some embodiments, the offset time decay constants τoff are derived from human data relating to both steady-state and modulated noise. In other embodiments, the offset time decay constants τoff may be any time-value constant which is not necessarily human-derived.
In this way, τoff can be considered to be a “decay of gain” time constant.
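Purely as an illustrative sketch of these “build-up of gain” and “decay of gain” dynamics (the maximum gain change, the time constants, and the first-order exponential form below are hypothetical modelling choices), the time course could be simulated as:

```python
import numpy as np

def gain_trajectory(active, g_max_db=30.0, tau_on=0.05, tau_off=0.12, fs=16000.0):
    """Simulate the build-up and decay of a feedback-driven gain change.

    active  : boolean array, one entry per sample; True while feedback drives the gain
    tau_on  : "build-up of gain" time constant in seconds (e.g. 50 ms, hypothetical)
    tau_off : "decay of gain" time constant in seconds (e.g. 120 ms, hypothetical)
    Returns the instantaneous gain change (dB) per sample, rising exponentially
    towards g_max_db while active and decaying exponentially back towards 0 dB otherwise.
    """
    g = np.zeros(len(active))
    a_on = np.exp(-1.0 / (tau_on * fs))
    a_off = np.exp(-1.0 / (tau_off * fs))
    for n in range(1, len(active)):
        if active[n]:
            g[n] = g_max_db + (g[n - 1] - g_max_db) * a_on   # exponential build-up
        else:
            g[n] = g[n - 1] * a_off                          # exponential decay
    return g
```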
There may be a continuum of gain values derived from human datasets, dependent on input sound aspects such as level, and temporal characteristics which define the filter gain applied.
In some embodiments, the modified gain values may be a continuum of gain values, derived from human data that may be in the following range: 10 to 60 dB. In some embodiments, the gain values are derived from human data relating to both steady-state and modulated noise.
Optionally, the modified gain values may be a continuum of gain values that have values anywhere from 0 to 60 dB, depending on the external averaged sound level (as processed via the filterbank) and the current instantaneous sound level. In other embodiments, the continuum of gain values may be any continuum of numerical gain values which are not necessarily human-derived.
Optionally, the modified gain values are a continuum of gain values that may have a value of 10 dB or more, 15 dB or more, 20 dB or more, 25 dB or more or 30 dB or more; and may have a value of 60 dB or less, 55 dB or less, 50 dB or less, 45 dB or less, or 40 dB or less. The continuum of gain values may have a total range of 10 to 60 dB or the continuum of gain values may cover a range of: 15 to 55 dB, a range of 20 to 50 dB, or a range of 25 to 45 dB. The continuum of gain values may have a range of 30 to 40 dB, for example a value of 35 dB would fit within the range.
In some embodiments, the modified gain values may be any value greater than 60 dB, such as being a continuum of gain values that fall within a range and where the upper and lower boundaries of that range are greater than 60 dB.
The gain may be obtained from a continuum of gain values derived from enhanced I/O functions inferred from simulation studies using unmodulated and modulated sounds (with a range of signal levels and temporal settings) from Yasin et al. (2020), secondary unpublished data analyses based on data from Yasin et al. (2014), and unpublished human data using unmodulated and modulated signals.
There may be a continuum of compression estimates (the change in gain), also derived from human datasets, dependent on input sound aspects, wherein the input sound aspects comprise a sound level and temporal characteristics which define the compression applied. In contrast, other current models use a “broken-stick” function to apply compression, providing only a limited range of compression values (e.g. Ferry and Meddis, 2007).
In some embodiments of the audio-signal processor, the modified compression values may be a continuum of compression values that may be in the following range: 0.1 to 1.0. In some embodiments, the compression values are derived from human data relating to both steady-state, unmodulated, and modulated signals.
In other embodiments of the audio-signal processor, the continuum of compression values may include any compression value between 0.15 to 0.95 inclusive, any value between 0.20 to 0.90 inclusive, or any value between 0.25 to 0.85 inclusive. The continuum of compression values may cover values within the range of 0.25 to 0.80 inclusive, such as any value between 0.30 to 0.75 inclusive, for example 0.35 to 0.70 inclusive. The continuum of compression values may include any value between 0.40 to 0.65 inclusive, such as any value between 0.45 to 0.60 inclusive, for example 0.50 to 0.65 inclusive. The continuum of compression values may include any value between 0.55 to 0.60.
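The distinction from a “broken-stick” compression scheme may be illustrated as follows (a sketch only; the knee-points, slopes, and the smooth slope curve are hypothetical and are not taken from the human datasets):

```python
import numpy as np

def broken_stick_output(i_db, knee_lo=30.0, knee_hi=90.0, comp_ratio=0.2, gain_db=40.0):
    """Classic "broken-stick" I/O: linear below and above two knee-points with a single
    fixed compression slope in between, giving only a limited set of compression values."""
    i_db = np.asarray(i_db, dtype=float)
    mid = knee_lo + gain_db + comp_ratio * (i_db - knee_lo)
    top = knee_lo + gain_db + comp_ratio * (knee_hi - knee_lo) + (i_db - knee_hi)
    return np.where(i_db < knee_lo, i_db + gain_db, np.where(i_db > knee_hi, top, mid))

def continuum_output(i_db, gain_db=40.0):
    """Smoothly varying I/O: the local slope (compression) changes continuously with
    input level, so any compression value within a continuum (e.g. 0.1-1.0) can occur."""
    levels = np.linspace(0.0, 120.0, 1201)
    # slope eases smoothly from ~1.0 at low levels towards ~0.2 at high levels
    slope = 0.2 + 0.8 / (1.0 + np.exp((levels - 50.0) / 10.0))
    out = gain_db + np.concatenate(
        [[0.0], np.cumsum(0.5 * (slope[1:] + slope[:-1]) * np.diff(levels))])
    return np.interp(np.asarray(i_db, dtype=float), levels, out)
```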
Optionally, the filterbank within the frontend unit may be further configured so as to modify a bandwidth of each of the one or more bandpass filters.
In this way, the applied gain (and thereby effect of any neural-inspired feedback) may be applied per filter (channel) as well as across filters (e.g., Drga et al., 2016).
Optionally, the modified gain and compression values may be: i) applied to the input audio signal per bandpass filter in the array of overlapping bandpass filters; or ii) applied to the input audio signal across some or all bandpass filters in the array of overlapping bandpass filters of the filterbank.
Optionally, the audio-signal processor further comprises a sound feature onset detector configured to receive the filtered output audio signal from the frontend unit (e.g. where the frontend unit houses the filterbank, the internal memory, and the sound level estimator) and detect sound feature onsets, and wherein the sound feature onset detector may be further configured to relay the sound feature onsets to the HLAI processing module, and the HLAI processing module may be configured to store said sound feature onsets on its internal memory for determining the NIFS.
Optionally, the sound feature onset detector may be further configured to relay the filtered output audio signal to the HLAI processing module and the HLAI processing module configured to store the filtered output audio signal on its internal memory.
Optionally, the audio-signal processor may further comprise a signal-to-noise ratio, SNR, estimator module configured to receive the filtered output audio signal from the frontend unit and determine a SNR of the mixture of the signal-of-interest and the background noise, and wherein the SNR estimator module may be further configured to relay the SNR to the HLAI processing module, and the HLAI processing module may be configured to store said SNR on its memory for determining the NIFS. In this way, the SNR estimator module uses a changing temporal window to determine an ongoing estimate of the signal-to-noise ratio (SNR) values of the filtered output signal from the frontend unit.
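A minimal sketch of such an ongoing SNR estimate is given below, assuming (for illustration only) that the noise level is tracked as the minimum short-term energy over a trailing window; the window durations are hypothetical:

```python
import numpy as np

def running_snr_db(x, fs=16000.0, win_s=0.02, noise_track_s=0.5):
    """Estimate an ongoing SNR (dB) from a filtered output signal.

    Short-term energy is measured in windows of win_s seconds; the noise energy is
    approximated as the minimum short-term energy over the last noise_track_s seconds
    (a simple minimum-statistics style tracker), and the SNR follows from the ratio.
    """
    win = max(1, int(win_s * fs))
    n_frames = len(x) // win
    frame_energy = np.array(
        [np.mean(x[i * win:(i + 1) * win] ** 2) for i in range(n_frames)])
    track = max(1, int(noise_track_s / win_s))
    snr_db = np.zeros(n_frames)
    for i in range(n_frames):
        noise = np.min(frame_energy[max(0, i - track + 1):i + 1]) + 1e-12
        snr_db[i] = 10.0 * np.log10(frame_energy[i] / noise)
    return snr_db
```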
Optionally, the audio-signal processor further comprises a machine learning unit comprising a decision device in data communication with the HLAI processing module, the decision device comprising an internal memory, the decision device may be configured to receive data from the HLAI processing module and store it on its internal memory, wherein the decision device may be configured to process the data and output a speech-enhanced filtered output audio signal.
Optionally, the HLAI and the decision device may utilise pre-recorded generic human data in conjunction with the machine learning unit (also known as a “deep learning component”). For this reason, this embodiment may not have the degree of customisation and individualisation for use in a stand-alone hearing aid (as previously described). This embodiment of the processor may at least be able to access a substantial population and link up with healthcare providers in the developing world in order to provide a long-term ongoing provision, with minimal upkeep.
However, as digital healthcare evolves, a hearing aid device using the claimed processor may be able to operate as a customised stand-alone device (using either directly recorded information and/or pre-recorded information) remotely adapted by a centralised healthcare system. For example, distributing the audio-processing between a smartphone and hearing aid may be able to reduce overall cost to the user and make the system more accessible to a much larger population. Advantageously, the development of a core auditory model of the claimed processor with improved speech recognition in noise can be incorporated into cost-effective hearing assistive devices linked to mobile phones in order to provide much of the developed, and developing world, with robust and adaptable hearing devices.
The machine learning unit may also be referred to as a “Semi-Supervised Deep Neural Network” or a “SSDNN”. The claimed processor may use the neural feedback information (derived from any one of or a combination of the following: psychophysical data, physiological data, electrophysiological data, or electroencephalographic (EEG) data) from both the human brainstem and/or cortex (e.g. including attentional oscillations), in association with incoming sound feature extraction combined with sound feature onset detection information in order to inform both the NIFS and the decision device using the SSDNN, with capacity to learn and improve speech recognition capability over time, with ability to be customised to the individual through a combination of SSDNN and further direct/indirect recordings.
When comparing the novel architecture of the (NIIAS) audio-signal processor with known audio processors/devices it is evident that, although they may include one or more of the elementary characteristic processing stages, they do not combine the processing in the way described or include other key components of the architecture such as BrM neural feedback connected with the CrM neural processing component, the SNR extraction combined with the feature extraction, the decision device as embedded within the SSDNN architecture and connected with the CrM neural processing component to enhance speech in noise.
Optionally, the audio-signal processor may further comprise a feature extraction, FE, module, said FE module may be further configured to perform feature extraction on the filtered output audio signal, and the FE module may be further configured to relay the extracted features to the machine learning unit, and the decision device is configured to store the extracted features in its internal memory.
Optionally, the claimed processor can be housed in a casing embedded with surface electrodes to make contact with the ear canal or outer concha area to record activity from the brain. In this way, the claimed processor may also be referred to as a “NIIAS Processor Ear-Brain Interface” or “NIIASP-EBI”. In this example application, surface electrodes record ongoing brain activity and customise operation to the user. In this example application, the HLAI component and decision device can use direct brain activity in conjunction with the machine learning unit, to customise the device to the users' requirements. The device can be used by normal-hearing individuals as an auditory enhancement device, for focussing attention, or by hearing impaired as an additional component to the hearing aids described earlier. Such a device can be purchased commercially (e.g. as an auditory/attentional enhancement device) or health sector/independent hearing-device dispensers (e.g. as a hearing aid).
Optionally, the SNR estimator module may be configured to relay the filtered output audio signal with the SNR estimation values to the FE module, the FE module is configured to relay the filtered output audio signal to the machine learning unit, and the decision device is configured to store the filtered output audio signal in its internal memory.
Optionally, the decision device may be configured to process: the data received from the HLAI processing module, including the SNR values; and the extracted features received from the FE module; and to output a speech-enhanced filtered output audio signal.
Optionally, the machine learning unit further comprises a machine learning algorithm stored on its internal memory, and wherein the decision device applies an output of the algorithm to the data received from the HLAI processing module, including the SNR values and the extracted features, and derives neural-inspired feedback parameters.
Optionally, the machine learning algorithm may encompass a combination of both supervised and unsupervised learning using distributed embedded learning frames. In other embodiments, the machine learning algorithm may use feature extraction information and the input from the SNR estimator module to learn dependencies between the signal and feature extraction, using input from the HLAI to predict optimal HLAI and subsequently NIFS values over time.
Optionally, the claimed processor may also be used as a brain-ear interface, designed as part of an in-the-ear device for enhanced audio experience when using virtual reality displays/systems. In this way, the claimed processor may also be referred to as “NIIAS Processor Virtual/Augmented Reality” or “NIIASP-VAR”. In this example application, the claimed processor may be incorporated into a device in electronic communication with surface electrodes within the ear of a user in order to record ongoing brain EEG signals for monitoring attention shifts due to ongoing audio and visual input, for an enhanced user experience in virtual reality/augmented reality environments. For instance, the processor can be used to direct the user towards augmented/virtual reality scenes/events based on prior or ongoing brain behaviour and enhance attentional focus. Such pre-attentional activity can be used as additional parameters for the machine learning algorithm to predict user audio-visual and attentional behaviour in VR/AR environments.
Optionally, the derived neural-inspired feedback parameters may be relayed, from the decision device, to the HLAI processing module and the HLAI processing module may be configured to store said neural-inspired feedback parameters on its memory for determining the NIFS. In other words, the HLAI processing module can use information from the machine learning unit, higher-level brain processing data (i.e. BrM and CrM feedback data), sound feature onsets, and SNR data, in order to optimise the parameters of the NIFS sent to be applied at the level of the filterbank.
In this way, the incoming mixture of speech and background noise is processed by the enhanced filterbank unit, with a number of attributes that can be adapted by neural-inspired feedback within the processor, such as the filter gain, the change in gain with respect to the input (compression), the build-up and decay of gain (τon and τoff) and the filter tuning (associated with the change in gain). The input sound level may also be estimated per filter channel and across filter channels.
In this way, the audio-signal processor of the present application may be used as a core processor for the variety of uses previously discussed herein.
According to a second aspect, there is provided a method of filtering an audio signal-of-interest from an input audio signal comprising a mixture of the signal-of-interest and background noise, the method performed by a processor comprising a frontend unit, the frontend unit comprising a filterbank, the filterbank comprising one or more bandpass filters, a sound level estimator, and a memory with an input-output, I/O, function(s) stored on said memory, wherein the filterbank is configured to perform the following method steps:
Optionally, the one or more modified parameters include: a modified gain value and a modified compression value for a given input audio signal, and wherein the filterbank is further configured to perform the following method steps: applying the modified gain value and the modified compression value to the unfiltered input audio signal by way of modifying the input or parameters of a given filter or range of filters of the filterbank; and determining a filtered output audio signal.
Embodiments of the invention will now be described by way of example with reference to the accompanying drawings in which:
Aspects and embodiments of the present invention will now be discussed with reference to the accompanying figures. Further aspects and embodiments will be apparent to those skilled in the art. All documents mentioned in this text are incorporated herein by reference.
The frontend 120 and the HLAI processing module 140 may take the form of sub-processors that are part of the same audio-signal processor 100. In the embodiment shown in
The receiving unit 110 is any device that converts sound into an electrical signal, such as an audio microphone or a transducer as are known in the art. As is also known in the art, the filterbank 121 includes one or more bandpass filters (not shown in the figures). For example, the one or more bandpass filters are an array (or “bank”) of overlapping bandpass filters.
The frontend unit 120 includes the filterbank 121, its own internal (e.g. built-in) storage memory 122 and a sound level estimator 123. A sound level is estimated by the sound level estimator 123 per channel and/or summed across filter channels and is used to select the appropriate I/O function parameters.
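A sketch of such per-channel and summed-across-channel level estimation is given below (RMS-style smoothing expressed in dB; the smoothing window is a hypothetical choice):

```python
import numpy as np

def channel_levels_db(band_outputs, fs=16000.0, win_s=0.01):
    """Estimate a smoothed level (dB) per filter channel and a summed level across
    channels, as may be used to select the appropriate I/O function parameters.

    band_outputs : array of shape (num_channels, num_samples) from the filterbank
    """
    win = max(1, int(win_s * fs))
    kernel = np.ones(win) / win
    power = np.array([np.convolve(ch ** 2, kernel, mode="same") for ch in band_outputs])
    per_channel_db = 10.0 * np.log10(power + 1e-12)
    summed_db = 10.0 * np.log10(np.sum(power, axis=0) + 1e-12)
    return per_channel_db, summed_db
```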
Input/Output (I/O) gain functions (hereafter referred to as “I/O functions”) are stored on the memory 122 of the frontend unit 120. As is known in the art, I/O functions can be represented graphically, showing how an output signal level (e.g. of a hearing aid) varies at various input signal levels. In this way, as is also known in the art, I/O functions can be used to determine the gain (in decibels), G (i.e. gain (G)=output (O)−input (I)) with respect to a given input (I) signal level. The I/O functions can also be used to determine a change in gain (ΔG) with respect to a change in a given input (I) signal level. Sometimes, the change in gain (ΔG) is referred to as “compression”.
The HLAI processing module 140 receives human-derived brain-processing information to generate or derive human-derived Neural-Inspired Feedback Signals (NIFS) and stores them on its own internal (e.g. built-in) storage memory 142. To do this, the HLAI processing module 140 receives human-derived brain-processing information 144 (referring to parameters derived from brain recordings (direct ongoing, pre-recorded or generic human-derived datasets) and/or measurements derived by psychophysical/physiological assessments (direct ongoing, pre-recorded or generic human-derived datasets)), and stores it on its internal memory 142 and, using said brain-processing information 144, the HLAI processing module 140 simulates: i) brainstem-mediated (BrM) neural feedback information and ii) cortical-mediated (CrM) neural feedback information (including information relating to attention). The HLAI processing module 140 then derives the human-derived NIFS using the simulated BrM and/or CrM neural feedback information and relays the NIFS to the frontend unit 120. In addition, the HLAI processing module 140 may store the derived NIFS on its internal memory 142. Alternatively, or additionally, the HLAI processing module 140 modifies the human-derived NIFS in response to the received human-derived brain-processing information 144 and relays said NIFS to the frontend unit 120.
The HLAI processing module 140 is used to improve decision capability within the audio-signal processor 100. The brain-processing information 144 may include psychophysical, physiological, electroencephalographic (EEG) or other electrophysiological/electroencephalographic derived measurements, obtained by direct means (electroencephalographic (EEG) or other electrophysiological/electroencephalographic derived measurements) or indirect means (psychophysical, physiological), and be measured in real-time (ongoing) or pre-recorded and stored. The HLAI processing module 140 receives the brain-processing information 144 by a direct ongoing means of brain recordings from higher-levels of auditory processing areas of a human brain (e.g. from the brainstem/cortex), such as using EEG data (e.g., event-related, ongoing, oscillatory, attentional) retrieved from contact-electrodes, for example. Alternatively, the HLAI processing module 140 receives the brain-processing information 144 by a direct pre-recorded means of brain recordings (pre-recorded from either the user or generic human-derived datasets of higher-level processing from auditory processing areas and associated areas of a human brain). Alternatively, the HLAI processing module 140 receives the brain-processing information 144 by an indirect means (ongoing recorded from the user) derived by psychophysical/physiological assessments. Alternatively, the HLAI processing module 140 receives the brain-processing information 144 by an indirect means (pre-recorded from the user/generic human-derived datasets) derived by psychophysical/physiological assessments.
The pre-recorded and stored generic human-derived datasets may be updated as required. In both cases, (i.e. by using direct or indirect means) the brain-processing information 144 is derived from any one or a combination of the following: psychophysical data, physiological data, electrophysiological data, or electroencephalographic (EEG) data.
In use, the frontend unit 120 receives: an unfiltered input audio signal 111 from the receiving unit 110 and the human-derived NIFS from the HLAI processing module 140. The frontend unit 120 extracts sound level estimates from an output of the one or more bandpass filters of the filterbank 121, using the sound level estimator 123. In this way, the sound level estimator 123 estimates a sound level output from the array of overlapping filters.
The frontend unit 120 modifies the I/O function(s) stored on the memory 122 in response to the received sound level estimates and the NIFS and determines enhanced I/O function(s). As an I/O function is modified it becomes, as the name suggests, a more “enhanced” I/O function. The frontend unit 120 then stores the enhanced I/O function on its memory 122 (e.g. for reference or later use). The frontend unit 120 uses the enhanced I/O function to determine one or more modified filterbank parameters of the filterbank 121 in response to the received NIFS from the HLAI processing module 140. This is an “enhanced” I/O function because previous models have used only a “broken-stick” function to model the I/O stage.
The one or more modified parameters determined by the enhanced I/O function include: i) a modified gain value and ii) a modified compression value for a given input audio signal. The frontend unit 120 stores the modified gain value and the modified compression value onto its memory 122. At a later time, the frontend unit 120 will retrieve the modified gain value and the modified compression value from its memory 122 and apply them to the unfiltered input audio signal 111 at the level of the filterbank 121, and determine a filtered output audio signal 112.
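For illustration only, the above frontend steps may be tied together as sketched below, where io_modifier is a hypothetical stand-in for looking up the modified gain and compression values from the enhanced I/O function(s), already adjusted by the NIFS; the per-sample compression form is an illustrative assumption and not the claimed implementation:

```python
import numpy as np
from scipy.signal import sosfilt

def frontend_process(x, sos_bank, io_modifier):
    """Sketch of the frontend loop: filter the unfiltered input, estimate a sound level
    per channel, obtain the modified gain/compression for that level from the enhanced
    I/O function, apply them at the level of the filterbank, and sum the channels to a
    filtered output signal.

    io_modifier(level_db) is assumed to return (gain_db, compression) for a channel.
    """
    out = np.zeros(len(x))
    for sos in sos_bank:
        ch = sosfilt(sos, x)
        level_db = 10.0 * np.log10(np.mean(ch ** 2) + 1e-12)
        gain_db, comp = io_modifier(level_db)
        env = np.abs(ch) + 1e-12
        compressed = np.sign(ch) * env ** comp            # crude level-domain compression
        out += compressed * 10.0 ** (gain_db / 20.0)      # apply the modified gain
    return out

# Hypothetical usage: y = frontend_process(x, sos_bank, lambda level_db: (20.0, 0.5))
```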
Referring to
As shown in
As shown in
In the method of adjusting the filter(s) of the filterbank 121 in response to the obtained one or more modified parameters, the one or more modified parameters include: a modified gain value and a modified compression value for a given input audio signal. As such, the frontend unit 120 is further configured to perform the following method steps: vii) applying the modified gain value and the modified compression value to the unfiltered input audio signal 111, by adjusting parameters associated with the filter(s) of the filterbank 121, and then viii) determining a filtered output audio signal 112.
The human-derived brain-processing information 144 further includes a range of time constants (τ). In simulating speech recognition effects using an ASR system, the inventors have shown a beneficial effect of a range of time constants having any value between 50 and 2000 ms (Yasin et al., 2020). The range of time constants (τ) defines a build-up and a decay of gain with time derived from human-derived measurements. The HLAI processing module 140 derives the human-derived NIFS using said time constants and relays said NIFS to the frontend unit 120. Alternatively, or additionally, the HLAI processing module 140 modifies the human-derived NIFS using said time constants in response to the received human-derived brain-processing information 144 and relays said NIFS to the frontend unit 120. The range of time constants is measured psychophysically from humans, typically covering the range 50-2000 ms. Optionally, the range of time constants may include values between 110 to 140 ms (Yasin et al., 2014). In other words, the HLAI processing module 140 uses information from the machine learning unit, higher-level brain processing data (i.e. BrM and CrM feedback data), sound feature onsets, and SNR data, in order to optimise the parameters of the NIFS sent to be applied at the level of the frontend unit 120.
The BrM neural inspired feedback uses a range of human-derived onset and offset decay time constants (τon and τoff, respectively) associated with measured BrM neural feedback. The front-end gain and compression parameters are adaptable, dependent on the BrM neural-inspired feedback parameters, such as the time constants (τon and τoff). The range of time constants (τ) includes onset time build-up constants (τon). In response to receiving the human-derived brain-processing information 144 in the form of onset time build-up constants (τon), the HLAI processing module 140 derives NIFS that are used by the frontend unit 120 to modify the enhanced I/O function(s) stored on the memory 122 of the frontend 120 to modify the rate of increase of the gain value applied to filter or filters of the filterbank 121. In this way, the onset time build-up constants, (τon) can be considered “build-up of gain” time constants.
The range of time constants (τ) may include offset time decay constants (τoff). In response to receiving the human-derived brain-processing information 144 in the form of offset time decay constants (τoff), the HLAI processing module 140 derives NIFS that are used by the frontend unit 120 to modify the enhanced I/O function(s) stored on the memory 122 of the frontend 120 to modify the rate of decrease of the gain value. In this way, the offset time decay constants (τoff) can be considered “decay in gain” time constants.
The BrM neural-inspired feedback is “tuned” across a given frequency range, and thus across one or more filters of the filterbank 121 (within the frontend unit 120), as shown to be the case in humans (Drga et al., 2016). This tuned BrM neural feedback response is adaptable, dependent on the auditory input and internal processing. In an example, the time constants (τon and τoff) associated with the BrM neural-inspired feedback are dependent on the auditory input.
Frequency “tuning” of the neural feedback may be dependent on the strength of the feedback (and by association gain and compression modulation) as well as optimal parameters of the time constants associated with the feedback. The neural feedback time course may comprise a range of time constants (dependent on the audio input), with values derived from either or both of physiological data and human psychophysical data. The values of gain and compression (dependent on the audio input) and their modulation by the neural feedback will be modelled on human data. Published and unpublished datasets are used to model the front-end components of the processor. Yasin et al. (2014) have published methodologies that can be used in humans to estimate the time constants associated with this neurofeedback loop (these studies use unmodulated and modulated noise with a range of neural feedback time constants to estimate speech recognition in noise).
Modelled features of this data set (published (Yasin et al., 2014, Drga et al., 2016; Yasin et al., 2018; 2020) and unpublished) are used in the audio-signal processor to alter parameters of gain, compression and neural tuning, dependent on the time-constant of the neural feedback. Unpublished datasets (Yasin et al.) using modulated sounds (more representative of the external sounds and speech most often encountered) will also be used (providing a wider range of time constants associated with the neural feedback) to further enhance speech in noise. The modified gain values are a continuum of gain values, example values of which have already been described herein.
In response to the received NIFS, the frontend unit 120 may modify a bandwidth of one or more of the bandpass filters in the array of overlapping bandpass filters comprising the filterbank 121. In this way, the frontend unit 120 performs a process of “filter tuning” associated with the change in gain. The modified gain and compression values are either applied to the input audio signal 111 per bandpass filter in the array of overlapping bandpass filters, or applied to the input audio signal 111 across some or all bandpass filters in the array of overlapping bandpass filters of the filterbank 121, within the frontend unit 120. The front-end gain and compression parameters are modelled using human-derived data in response to steady-state and modulated sounds.
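As a hypothetical illustration of such “filter tuning”, a single channel's bandwidth could be re-derived by recomputing its bandpass filter with a scaled bandwidth; the quality factor and the link between the gain change and the bandwidth scaling below are illustrative assumptions only:

```python
from scipy.signal import butter

def retune_channel(fc, q_base=4.3, bandwidth_scale=1.0, fs=16000.0, order=2):
    """Recompute one bandpass filter with a modified bandwidth.

    bandwidth_scale > 1 broadens the filter (illustrating broader tuning when the
    feedback reduces gain); bandwidth_scale < 1 sharpens it. q_base is a nominal
    quality factor chosen for illustration.
    """
    bw = (fc / q_base) * bandwidth_scale
    lo = max(fc - bw / 2.0, 1.0)
    hi = min(fc + bw / 2.0, 0.45 * fs)
    return butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
```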
The decision device 151 receives data 154 from the HLAI processing module 140 and stores it on its internal memory 152. The data 154 may include the human-derived brain-processing information 144 as previously described. The data 154 may instead include any one, or all, of the data that is stored on the internal memory 142 as previously described. For example, the data 154 may include any or all of the following: the human-derived brain-processing information 144 (with which to derive the simulated BrM feedback information and/or the simulated CrM neural feedback information), the derived or determined NIFS, the filtered output audio signal 112, and the sound feature onsets 113. As shown by the double-headed arrows in
The SNR estimator module 160 is in data communication with the frontend unit 120 and the HLAI processing module 140. The SNR estimator module 160 receives the filtered output audio signal 112 from the frontend unit 120 and determines signal-to-noise ratio (SNR) values 116 of the filtered output audio signal 112. The determined SNR values 116 represent a signal-to-noise ratio of the mixture of the signal-of-interest and the background noise, plus parameters associated with the estimation. In one example, the SNR estimator module 160 uses a changing temporal window to determine an ongoing estimate of the signal-to-noise ratio (SNR) values 116 of the filtered output audio signal 112. The SNR estimator module 160 then relays the determined SNR values 116 to the HLAI processing module 140, and the HLAI processing module 140 stores the SNR values 116 on its memory 142 for determining the NIFS.
As shown by the double-headed arrows in
The FE module 170 is in data communication with the SFOD 130, the SNR estimator module 160, and the MLU 150. The SNR estimator module 160 relays the filtered output audio signal 112 to the FE module 170. The FE module 170 then performs feature extraction on the filtered output audio signal 112 received from the SNR estimator module 160 in order to derive extracted features 117. The FE module 170 then relays the extracted features 117 to the MLU 150, and the decision device 151 is configured to store the extracted features 117 in its internal memory 152.
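The following Python sketch shows, purely by way of illustration, the kind of frame-based feature extraction the FE module 170 could perform; the particular features (per-frame log energy and spectral centroid) and the frame sizes are assumptions, as the specific features are not prescribed herein.

```python
import numpy as np

def extract_frame_features(x, fs, frame_s=0.02, hop_s=0.01):
    """Illustrative frame-based features: log energy and spectral centroid per frame."""
    frame, hop = int(frame_s * fs), int(hop_s * fs)
    feats = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame] * np.hanning(frame)   # windowed frame
        spec = np.abs(np.fft.rfft(seg))
        freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
        log_energy = np.log(np.sum(seg ** 2) + 1e-12)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        feats.append((log_energy, centroid))
    return np.array(feats)   # shape (n_frames, 2); relayed to the MLU 150 as features 117
```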
Alternatively, or additionally, as is shown in
In another embodiment, the HLAI processing module 140 modifies the human-derived NIFS using said time constants in response to the received data 154 (which at least includes: the filtered output audio signal 112, the sound feature onsets 113, and the SNR values 116) and relays said NIFS to the frontend unit 120.
The decision device 151 of the MLU 150 processes the data 154 received from, or exchanged with, the HLAI processing module 140. The data 154 includes the SNR values 116, which are readily exchanged between the HLAI processing module 140 and the SNR estimator module 160. In addition to the SNR values 116, the MLU 150 processes the extracted features 117 received directly from the FE module 170 and outputs a speech-enhanced filtered output audio signal 114. The decision device 151 also takes into account information regarding sound feature onsets and attentional oscillations that can be used to improve detection performance.
In summary of example working steps of the audio-processor 400 shown in
The MLU 150 includes a machine learning algorithm (not shown in the figures) that is stored on an internal memory, such as the internal memory 152 of the decision device 151. The decision device 151 applies the algorithm to the data 154 received from the HLAI processing module 140 (which at least includes the SNR values 116), together with the extracted features 117 received directly from the FE module 170, to derive neural-inspired feedback parameters. In this way, the SNR values 116 are combined with the extracted features 117 and used by the MLU 150 to estimate appropriate neural-inspired feedback parameters. The MLU 150 and/or the machine learning algorithm may be referred to as a “Semi-Supervised Deep Neural Network” or “SSDNN”. The SSDNN incorporates input from the HLAI to enhance speech detection in noisy backgrounds. For example, the decision device 151 uses inputs from the HLAI, the extracted features 117, the SNR values 116 and the SSDNN to optimise speech recognition in noise.
The machine learning algorithm encompasses a combination of both supervised and unsupervised learning using distributed embedded learning frameworks. The machine learning algorithm uses feature-extraction information (i.e. the extracted features 117 received directly from the FE module 170) and the SNR values 116 contained in the data 154 (e.g. as received from the HLAI processing module 140) and “learns” dependencies between the signal and the extracted features. In this way, the audio-signal processor uses input from the HLAI processing module 140 to predict optimal Higher-Level Auditory Information (HLAI) over time. In other words, the SSDNN will “learn”, over time (e.g. through training with a speech corpus with/without noise) and through exposure to varied acoustic environments, the optimal parameters for speech enhancement in noise for a user. For example, the decision device 151 uses inputs from the HLAI (such as those contained within the data 154), the extracted features 117, the SNR values 116 and the SSDNN to optimise speech recognition in noise. The audio-signal processor parameters are optimised by the SSDNN over time. Measurements of brain-derived HLAI will feed into the decision device 151 and into aspects of the neural-inspired feedback process of the model.
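By way of a non-limiting illustration, the following Python sketch shows how SNR values and extracted features could be combined by a small feed-forward network to produce neural-inspired feedback parameters; the architecture, the choice of output parameters (a feedback strength and a time constant) and the value ranges are assumptions made only for illustration and do not represent the trained SSDNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_in, n_hidden, n_out):
    # Randomly initialised weights standing in for a trained network;
    # the architecture and layer sizes are assumptions, not specified herein.
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def predict_feedback_params(params, snr_db, features):
    # Combine an SNR value (116) with extracted features (117) and map them to
    # two illustrative feedback parameters: a strength in [0, 1] and a time constant.
    x = np.concatenate(([snr_db], features))
    h = np.tanh(x @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    strength = 1.0 / (1.0 + np.exp(-out[0]))       # sigmoid -> [0, 1]
    tau = 0.025 + 0.275 / (1.0 + np.exp(-out[1]))  # maps to roughly 25-300 ms (assumed range)
    return strength, tau

# Example: one SNR value plus two features -> two feedback parameters
mlp = init_mlp(n_in=3, n_hidden=8, n_out=2)
strength, tau = predict_feedback_params(mlp, snr_db=5.0,
                                         features=np.array([-4.2, 1200.0]))
```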
The filtered output audio signal 112 may be analysed simultaneously by the SNR estimator module 160 to estimate the SNR values 116 and by the FE module 170 to estimate the extracted features 117. Alternatively, the filtered output audio signal 112 may be analysed by the SNR estimator module 160 to estimate the SNR values 116 before being analysed by the FE module 170 to estimate the extracted features 117.
The machine learning algorithm-derived neural-inspired feedback parameters are relayed, from the decision device 151 to the HLAI processing module 140 as part of the data 154 readily exchanged between the HLAI processing module 140 and the decision device 151 and/or the MLU 150. The HLAI processing module 140 then stores the neural-inspired feedback parameters on its memory 142 for determining the NIFS.
As an example, the decision device 151, within the SSDNN architecture of the signal processor 400, will incorporate oscillatory input information reflecting cortical- and/or brainstem-level changes estimated from incoming stimulus-onset information, which may be captured within the human-derived brain-processing information 144 as well as by elements of the SFOD 130. The decision device 151, in conjunction with the SNR estimator module 160 and via the HLAI processing module 140, will inform optimal neural feedback parameter selection. Two-way interchange of information between the HLAI processing module 140 and the decision device 151 will allow for further optimisation during a “training phase” of the SSDNN. In one example, the decision device 151 may combine information about CrM neural feedback (e.g. attentional processing, including attentional oscillations that can improve detection performance) obtained from human data and/or directly from the brain via sensors placed in or around the ear, as captured by the human-derived brain-processing information 144.
The previously described SSDNN (or deep neural network) and the incorporated decision device 151 are used to select the most appropriate temporal features, aspects of neural feedback, and noise/speech parameters to optimise speech enhancement. The combined inputs of feature extraction and SNR feed into the machine-learning component of the model. The SNR is estimated from the incoming speech-and-noise mixture and used to select the appropriate feedback time constant for optimising speech enhancement in noise. To accomplish the SNR estimation, the following published and unpublished datasets are used. Yasin et al. (2018) have published some of the relationships between the SNR and speech-recognition performance in steady-state noise using an alternative computational model. SNR/speech-recognition performance functions, derived for both steady-state noise and a range of modulated noise, are used to optimise performance of the (NIIAS) audio-signal processor 100, 200, 300, 400.
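As a purely illustrative sketch, the following Python function maps an estimated SNR and the noise type to a feedback time constant; the breakpoints and time-constant values are placeholders chosen for illustration and are not the published SNR/speech-recognition performance functions of Yasin et al. (2018).

```python
def select_time_constant(snr_db, noise_is_modulated):
    """Illustrative mapping from estimated SNR and noise type to a
    neural-inspired feedback time constant (seconds); values are placeholders."""
    if noise_is_modulated:
        return 0.050 if snr_db < 0.0 else 0.150   # shorter constants for fluctuating noise (assumed)
    return 0.100 if snr_db < 0.0 else 0.300       # longer constants for steady-state noise (assumed)

# Example: steady-state background noise at -3 dB estimated SNR
tau = select_time_constant(snr_db=-3.0, noise_is_modulated=False)
```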
Yasin et al. (2018; 2020) have published data showing how different neural time constants can be used to improve speech recognition in noise. The feasibility of using different neural-inspired time-constants for improved speech recognition in noise (for differing background noises) has been demonstrated using a simple model, as shown in
Priority application: 2115950.4, filed Nov 2021, GB (national).
Filing document: PCT/EP2022/080302, filed 28 Oct 2022 (WO).