The present disclosure relates to methods for performing personalized bandwidth extension on an audio signal, and related audio devices configured for carrying out the methods.
Bandwidth extension of signals is a well-known technique used in expanding the frequency range of a signal. Bandwidth extension is a solution often used to generate the missing content of a signal or to restore deteriorated content of a signal. The missing or deteriorated content may occur as the result of a communication channel, signal processing, background noise or jammer signals.
Audio codecs are one area where bandwidth extension is utilized. For example, when an audio signal is transmitted from a far-end station, the audio signal may be encoded to a limited bandwidth to save bandwidth over the transmission channel, and at the near-end station, bandwidth extension is utilized to extend the bandwidth of the received encoded signal.
A purpose of bandwidth extension is to improve the perceived sound quality for the end user. It may also be used to generate new content to replace parts of a signal dominated by noise, thus providing for a certain level of denoising.
Most implementations of previously presented methods for bandwidth extension, such as spectral band replication (SBR) or the approach used in the G.729.1 codec, use a generalized approach in which a one-size-fits-all mentality is employed. Such a generalized approach may lead to a sub-optimal user experience. Attempts have been made to arrive at a more personalized bandwidth extension model.
WO 2014126933 A1 discloses a personalized (i.e., speaker-derivable) bandwidth extension in which the model used for bandwidth extension is personalized (e.g., tailored) to each specific user. A training phase is performed to generate a bandwidth extension model that is personalized to a user. The model may be subsequently used in a bandwidth extension phase during a phone call involving the user. The bandwidth extension phase, using the personalized bandwidth extension model, will be activated when a higher band (e.g., wideband) is not available and the call is taking place on a lower band (e.g., narrowband).
WO 20211207131 A1 discloses an ear-wearable electronic device operable to apply a low-pass filter to a digitized voice signal to remove a high-frequency component and obtain a low-frequency component. Speech enhancement is applied to the low-frequency component. Blind bandwidth extension is applied to the enhanced low-frequency component to recover or synthesize an estimate of at least part of the high frequency component. An enhanced speech signal is output that is a combination of the enhanced low-frequency component and the bandwidth-extended high frequency component.
Larsen, Erik, Ronald M. Aarts, and Michael Danessis. “Efficient high-frequency bandwidth extension of music and speech.” Audio Engineering Society Convention 112. Audio Engineering Society, 2002, discloses an efficient algorithm for extending the bandwidth of an audio signal, with the goal to create a more natural sound. This is done by adding an extra octave at the high frequency part of the spectrum. The algorithm uses a non-linearity to generate the extended octave, and can be applied to music as well as speech. This also enables application to fixed or mobile communication systems.
However, even such a solution allows room for improvement in providing an optimal user experience.
Accordingly, there is a need for audio devices and associated methods with improved bandwidth extension.
According to a first aspect of the present disclosure there is provided a method for personalized bandwidth extension in an audio device, where the method comprises:
Hence, the proposed method bandwidth extends an audio signal with the user of the audio device in mind. Such a solution is more personalized, catering to the person who needs to listen to the audio signal, and thus allows for optimizing the perceived sound quality with regard to the user of the audio device. Furthermore, such a solution may also optimize the use of processing power, as processing power is not wasted on generating information that is perceptually irrelevant for the user.
In an embodiment, the audio device is configured to be worn by a user. The audio device may be arranged at the user's ear, on the user's ear, over the user's ear, in the user's ear, in the user's ear canal, behind the user's ear and/or in the user's concha, i.e., the audio device is configured to be worn in, on, over and/or at the user's ear. The user may wear two audio devices, one audio device at each ear. The two audio devices may be connected, such as wirelessly connected and/or connected by wires, such as a binaural hearing aid system.
The audio device may be a hearable such as a headset, headphone, earphone, earbud, hearing aid, a personal sound amplification product (PSAP), an over-the-counter (OTC) audio device, a hearing protection device, a one-size-fits-all audio device, a custom audio device or another head-wearable audio device. The audio device may be a speakerphone or a soundbar. Audio devices can include both prescription devices and non-prescription devices.
The audio device may be embodied in various housing styles or form factors.
Some of these form factors are earbuds, on-the-ear headphones or over-the-ear headphones. The person skilled in the art is aware of different kinds of audio devices and of different options for arranging the audio device in, on, over and/or at the ear of the audio device wearer. The audio device (or pair of audio devices) may be custom fitted, standard fitted, open fitted and/or occlusive fitted.
In an embodiment, the audio device may comprise one or more input transducers. The one or more input transducers may comprise one or more microphones. The one or more input transducers may comprise one or more vibration sensors configured for detecting bone vibration. The one or more input transducer(s) may be configured for converting an acoustic signal into a first electric input signal. The first electric input signal may be an analogue signal. The first electric input signal may be a digital signal. The one or more input transducer(s) may be coupled to one or more analogue-to-digital converter(s) configured for converting the analogue first input signal into a digital first input signal.
In an embodiment, the audio device may comprise one or more antenna(s) configured for wireless communication. The one or more antenna(s) may comprise an electric antenna. The electric antenna may be configured for wireless communication at a first frequency. The first frequency may be above 800 MHz, preferably between 900 MHz and 6 GHz. The first frequency may be 902 MHz to 928 MHz. The first frequency may be 2.4 to 2.5 GHz. The first frequency may be 5.725 GHz to 5.875 GHz. The one or more antenna(s) may comprise a magnetic antenna. The magnetic antenna may comprise a magnetic core. The magnetic antenna may comprise a coil. The coil may be coiled around the magnetic core. The magnetic antenna may be configured for wireless communication at a second frequency. The second frequency may be below 100 MHz. The second frequency may be between 9 MHz and 15 MHz.
In an embodiment, the audio device may comprise one or more wireless communication unit(s). The one or more wireless communication unit(s) may comprise one or more wireless receiver(s), one or more wireless transmitter(s), one or more transmitter-receiver pair(s) and/or one or more transceiver(s). At least one of the one or more wireless communication unit(s) may be coupled to the one or more antenna(s). The wireless communication unit may be configured for converting a wireless signal received by at least one of the one or more antenna(s) into a second electric input signal. The audio device may be configured for wired/wireless audio communication, e.g., enabling the user to listen to media, such as music or radio and/or enabling the user to perform phone calls.
In an embodiment, the wireless signal may originate from one or more external source(s) and/or external devices, such as spouse microphone device(s), wireless audio transmitter(s), smart computer(s) and/or distributed microphone array(s) associated with a wireless transmitter. The wireless input signal(s) may originate from another audio device, e.g., as part of a binaural hearing system, and/or from one or more accessory device(s), such as a smartphone and/or a smart watch.
In an embodiment, the audio device may include a processing unit. The processing unit may be configured for processing the first and/or second electric input signal(s). The processing may comprise compensating for a hearing loss of the user, i.e., applying frequency dependent gain to input signals in accordance with the user's frequency dependent hearing impairment. The processing may comprise performing feedback cancellation, echo cancellation, beamforming, tinnitus reduction/masking, noise reduction, noise cancellation, speech recognition, bass adjustment, treble adjustment and/or processing of user input.
The processing unit may be a processor, an integrated circuit, an application, functional module, etc. The processing unit may be implemented in a signal-processing chip or a printed circuit board (PCB). The processing unit may be configured to provide a first electric output signal based on the processing of the first and/or second electric input signal(s). The processing unit may be configured to provide a second electric output signal. The second electric output signal may be based on the processing of the first and/or second electric input signal(s).
In an embodiment, the audio device may comprise an output transducer. The output transducer may be coupled to the processing unit. The output transducer may be a loudspeaker. The output transducer may be configured for converting the first electric output signal into an acoustic output signal. The output transducer may be coupled to the processing unit via the magnetic antenna.
In an embodiment, the wireless communication unit may be configured for converting the second electric output signal into a wireless output signal. The wireless output signal may comprise synchronization data. The wireless communication unit may be configured for transmitting the wireless output signal via at least one of the one or more antennas.
In an embodiment, the audio device may comprise a digital-to-analogue converter configured to convert the first electric output signal, the second electric output signal and/or the wireless output signal into an analogue signal.
In an embodiment, the audio device may comprise a vent. A vent is a physical passageway such as a canal or tube primarily placed to offer pressure equalization across a housing placed in the ear such as an ITE audio device, an ITE unit of a BTE audio device, a CIC audio device, a RIE audio device, a RIC audio device, a MaRIE audio device or a dome tip/earmold. The vent may be a pressure vent with a small cross section area, which is preferably acoustically sealed. The vent may be an acoustic vent configured for occlusion cancellation. The vent may be an active vent enabling opening or closing of the vent during use of the audio device. The active vent may comprise a valve.
In an embodiment, the audio device may comprise a power source. The power source may comprise a battery providing a first voltage. The battery may be a rechargeable battery. The battery may be a replaceable battery. The power source may comprise a power management unit. The power management unit may be configured to convert the first voltage into a second voltage. The power source may comprise a charging coil. The charging coil may be provided by the magnetic antenna.
In an embodiment, the audio device may comprise a memory, including volatile and non-volatile forms of memory.
The audio device may be configured for audio communication, e.g., enabling the user to listen to media, such as music or radio, and/or enabling the user to perform phone calls.
The audio device may comprise one or more antennas for radio frequency communication. The one or more antennas may be configured for operation in an ISM frequency band. One of the one or more antennas may be an electric antenna. One of the one or more antennas may be a magnetic induction coil antenna. Magnetic induction, or near-field magnetic induction (NFMI), typically provides communication, including transmission of voice, audio, and data, in a range of frequencies between 2 MHz and 15 MHz. At these frequencies, the electromagnetic radiation propagates through and around the human head and body without significant losses in the tissue.
The magnetic induction coil may be configured to operate at a frequency below 100 MHz, such as below 30 MHz, such as below 15 MHz, during use. The magnetic induction coil may be configured to operate at a frequency range between 1 MHz and 100 MHz, such as between 1 MHz and 15 MHz, such as between 1 MHz and 30 MHz, such as between 5 MHz and 30 MHz, such as between 5 MHz and 15 MHz, such as between 10 MHz and 11 MHz, such as between 10.2 MHz and 11 MHz. The frequency may further include a range from 2 MHz to 30 MHz, such as from 2 MHz to 10 MHz, such as from 5 MHz to 10 MHz, such as from 5 MHz to 7 MHz.
The electric antenna may be configured for operation at a frequency of at least 400 MHz, such as of at least 800 MHz, such as of at least 1 GHz, such as at a frequency between 1.5 GHz and 6 GHz, such as at a frequency between 1.5 GHz and 3 GHz, such as at a frequency of 2.4 GHz. The antenna may be optimized for operation at a frequency of between 400 MHz and 6 GHz, such as between 400 MHz and 1 GHz, between 800 MHz and 1 GHz, between 800 MHz and 6 GHz, between 800 MHz and 3 GHz, etc. Thus, the electric antenna may be configured for operation in an ISM frequency band. The electric antenna may be any antenna capable of operating at these frequencies, and the electric antenna may be a resonant antenna, such as a monopole antenna, such as a dipole antenna, etc. The resonant antenna may have a length of λ/4±10% or any multiple thereof, λ being the wavelength corresponding to the emitted electromagnetic field.
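As a worked illustration of the λ/4±10% length, the following minimal Python sketch computes the corresponding length range for a given operating frequency. It assumes the free-space wavelength (it ignores any dielectric loading by the housing or the body), and the function name is hypothetical.

```python
# Speed of light in vacuum, in metres per second.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def quarter_wave_length_range_m(frequency_hz, tolerance=0.10):
    """Return (min, nominal, max) length in metres for a lambda/4 +/- tolerance antenna.

    Assumption: free-space wavelength, i.e. lambda = c / f.
    """
    wavelength_m = SPEED_OF_LIGHT_M_PER_S / frequency_hz
    nominal = wavelength_m / 4.0
    return nominal * (1.0 - tolerance), nominal, nominal * (1.0 + tolerance)


low, nominal, high = quarter_wave_length_range_m(2.4e9)
print(f"{low * 1000:.1f} mm to {high * 1000:.1f} mm (nominal {nominal * 1000:.1f} mm)")
```

At 2.4 GHz the nominal quarter wavelength is roughly 31 mm under this free-space assumption.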
In the context of the present disclosure, the term personalized or personalizing is to be construed as something being done to cater to the user using the audio device, e.g., a user wearing a headset where audio being played through the headset is processed based on one or more characteristics of the user wearing the headset. A personalized bandwidth extension model may, for example, have defined an upper and/or lower perceivable threshold for the user, i.e., a threshold frequency beyond which the user is not able to perceive sound. Such thresholds may then define the extent to which bandwidth extension is performed. For example, if the user cannot perceive frequencies above 14 kHz, there is no reason to bandwidth extend an incoming signal to 20 kHz, and the personalized bandwidth extension model may therefore be limited to 14 kHz.
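The limiting of the extension range by the user's perceivable thresholds may be sketched as follows. This is a minimal illustration only; the function name and the representation of a band as a (low, high) pair in Hz are assumptions made for the example.

```python
def personalize_target_band(target_band_hz, audible_band_hz):
    """Clip a target band (low, high) to the user's audible band (low, high), in Hz."""
    low = max(target_band_hz[0], audible_band_hz[0])
    high = min(target_band_hz[1], audible_band_hz[1])
    if low >= high:
        raise ValueError("target band and audible band do not overlap")
    return (low, high)


# A full band target of 50 Hz to 20 kHz for a user who perceives 20 Hz to 14 kHz.
print(personalize_target_band((50.0, 20_000.0), (20.0, 14_000.0)))  # (50.0, 14000.0)
```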
The input microphone signal may be obtained in a plurality of manners. The input microphone signal may be received from a far-end station. The input microphone signal may be retrieved from a local storage on the audio device.
The input microphone signal may be an audio signal recorded at a far-end station. The input microphone signal may be a TX signal recorded at another audio device, and subsequently transmitted to the audio device. The input microphone signal may be a media signal. A media signal may be a signal representative of a song or audio of a movie. The input microphone signal may be a voice signal recorded during a phone call or another communication session between two or more parties. The input microphone signal may be a pre-recorded signal. The input microphone signal may be a signal obtained in real-time, e.g., the input microphone signal being part of an on-going phone conversation.
The input microphone signal having a first bandwidth is to be interpreted as the input microphone signal being fully or at least mostly represented within the first bandwidth, e.g., all user relevant audio content of the signal being present within the first bandwidth.
The first bandwidth may be a frequency range within which the input microphone signal is represented. The first bandwidth may be a narrow band, hence the input microphone signal being a narrow band signal. The first bandwidth may be a bandwidth of 300 Hz to 3.4 kHz; such a bandwidth is supported by several communication standards. The first bandwidth may be a bandwidth of 50 Hz to 7 kHz, also known as wideband. The first bandwidth may be a bandwidth of 50 Hz to 14 kHz, also known as super wideband. The first bandwidth may be a bandwidth of 50 Hz to 20 kHz, also known as full band. The first bandwidth may comprise a plurality of bandwidth ranges, e.g., the first bandwidth may comprise two bandwidth ranges, 50 Hz to 1 kHz and 2 kHz to 7 kHz.
The second bandwidth may be a broader bandwidth than the first bandwidth. The second bandwidth may be a narrower bandwidth than the first bandwidth. The second bandwidth may comprise a plurality of bandwidth ranges, e.g., if the user of the audio device has a notch hearing loss in the frequency range of 3 kHz to 6 kHz, the second bandwidth may then comprise two bandwidth ranges, from 50 Hz to 3 kHz and from 6 kHz to 7 kHz, thereby providing a personalized bandwidth based on the hearing loss of the user of the audio device. The second bandwidth may be a bandwidth optimized for the user of the audio device for the given input microphone signal, based on the first user parameter. The second bandwidth may be a bandwidth selected to optimize the audio quality for the user of the audio device, based on the first user parameter. A manner to optimize the audio quality is to optimize an audio quality parameter of the input microphone signal, such as a mean opinion score (MOS) or similar.
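The notch hearing loss example may be illustrated with the following minimal Python sketch, which splits a band around a notch to obtain the remaining bandwidth ranges. The function name and the (low, high) tuple representation in Hz are assumptions made for the example.

```python
def remove_notch(band_hz, notch_hz):
    """Return the parts of `band_hz` (low, high) that lie outside `notch_hz` (low, high)."""
    low, high = band_hz
    notch_low, notch_high = notch_hz
    ranges = []
    if notch_low > low:
        # Part of the band below the notch.
        ranges.append((low, min(notch_low, high)))
    if notch_high < high:
        # Part of the band above the notch.
        ranges.append((max(notch_high, low), high))
    # Drop empty ranges.
    return [r for r in ranges if r[0] < r[1]]


# Wideband (50 Hz to 7 kHz) with a notch hearing loss from 3 kHz to 6 kHz.
print(remove_notch((50, 7000), (3000, 6000)))  # [(50, 3000), (6000, 7000)]
```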
The first user parameter may be obtained by receiving one or more inputs from a user of the audio device. The first user parameter may be obtained by retrieving the first user parameter from a local storage on the audio device, such as a flash drive. The first user parameter may be obtained by retrieving the first user parameter from an online profile of the user, e.g., a user profile stored on a cloud.
The one or more characteristics of the user of the audio device may be related to the user's usage of the audio device, e.g., whether the user prefers a high gain on bass or treble. The one or more characteristics of the user may be related to the user themselves, e.g., a hearing loss, physiological data, a wear style of the audio device, or other.
The bandwidth extension model is a model configured for generating an output signal with a second bandwidth, based on the input microphone signal with the first bandwidth. The bandwidth extension model may generate the output signal by generating spectral content for the input microphone signal, e.g., adding spectral content to the received input microphone signal. The bandwidth extension model may generate the output signal by generating spectral content based on the input microphone signal, e.g., fully generating a new signal based on the input microphone signal. The bandwidth extension model used by the audio device is personalized, i.e., determined based on the user of the audio device. The bandwidth extension model may be configured to generate spectral content based on the input microphone signal. The bandwidth extension model may be configured to generate spectral content based on the first user parameter and the input microphone signal. The bandwidth extension model may be configured to generate spectral content to maximize perceptually relevant information (PRI), based on the first user parameter and the input microphone signal. PRI may for example be calculated based on the perceptual entropy, as outlined in J. D. Johnston, “Estimation of Perceptual Entropy Using Noise Masking Criteria,” Proc. Int. Conf. Audio Speech Signal Proc. (ICASSP), pp 2524-2527 (1988).
Thus, the bandwidth extension model may perform bandwidth extension to optimize the perceptual entropy of the input microphone signal for the user of the audio device. The bandwidth extension model may be configured to generate the output signal with a second bandwidth to thereby maximize perceptually relevant information (PRI) for the user of the audio device. The bandwidth extension model may be configured to generate spectral content based on the input microphone signal and the audible range and levels of the user of the audio device. The audible range may be defined as one or more frequency ranges within which the user of the audio device is able to perceive an audio signal being played back. As a standard, the audible range for a person with perfect hearing is generally defined as being from 20 Hz to 20 kHz; however, it has been found that there are large individual variations due to different hearing losses. The audible levels of the user of the audio device may be defined by masking thresholds within an audio signal, where the masking thresholds define masked and unmasked components within the audio signal. The audible levels may be defined within different frequency bins.
PRI and/or the audible range and levels for a user may be determined based on the first user parameter.
The bandwidth extension model may be determined by a mapping function, where the mapping function maps different first user parameters to different bandwidth extension models. The different bandwidth extension models may be pre-generated models. The mapping function may also take into consideration additional parameters, such as the first bandwidth of the input microphone signal. The bandwidth extension model may be determined/generated in real-time based on an obtained first user parameter. The bandwidth extension model may be stored locally on the audio device. The bandwidth extension model may be stored in a cloud location, where the audio device may retrieve the bandwidth extension model. A plurality of bandwidth extension models may be stored locally on the audio device or in a cloud location.
The output signal may be an audio signal to be played back to a user of the audio device. The output signal may be a signal subject to undergo further processing.
Generating the output signal may involve giving the input microphone signal as an input to the determined bandwidth extension model, where the output of the determined bandwidth extension model will be the output signal.
In an embodiment the first user parameter comprises physiological information regarding the user of the audio device, such as gender and/or age.
Several studies have shown that hearing loss is well correlated with physiological parameters, such as age and gender. Thus, by obtaining relatively simple information regarding a user of the audio device, a personalization of the bandwidth extension model may be performed based on such information. For example, based on the physiological information an estimation of the user's hearing profile may be made, which in turn may be used for determining the audible range and levels for the user and/or PRI. The audible levels may be determined based on the input microphone signal and the user's hearing profile. Physiological information regarding the user may be obtained by asking the user to input the information via an interface, such as a smart device communicatively connected to the audio device. The physiological information regarding the user may comprise demographic information.
In an embodiment the first user parameter comprises the result of a hearing test carried out on the user of the audio device.
Consequently, the bandwidth extension model may cater to the actual hearing profile of the user of the audio device. The result of the hearing test may for example be an audiogram.
The bandwidth extension model may be generated based on the hearing profile of the user of the audio device.
In an embodiment the step c. comprises:
The codebook may be stored locally or on a cloud storage. The codebook may be part of an audio codec used for transmitting the input microphone signal. The codebook stores a plurality of bandwidth extension models, each of which may be associated with one or more user parameters.
Comparing the first user parameter with the codebook may comprise comparing the first user parameter to the one or more user parameters associated with each bandwidth extension model, to thereby determine the one or more user parameters matching the most with the first user parameter, and subsequently selecting the bandwidth extension model associated with the one or more user parameters matching the most with the first user parameter.
The one or more user parameters may be physiological information, such as gender and/or age. The one or more user parameters may be hearing profiles, such as results of hearing tests, e.g., audiograms.
The plurality of bandwidth extension models comprised in the codebook may be predetermined bandwidth extension models, which have been generated based on the one or more user parameters. For example, one bandwidth extension model may be associated with being 30 years old. The associated bandwidth extension model may have been generated based on the average hearing profile of a person being 30 years old, e.g., by assessing the audible range and levels of a 30-year-old person.
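The comparison of the first user parameter with the codebook may be sketched as a nearest-match search. The following minimal Python illustration assumes that user parameters are represented as numeric tuples (here a single age value; an audiogram could be handled the same way as a longer tuple) and that the Euclidean distance is used as the matching criterion. The class name, model identifiers and distance measure are hypothetical choices for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodebookEntry:
    model_id: str      # hypothetical identifier of a pre-generated model
    parameters: tuple  # user parameters the model is associated with


def select_model(codebook, user_parameters):
    """Return the entry whose parameters best match (smallest Euclidean distance)."""
    return min(codebook, key=lambda entry: math.dist(entry.parameters, user_parameters))


# Hypothetical codebook with models generated for average 30-, 50- and 70-year-olds.
AGE_CODEBOOK = [
    CodebookEntry("bwe_age_30", (30.0,)),
    CodebookEntry("bwe_age_50", (50.0,)),
    CodebookEntry("bwe_age_70", (70.0,)),
]

print(select_model(AGE_CODEBOOK, (34.0,)).model_id)  # bwe_age_30
```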
In an embodiment the method comprises
The determined first bandwidth may be given to a mapping function together with the first user parameter. The mapping function may then map the determined first bandwidth and the first user parameter to a bandwidth extension model. Each pre-generated bandwidth extension model may be associated with a different bandwidth, e.g., different bandwidth extension models may be configured for performing bandwidth extension for different input bandwidths.
The first bandwidth may be determined by a bandwidth detector. Bandwidth detectors are known within the field of signal processing. For example, the EVS codec utilizes bandwidth detectors; further information may be found in M. Dietz et al., “Overview of the EVS codec architecture”, ICASSP 2015, pp. 5698-5702, and in “Audio Bandwidth Detection in EVS codec”, Symposium on 3GPP Enhanced Voice Services (GlobalSIP), 2015. Another example of a bandwidth detector can be found in the LC3 codec, cf., Digital Enhanced Cordless Telecommunications (DECT); Low Complexity Communication Codec plus (LC3plus), Technical Specification, ETSI TS 103 634, 2021.
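For illustration only, a very simple energy-based bandwidth detector may be sketched as below. This is not the detector used in the EVS or LC3plus codecs; it is a hypothetical stand-in that returns the frequency below which a chosen fraction of the frame energy lies, using a naive DFT so that the example needs only the Python standard library.

```python
import cmath
import math


def detect_upper_band_edge_hz(samples, sample_rate_hz, energy_fraction=0.99):
    """Return the lowest bin frequency below which `energy_fraction` of the energy lies."""
    n = len(samples)
    half = n // 2
    # Naive DFT power for bins 0..half (O(n^2), acceptable for one short frame).
    power = []
    for k in range(half + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0  # silent frame: no bandwidth detected
    cumulative = 0.0
    for k, p in enumerate(power):
        cumulative += p
        if cumulative >= energy_fraction * total:
            return k * sample_rate_hz / n
    return half * sample_rate_hz / n


# A 10 ms frame at 16 kHz containing tones at 1 kHz and 3 kHz.
fs = 16_000
frame = [math.sin(2 * math.pi * 1000 * t / fs) + math.sin(2 * math.pi * 3000 * t / fs)
         for t in range(160)]
print(detect_upper_band_edge_hz(frame, fs))
```

A practical detector would typically smooth the decision over several frames and map the result to one of the standard bandwidth classes.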
The determined first bandwidth may also be compared to a codebook comprising a plurality of bandwidth extension models, wherein the plurality of bandwidth extension models are grouped according to different bandwidths. The selection may then be made by comparing the determined first bandwidth to the different groups of bandwidth extension models.
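A codebook grouped by bandwidth may be sketched as a two-level lookup: first the group whose bandwidth is closest to the determined first bandwidth is chosen, and then the model within that group that best matches the first user parameter. The dictionary layout, the keys (upper band edge in Hz and age in years) and the model identifiers below are hypothetical.

```python
# Hypothetical codebook: upper band edge in Hz -> {age in years -> model identifier}.
GROUPED_CODEBOOK = {
    3400: {30: "nb_age_30", 60: "nb_age_60"},
    7000: {30: "wb_age_30", 60: "wb_age_60"},
}


def select_grouped_model(codebook, detected_upper_hz, user_age_years):
    """Pick the closest bandwidth group, then the closest age within that group."""
    group_key = min(codebook, key=lambda hz: abs(hz - detected_upper_hz))
    group = codebook[group_key]
    age_key = min(group, key=lambda age: abs(age - user_age_years))
    return group[age_key]


print(select_grouped_model(GROUPED_CODEBOOK, 3500, 55))  # nb_age_60
```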
In an embodiment the bandwidth extension model defines a target bandwidth, and wherein the step d. comprises:
The target bandwidth may be determined based on an audible frequency range for the user of the audio device.
The neural network may be a general regression neural network (GRNN), a generative adversarial network (GAN), a convolutional neural network (CNN), etc.
The neural network may be trained to bandwidth extend an input microphone signal with a first bandwidth to a second bandwidth to maximize the amount of perceptually relevant information for the user of the audio device. The neural network and training of the neural network will be explained further in-depth in relation to the second aspect and the detailed description of the present disclosure.
In an embodiment the first user parameter is stored on a local storage of the audio device, and wherein the step b. comprises:
The user of the audio device may have a profile stored on the audio device. As part of creating the profile, the user of the audio device may associate one or more first user parameters with the profile. Hence, when the user initiates the audio device, the user may select their profile to thereby allow for personalized signal processing based on the selected profile.
In an embodiment the step a. comprises:
The input microphone signal may be encoded to optimize the usage of a bandwidth over a communication channel. The input microphone signal may be encoded in accordance with one or more audio codecs, e.g., MPEG-4 Audio, or Enhanced Voice Service (EVS).
In an embodiment the method comprises:
During the establishment of the communication connection with the far-end station, a handshake procedure may be undertaken where information is exchanged between the near-end station and the far-end station to configure the communication channel. As part of the information exchange, the first user parameter may be transmitted to the far-end station, thus allowing the far-end station to encode a transmitted signal with the first user parameter. When the first user parameter is encoded with the transmitted signal, a decoder at the near-end side may utilize the first user parameter without having to receive the first user parameter from another source, such as a local storage or a cloud location.
According to a second aspect of the present disclosure, there is provided a computer-implemented method for training a bandwidth extension model for personalized bandwidth extension, wherein the method comprises:
The one or more first audio signals may be bandlimited audio data. The one or more first audio signals may be audio signals which have been recorded in full band and subsequently been artificially bandlimited. The one or more first audio signals may be generated/recorded at different bandwidths, e.g., narrowband 4 kHz, wideband 8 kHz, super-wideband 12 kHz, or full band kHz. The one or more first audio signals may have undergone different kinds of augmentation, such as adding one or more of the following: noise, room reverberation, simulated packet loss, or jammer speech.
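The artificial bandlimiting of full band recordings may be sketched as a low-pass filtering step that produces an (input, target) training pair. The following minimal Python illustration uses a windowed-sinc FIR filter; the filter design, tap count and function names are assumptions made for the example, and a practical pipeline would typically also resample and apply the augmentations mentioned above.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Windowed-sinc (Hamming) low-pass FIR coefficients, normalised to unity DC gain."""
    fc = cutoff_hz / sample_rate_hz  # normalised cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        x = i - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def bandlimit(samples, cutoff_hz, sample_rate_hz):
    """Low-pass filter `samples` by direct convolution (zero initial state)."""
    taps = lowpass_fir(cutoff_hz, sample_rate_hz)
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def make_training_pair(full_band, cutoff_hz, sample_rate_hz):
    """Return (bandlimited input, full band target) for training."""
    return bandlimit(full_band, cutoff_hz, sample_rate_hz), list(full_band)
```

The bandlimited signal is delayed by half the filter length relative to the target, so a practical pipeline would compensate for this delay before computing a loss.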
The user hearing profile in the hearing dataset may be associated with physiological information, such as age or gender. The user hearing profile in the hearing dataset may be a hearing profile of the user of the audio device. The user hearing profile may be determined based on one or more tests carried out on the user of the audio device. The user hearing profile may be a generalized hearing profile associated with a certain age and/or gender. The hearing dataset may comprise one or more user profiles.
The perceptual loss may be determined in a plethora of manners. The perceptual loss may be understood as a loss function determining a perceptual loss. For example, the perceptual loss may be determined to maximize PRI. In the case of maximizing PRI, the bandwidth extension model would be trained to generate spectral content to maximize the PRI measure. The PRI would be calculated based on the user hearing profile. Perceptual loss may be a perceptual loss function which promotes training of the model which results in increased PRI and punishes training resulting in lowering of the PRI.
In another approach, a masking threshold and a personalized bandwidth are determined based on the hearing dataset. The masking threshold and the personalized bandwidth may be used to determine the audible range and levels associated with the hearing dataset, where the personalized bandwidth may be determined as the audible range based on the user hearing profile, and the audible levels may be determined as masked or unmasked components based on the user hearing profile. The audible range and levels may be used in determining masked and unmasked components of the generated plurality of bandwidth extended audio signals. The perceptual loss may then be determined so as to train the bandwidth extension model to generate spectral content which is audible within the audible range.
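The determination of masked and unmasked components within the audible range may be sketched per frequency bin as follows. This minimal Python illustration assumes that per-bin signal levels and masking thresholds (both in dB) are already available, and that the audible range is a single (low, high) pair in Hz; the function name is hypothetical.

```python
def unmasked_in_range(levels_db, masking_threshold_db, bin_freqs_hz, audible_range_hz):
    """Return, per bin, True if the component is inside the audible range and unmasked."""
    low_hz, high_hz = audible_range_hz
    return [
        low_hz <= freq <= high_hz and level > mask
        for level, mask, freq in zip(levels_db, masking_threshold_db, bin_freqs_hz)
    ]


# Three bins: audible and unmasked, masked, and outside the audible range.
print(unmasked_in_range([60, 30, 60], [40, 40, 40], [1000, 2000, 16000], (50, 14000)))
# [True, False, False]
```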
In the literature, different loss functions have been proposed to take psychoacoustic aspects into account. An example of such a loss function can be found in Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack and Minje Kim, “Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding,” in IEEE Signal Processing Letters, vol. 27, pp. 2159-2163, 2020. The article proposes a perceptual weight vector in the loss function. In the proposed loss function (denoted by L), the perceptual weight vector (w) is defined based on the signal power spectral density (p) and the masked threshold (m) derived from psychoacoustic models. The proposed loss function is as follows:
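A form consistent with the symbol definitions given for this loss is shown here as an assumption rather than a quotation. It presumes a perceptually weighted squared error between the spectral magnitudes, accumulated over all frames of the two time-frequency spectra:

$$
L(X, \hat{X}) = \sum_{f} w_f \left( x_f - \hat{x}_f \right)^2
$$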
where $f$ is the frequency index, $x_f$ and $\hat{x}_f$ are the $f$-th spectral magnitude components obtained from the spectral analysis of the input and output of the neural network, respectively, $X$ and $\hat{X}$ are the target clean time-frequency spectrum and the time-frequency spectrum estimated by the neural network, respectively, and $w$ denotes the perceptual weight vector, which is derived from $p$ and $m$ as follows:
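Assuming a logarithmic signal-to-mask ratio with $p$ and $m$ both expressed in dB, which is a presumed form rather than a quotation, the weight vector may be written as:

$$
w = \log_{10}\!\left( \frac{10^{0.1\,p}}{10^{0.1\,m}} + 1 \right)
$$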
It follows from the definition of w that, if the signal's power is larger than the masked threshold (p > m), the model is pushed to recover this audible component.
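The effect of the perceptual weight can be illustrated numerically. The sketch below assumes the weight takes the form w = log10(10^(0.1 p) / 10^(0.1 m) + 1) with p and m in dB, and that the loss is a weighted squared error over spectral magnitudes; both are assumptions made for illustration, and the function names and values are hypothetical.

```python
import numpy as np

def perceptual_weights(psd_db, mask_db):
    """Assumed weight: log10(10^(0.1 p) / 10^(0.1 m) + 1), with p and m in dB."""
    # 10^(0.1 p) / 10^(0.1 m) equals 10^(0.1 (p - m)).
    return np.log10(10.0 ** (0.1 * (psd_db - mask_db)) + 1.0)

def weighted_spectral_loss(target_mag, estimate_mag, weights):
    """Sum over frequency bins of w_f * (x_f - x_hat_f)^2."""
    return float(np.sum(weights * (target_mag - estimate_mag) ** 2))

# Two hypothetical bins: the first is audible (p > m), the second is masked (p < m).
psd_db = np.array([60.0, 30.0])
mask_db = np.array([40.0, 50.0])
weights = perceptual_weights(psd_db, mask_db)

# The same magnitude error in each bin contributes very differently to the loss.
target = np.array([1.0, 1.0])
error_in_audible_bin = weighted_spectral_loss(target, np.array([0.0, 1.0]), weights)
error_in_masked_bin = weighted_spectral_loss(target, np.array([1.0, 0.0]), weights)
```

Under these assumptions an error in the audible bin is penalized several hundred times more heavily than the same error in the masked bin, which is the behavior the paragraph above describes.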
The above is one manner of determining a perceptual loss. However, the perceptual loss may alternatively be determined by a perceptual loss function which rewards training of the bandwidth extension model that results in increased unmasked components and penalizes training that results in increased masked components.
The perceptual loss may be determined by a plurality of different functions, such as linear, non-linear, log, piecewise, or exponential functions.
For the present invention, the loss function may in one embodiment be applied only within the audible range determined from the user hearing profile. Furthermore, the masking may be determined from the user hearing profile, thereby personalizing the loss function based on the user hearing profile. Frequencies generated by the model outside the audible range determined from the user hearing profile may be discarded as irrelevant, and/or the model may be trained with a penalty on the generation of frequencies outside the audible range.
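A minimal sketch of such a personalized loss is given below. It assumes a weighted squared error inside the audible range and a simple energy penalty on generated content outside it; the function name `personalized_loss`, the `out_of_range_penalty` parameter, and the example values are hypothetical. Setting the penalty to zero corresponds to discarding out-of-range content as irrelevant.

```python
import numpy as np

def personalized_loss(freqs_hz, target_mag, estimate_mag, weights,
                      audible_range_hz, out_of_range_penalty=1.0):
    """Weighted error inside the audible range plus a penalty on energy outside it."""
    low_hz, high_hz = audible_range_hz
    in_range = (freqs_hz >= low_hz) & (freqs_hz <= high_hz)
    # Perceptually weighted squared error, evaluated only on audible bins.
    in_band = np.sum(weights[in_range]
                     * (target_mag[in_range] - estimate_mag[in_range]) ** 2)
    # Energy the model generated where the user cannot hear it.
    out_band = np.sum(estimate_mag[~in_range] ** 2)
    return float(in_band + out_of_range_penalty * out_band)

# Hypothetical three-bin example with an audible range of 20 Hz to 8 kHz.
freqs = np.array([1000.0, 4000.0, 10000.0])
target = np.array([1.0, 0.5, 0.2])
estimate = np.array([0.8, 0.5, 0.3])
weights = np.array([2.0, 1.0, 1.0])

loss_with_penalty = personalized_loss(freqs, target, estimate, weights, (20.0, 8000.0))
loss_discarding = personalized_loss(freqs, target, estimate, weights, (20.0, 8000.0),
                                    out_of_range_penalty=0.0)
```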
Training of the bandwidth extension model may be carried out by modifying one or more parameters of the bandwidth extension model to optimize the perceptual loss, e.g., by minimizing or maximizing a loss function representing the perceptual loss, depending on how that function is defined. In the case of the bandwidth extension model comprising a neural network, training may be performed by backpropagation, such as with stochastic gradient descent aimed at minimizing or maximizing the loss function. Such backpropagation will result in a set of trained weights in the neural network. The neural network could be a regression network or a generative network.
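The parameter update loop can be sketched as follows. For brevity, a single linear layer stands in for the neural network and the gradient is written out by hand instead of using an automatic differentiation framework; the data, the perceptual weights, and the learning rate are hypothetical, and mini-batches are visited in a fixed order.

```python
import numpy as np

rng = np.random.default_rng(0)
n_low, n_high, n_examples = 8, 4, 256

# Hypothetical training data: low-band inputs and the high-band targets to regenerate.
true_map = rng.normal(size=(n_high, n_low))
low_band = rng.normal(size=(n_examples, n_low))
high_band = low_band @ true_map.T

weights = np.array([2.0, 1.5, 1.0, 0.1])  # hypothetical perceptual weight per output bin
params = np.zeros((n_high, n_low))         # model parameters to be trained
learning_rate = 0.01

def perceptual_loss(params, inputs, targets):
    """Mean over examples of the perceptually weighted squared error."""
    error = targets - inputs @ params.T
    return float(np.mean(np.sum(weights * error ** 2, axis=1)))

initial_loss = perceptual_loss(params, low_band, high_band)
for epoch in range(50):
    for start in range(0, n_examples, 32):
        x = low_band[start:start + 32]
        y = high_band[start:start + 32]
        error = y - x @ params.T  # shape (batch, n_high)
        # Gradient of the weighted squared error with respect to the parameters.
        grad = -2.0 * (weights * error).T @ x / len(x)
        params -= learning_rate * grad
final_loss = perceptual_loss(params, low_band, high_band)
```

In this sketch the bins with larger perceptual weights converge faster, since their errors dominate the gradient, while the bin with the smallest weight is fitted more slowly.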
In a third aspect of the invention there is provided an audio device for personalized bandwidth extension, the audio device comprising a processor, and a memory storing instructions which, when executed by the processor, cause the processor to:
The above and other features and advantages of the present invention will become readily apparent to those skilled in the art by the following detailed description of example embodiments thereof with reference to the attached drawings, in which:
Various example embodiments and details are described hereinafter, with reference to the figures when relevant. It should be noted that the figures may or may not be drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the invention or as a limitation on the scope of the invention. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiment even if not so illustrated, or if not so explicitly described.
It is to be noted that the word “comprising” does not necessarily exclude the presence of other elements or steps than those listed.
It is to be noted that the words “a” or “an” preceding an element do not exclude the presence of a plurality of such elements.
It should further be noted that any reference signs do not limit the scope of the claims, that the example embodiments may be implemented at least in part by means of both hardware and software, and that several “means”, “units” or “devices” may be represented by the same item of hardware.
The various example methods, devices, and systems described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform specified tasks or implement specific abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
Although features have been shown and described, it will be understood that they are not intended to limit the claimed invention, and it will be obvious to those skilled in the art that various changes and modifications may be made without departing from the scope of the claimed invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense. The claimed invention is intended to cover all alternatives, modifications, and equivalents.
Number | Date | Country | Kind |
---|---|---|---|
22182783.5 | Jul 2022 | EP | regional |